Charan, Thank you for sharing. I just took a quick glance. I have a few high-level questions:
- Can you benchmark this change when you have time? It is interesting to see: (1) current logic vs new logic with 1 slot (2) current logic vs new logic with N slot repeat the above tests for - both interleaved ledger storage and sorted ledger storage. - increasing number of ledgers from 1, 10, 1000, 10000 to 100000. - Your approach might be helpful for using interleaved ledger storage directly, but it doesn't really help sorted ledger storage or any ledger storage having a sorted buffer over interleaved ledger storage. because at any given time, there is only one flush thread writing to entrylogger, especially if you have small number of ledgers, when flushing the memtable, it is effectively only one entry log file is written at that given time. - If this is done using a composite ledger store approach, it will benefit the users of sorted ledger storage because you will have multiple flush threads that write to different entry log files concurrently. Because you already have most of parts addressed in the multi-entrylogger-files approach, I would like to see how's the performance in this approach. Other comments inline: On Wed, Jul 19, 2017 at 7:31 AM, Charan Reddy G <reddychara...@gmail.com> wrote: > *"I am wondering if your multiple-entrylogs approach is making things > complicated. I have been thinking there can be a simpler approach achieving > the same goal: for example, having a ledger storage comprised of N > interleaved/sorted ledger storages, which they share same LedgerCache, but > having different memtables (for sortedledgerstore) and different entry log > files. "* > > > First of all I'm not convinced with the statement that dealing > MultipleEntryLogs at EntryLogger is more complicated. As far as I > understood, dealing with composition (containing N) at LedgerStorage level > is even more complicated than dealing at EntryLogger > > - The crux of the complication in the approach described in > https://issues.apache.org/jira/browse/BOOKKEEPER-1041 is maintenance of > state information (mapping of ledgerid to slotid) and in the event of > ledgerdir becoming full updating that mapping to available slot. The same > amount of complication will be applicable even in the case of composition > at LedgerStorage level. When we get addEntry at ComposedLedgerStorage there > should be logic to which LedgerStorage it should goto. So with this fact I > can say that composing at LedgerStorage level it will be atleast as > complicated as composing at EntryLogger level. > > - with the ComposedLedgerStorage approach (Composite Pattern), > ComposedLedgerStorage managing N > InterleavedLedgerStorage/SortedLedgerStorages, > there is going to be considerable change in the way the resources and state > information are handled. Each Interleaved/SortedLedgerStorage will have > its own EntryLogger (to serve our MultipleEntryLogs feature), but rest all > needs to be shared, mainly 'LedgerCache', 'gcThread' and 'activeLedgers'. > So there is going to be major churn in the code for moving the operations > dealing with shared resources and state to ComposedLedgerStorage and leave > the rest in InterleavedLedgerStorage. Amount of changes required in > testcode to deal with these changes is even more. We have to do this while > providing backward compatibility (Single EntryLog). Now this should make > one question where should the composition happen. > A very dump implementation to start - you might be just implementing a ledger storage with N ledger storages, each ledger storage is a subdirectory in the ledger directories. So at least you can compare this approach with your multi-entry-loggers approach to see which one can gain better performance. Because I remember the goal for multi-entry-loggers is for latency and performance. We need to compare to show the evidence which approach is better. > As far as I can say, it is supposed to happen at the lowest possible level > rather than at higher level, which needs extra band-aid efforts to deal > separately for the common resources/state and multiplexable resources. > I think this is arguable. if you stripe the entries across multiple entry log files, doing this at entry log file level will give you higher bandwidth at writes. but your data locality will be poor. if you don't stripe the entries (that means entries for a ledger going to one entry log files), doing this at entry log file level doesn't help ledger storages that have high level buffer and caches. > > - with composition at EntryLogger level, currently in my implementation > getEntry/readEntry path is mostly unchanged. but with composition at > LedgerStorage even for readEntry/getEntry the multiplexing/redirection > needs to happen. > > - and mainly LedgerStorage is not the only consumer of EntryLogger, but > also GarbageCollectorThread. It calls quite a few methods of EntryLogger - > (entryLogger.addEntry, entryLogger.flush, entryLogger.removeEntryLog, > entryLogger.scanEntryLog, entryLogger.getLeastUnflushedLogId, > entryLogger.logExists, entryLogger.getEntryLogMetadata). It is going to > be an issue with ComposedLedgerStorage, because with that EntryLogger is > going to be responsible for just one EntryLog. For GarbageCollectorThread > to work correctly with multiple EntryLogs, composition is required here as > well. Which is double whammy when it comes down to implementation. > I think this depends on how do you do to compose ledger storages. GCThread is treated as part of ledger storage. If composition happens out side of ledger storage, the complexity of ledger storage is hidden. If we do composition out side of LedgerStorage, we can have another benefit - we can have a ledger storage that is composed with different ledger storage implementation. then we can some placement strategy to place different ledgers (for different workloads) in different ledger storages. > > - I remember Sijie mentioning that there is WI to improve > GarbageCollectorThread compaction logic, to segregate the entries of > ledgers based on lifespan while compacting entrylogs. > Yiming already did the change. I need to check if it is available for porting it back to the community. > With the correct implementation of MultipleEntryLogs, the > logic/implementation of it shouldn't be affected, but with > ComposedLedgerStorage it will have the same issues as I mentioned in the > above point. > I don't see it will be problem if you compose at the right level. > > > Being said that, one difference what ComposedLedgerStorage could make is > each SortedLedgerStorage will have its own EntryMemtable. With the current > implementation of MultipleEntryLogs, it is going to be just one MemTable. > Now the question is how is going to be beneficial to have multiple > EntryMemTables? > > Andrey mentioned that current EntrymemTable's snapshot flush processes > entries sequentially, but this can be parallelized by having > ThreadPoolExecutor for processing the entries > if you want to paralleize the flushing, isn't having multiple memtables a better approach? less synchronization to worry about. > and also Sijie confirmed that synchrnoization is not needed for > InterleavedLedgerStorage.processEntry and InterleavedLedgerStorage.addEntry. > Size of the EntryMemTable is already configurable. > > And mainly with MultipleEntryLogs there is not much need of > SortedLedgerStorage, because entries of different ledgers are inherently > segregated and there wont be much Interleaving even if we use > InterleavedLedgerStorage. > It depends on the workload. The existence of sorted ledger storage is also for I/O isolation. If you just use interleaved ledger storage, you are facing the effects taking from filesystem page cache. > Also, InterleavedLedgerStorage helps in overcoming unpredictable tail > latency caused by SortedLedgerStorage' EntryMemTable flush. > I think the argument about InterleavedLedgerStorage with SortedLedgerStorage has brought this topic to a different direction. The assumptions that you are making here is getting rid of SortedLedgerStorage, it might be true for your workload but it won't be true for other people's use case. I would like to see a solution that takes more considerations here. Also I am not convinced the InterleavedLedgerStorage helps in overcoming unpredictable tail latency. Because in theory, there are buffers in both interleaved ledger storage and sorted ledger storage. The unpredictable tail latency comes from the mismatch between your write rate and flush rate, it can also happen in interleaved ledger storage. If you lower the memtable size in sorted ledger storage, you will get better predictable tail latency because you are flushing more frequent and less data to write when flushing. The composition approach is effectively having more smaller buffers, which it also help overcome the tail latency. > > @sijie > Code is almost ready for pull request, I just need to do some final > touches, in the meantime you may check the code at - https://github.com/ > reddycharan/bookkeeper/tree/multipleentrylogs . It is rebased to current > code. It should be good. > I will take closer look at the code. But I would also like to see any benchmark numbers (it can be just benchmarking ledger storage itself). > > Thanks, > Charan > > On Mon, Jul 17, 2017 at 2:25 PM, Venkateswara Rao Jujjuri < > jujj...@gmail.com> wrote: > >> >> >> On Fri, Jul 14, 2017 at 6:00 PM, Sijie Guo <guosi...@gmail.com> wrote: >> >>> >>> >>> On Sat, Jul 15, 2017 at 8:06 AM, Charan Reddy G <reddychara...@gmail.com >>> > wrote: >>> >>>> Hey, >>>> >>>> In InterleavedLedgerStorage, since the initial version of it ( >>>> https://github.com/apache/bookkeeper/commit/4a94ce1d8184f5f >>>> 38def015d80777a8113b96690 and https://github.com/apache/book >>>> keeper/commit/d175ada58dcaf78f0a70b0ebebf489255ae67b5f), addEntry and >>>> processEntry methods are synchronized. If it is synchronized then I dont >>>> get what is the point in having 'writeThreadPool' in >>>> BookieRequestProcessor, if anyhow they are going to be executed >>>> sequentially because of synchronized addEntry method in >>>> InterleavedLedgerStorage. >>>> >>> >>> When InterleavedLedgerStore is used in the context of SortedLedgerStore, >>> the addEntry and processEntry are only called when flushing happened. The >>> flushing happens in background thread, which is effectively running >>> sequentially. But adding to the memtable happens concurrently. >>> >>> The reason of having 'writeThreadPool' is more on separating writes and >>> reads into different thread pools. so writes will not be affected by reads. >>> In the context of SortedLedgerStore, the 'writeThreadPool' adds the >>> concurrency. >>> >>> >>>> >>>> If we look at the implementation of addEntry and processEntry method, >>>> 'somethingWritten' can be made threadsafe by using AtomicBoolean, >>>> ledgerCache.updateLastAddConfirmed and entryLogger.addEntry methods >>>> are inherently threadsafe. >>>> >>>> I'm not sure about semantics of ledgerCache.putEntryOffset method here. >>>> I'm not confident enough to say if LedgerCacheImpl and IndexInMemPageMgr >>>> (and probably IndexPersistenceMgr) are thread-safe classes. >>>> >>> >>> LedgerCacheImpl and IndexInMemPageMgr are thread-safe classes. You can >>> confirm this from SortedLedgerStorage. >>> >>> >>>> >>>> As far as I understood, if ledgerCache.putEntryOffset is thread safe, >>>> then I dont see the need of synchronization for those methods. In any case, >>>> if they are not thread-safe can you please say why it is not thread-safe >>>> and how we can do more granular synchronization at LedgerCacheImpl level, >>>> so that we can remove the need of synchrnoization at >>>> InterleavedLedgerStorage level. >>>> >>> >>> I don't see any reason why we can't remove the synchronization. >>> >>> >>>> >>>> I'm currently working on Multiple Entrylogs - >>>> https://issues.apache.org/jira/browse/BOOKKEEPER-1041. >>>> >>> >>> I am wondering if your multiple-entrylogs approach is making things >>> complicated. I have been thinking there can be a simpler approach achieving >>> the same goal: for example, having a ledger storage comprised of N >>> interleaved/sorted ledger storages, which they share same LedgerCache, but >>> having different memtables (for sortedledgerstore) and different entry log >>> files. >>> >> >> This is more cleaner approach. @charan can you comment? >> >> JV >> >> >>> >>> >>>> To reap the benefits of multipleentrylogs feature from performance >>>> perspective, this synchrnoization should be taken care or atleast bring it >>>> down to more granular synchronization (if possible). >>>> >>>> @Override >>>> synchronized public long addEntry(ByteBuffer entry) throws >>>> IOException { >>>> long ledgerId = entry.getLong(); >>>> long entryId = entry.getLong(); >>>> long lac = entry.getLong(); >>>> entry.rewind(); >>>> processEntry(ledgerId, entryId, entry); >>>> ledgerCache.updateLastAddConfirmed(ledgerId, lac); >>>> return entryId; >>>> } >>>> >>>> synchronized protected void processEntry(long ledgerId, long >>>> entryId, ByteBuffer entry, boolean rollLog) >>>> throws IOException { >>>> somethingWritten = true; >>>> long pos = entryLogger.addEntry(ledgerId, entry, rollLog); >>>> ledgerCache.putEntryOffset(ledgerId, entryId, pos); >>>> } >>>> >>>> Thanks, >>>> Charan >>>> >>> >>> >> >> >> -- >> Jvrao >> --- >> First they ignore you, then they laugh at you, then they fight you, then >> you win. - Mahatma Gandhi >> >> >> >