this is just GREAT. I am not just appreciating the performance, but also the refactoring done to Partition and Cursors; this makes implementing a new partition easy (based on what I am experiencing right now while building a Mavibot-based partition).
thanks for the hard work, and +1 to merge back to trunk.

On Wed, Sep 19, 2012 at 8:04 PM, Emmanuel Lécharny <[email protected]> wrote:
> Last year, in September, I detected a problem when conducting concurrent
> searches and modifications [1]. This problem is a real show-stopper.
>
> The problem was that as we were fetching the entries one by one, using
> indexes, if some modification occurred and impacted the index we were using
> while the search was being processed, we might not be able to get back the
> values we were pointing to in the search. This results in an NPE, at best.
>
> We looked for ways to solve this issue, and a few of them are available:
>
> 1) The first solution would be to implement an in-memory MVCC system, with
> a write-ahead log. The idea is to keep in memory all the modified values
> until all the pending searches are completed. They can then be removed from
> memory. This solution would be cross-partition.
>
> 2) The second solution would be to use MVCC backends. As we don't remove
> any element until it is no longer used, this guarantees that we will always
> be able to fetch an element from the backend, even if it has been modified.
> We will just get back an old revision.
>
> 3) A third solution, which does not imply that we implement a global MVCC
> system or an MVCC backend, would be to serialize writes and reads. Reads
> would be concurrent, and writes could only be done when all the pending
> reads and writes are completed. Of course, if a read is received while a
> write is being processed, this read will have to wait for the write to
> complete. In order to keep performance good, we need to speed up the reads
> so that writes are not delayed forever: this is done by computing the set
> of candidates immediately, before fetching the entries one by one later.
> If an entry has been removed while we fetch it, then this entry will simply
> be ignored.
>
> Those three solutions have pros and cons. We will analyze them in a later
> paragraph.
>
> I wanted to play around with the 2nd idea, and as I needed an in-memory
> MVCC BTree for that, I wrote a prototype BTree for that purpose (Mavibot [2]).
>
> While I thought Mavibot was ready (more or less), I created a new branch
> (apacheds-mvbt) to experiment with the new backend. I had to do some
> cleanup in order to be able to use the Mavibot implementation:
>
> o Use UUID instead of Long for Entry identifiers
> o Simplify the cursor implementation and the use of generics [3]
>
> After a few days, I found that some of the technical choices we had made
> were problematic, and I went a bit further:
>
> 1) The Cursor interface supposes you can move forward and backward. Sadly,
> when you have a search using a filter with an 'OR' connector, there is no
> way you can move backward. Not that it's really a problem in standard LDAP
> usage, but it would defeat the VLV control.
>
> 2) Not fetching entries immediately would potentially save time, if the
> entry is not a valid candidate against the filter we use. However, using
> indexes to check whether an entry is valid means we go down a BTree, which
> is also a costly operation. If we have a filter with N indexed attributes,
> we will go down N BTrees for each candidate, instead of fetching the entry
> once and simply validating it against the filters using comparators.
>
> 3) We were using generics extensively, in a way that made it extremely
> confusing to know what is what (having 3 parameters for an IndexCursor
> does not help at all).
>
> 4) We were using a reverse table for every index, even when it was useless.
>
> 5) The Index initialization was problematic.
>
> 6) Once we are using UUID instead of Long, there is no reason to use the
> entryUUID index anymore; we can do a full scan using the MasterTable
> directly.
>
> 7) The Cursor hierarchy was not fully consistent.
>
> 8) Entries were fetched more than once for complex filters, badly slowing
> down the processing of such searches.
>
> 9) Some evaluators could be removed, sparing some CPU, when the number of
> candidates they selected was 0.
>
> 10) Some Or and And cursors could be removed when they had only one
> candidate.
>
> All those items have been fixed in the mvbt branch.
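Going back to solution 3 for a moment: the two-phase approach ("compute the candidate set first, fetch the entries later, ignore the ones that have disappeared") can be sketched in a few lines of Java. This is only an illustration of the idea; the class and method names below are mine, not the actual classes in the branch:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.UUID;

    // A minimal sketch of the two-phase search, not the actual branch code.
    public class TwoPhaseSearchSketch
    {
        /**
         * Phase 1: walk the indexes once and collect every candidate UUID.
         * This step is quick, so a pending write is not blocked for long.
         */
        public List<UUID> gatherCandidates( /* filter, indexes... */ )
        {
            List<UUID> candidates = new ArrayList<>();
            // ... evaluate the filter against the indexes and collect UUIDs ...
            return candidates;
        }

        /**
         * Phase 2: fetch the entries one by one from the master table,
         * simply skipping any entry deleted since the set was built.
         */
        public <E> List<E> fetchEntries( Map<UUID, E> masterTable, List<UUID> candidates )
        {
            List<E> entries = new ArrayList<>();

            for ( UUID uuid : candidates )
            {
                E entry = masterTable.get( uuid );

                if ( entry != null )   // removed in the meantime? just ignore it
                {
                    entries.add( entry );
                }
            }

            return entries;
        }
    }

A nice side effect, as the next part of the mail points out, is that once the candidate list is materialized, moving backward through the results is as easy as moving forward.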
> Then, before implementing Mavibot, I found that we could implement the 3rd
> strategy easily. The idea is to allow concurrent searches, while writes
> would be serialized.
>
> The problem was that with the way we were using cursors, we might browse a
> BTree for a very long time - especially when processing a PagedSearch - and
> we can't expect the writes to wait until such a search is completed. Another
> solution was to gather all the candidates *before* fetching any entry, and
> to store their associated UUIDs in a list of candidates.
>
> This is no different from how we were already conducting searches with an
> OR in the filter: it also gathers all the candidate UUIDs up to the end of
> the search.
>
> Doing so offers many advantages:
> - we can exit the search faster, with all the candidates already selected
> (even for a paged search), and the writes can be applied more frequently
> - we speed up searches where more than one filter can be used to select the
> valid candidates
> - now that the set has been created, we can move forward *and* backward,
> something we can't do with the current implementation when we use an OR/AND
> connector in the filter.
>
> There are some cons, though:
> - gathering candidate UUIDs may eat memory
> - in some cases, it may be slower than the current implementation, just
> because the number of selected candidates may be way higher than the number
> of entries we will return (we pay the cost of extra entry fetches in this
> case)
>
> But even then, we already have to keep a set of candidate UUIDs when doing
> a search with an OR/AND connector in the current implementation, and it's
> not proven that fetching an entry is more costly than fetching a <key/value>
> of a given index - not to mention that the math may change a lot if we have
> more than one node in the filter.
>
> Now, in order to fix the concurrency issue we found last year, we have to
> add a couple of ReadWriteLocks:
>
> o one in the OperationManager, to forbid writes from being executed while
> some searches hold a ReadLock, and to forbid a search from being executed
> while a write holds the WriteLock
> o another one in the AbstractBtreePartition, to protect the master table
> from concurrent access, as the searches will directly fetch the entries
> from the MasterTable - and the RdnIndex - even when some writes are
> updating them.
>
> The current branch implements this algorithm. It works, and it's fast.
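For readers who want to see what this locking scheme looks like in code: the OperationManager-level lock maps directly onto java.util.concurrent.locks.ReentrantReadWriteLock. Below is a minimal sketch of the pattern only; the class and method names are illustrative, not the real ApacheDS API:

    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Minimal sketch: concurrent reads, exclusive writes.
    // The class and its methods are placeholders, not the ApacheDS API.
    public class SerializedAccessManager
    {
        // Fair mode, so a steady stream of searches cannot starve a pending write
        private final ReadWriteLock lock = new ReentrantReadWriteLock( true );

        public void search( Object searchRequest )
        {
            lock.readLock().lock();   // many searches may hold this at once
            try
            {
                // ... compute the candidate set, then fetch the entries ...
            }
            finally
            {
                lock.readLock().unlock();
            }
        }

        public void modify( Object modifyRequest )
        {
            lock.writeLock().lock();  // waits until all pending reads complete
            try
            {
                // ... apply the modification ...
            }
            finally
            {
                lock.writeLock().unlock();
            }
        }
    }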
> I have tested the concurrent modifications and searches that demonstrated
> the issue last year: I was able to run 10 000 loops over 20 threads, each
> one doing:
> - add cn=test<N>
> - search subtree cn=test*
> - delete cn=test<N>
> where N is the thread number.
>
> This test was failing badly on trunk (the more threads we have, the more
> errors we get), and is now running just fine.
>
> (The test was failing because while doing the search, we were fetching
> entries from the backend while some deletions were being done at the same
> time.)
>
> One more thing: performance. The branch is actually faster than the
> current trunk. Searches were conducted using the SearchPerfIT test.
>
> Here are the performance numbers I get when using the embedded LDAP server:
>
> 1.5.0 (FTR):
> - Object Search -> 32 993/s
> - OneLevel Search -> 156 935/s
> - SubLevel Search -> 137 225/s
>
> Trunk:
> - Object Search -> 57 191/s
> - OneLevel Search -> 76 250/s
> - SubLevel Search -> 77 200/s
>
> MVBT branch:
> - Object Search -> 76 540/s (+33% vs trunk)
> - OneLevel Search -> 109 985/s (+44% vs trunk)
> - SubLevel Search -> 134 750/s (+74% vs trunk)
>
> We get back one entry for the Object search, 5 for the OneLevel search and
> 10 for the SubLevel search. What we measure is the number of entries we get
> back per second.
>
> The 1.5.0 version is actually faster for OneLevel and SubLevel searches
> because we were storing the DN in the entry - rebuilding the DN is a costly
> operation.
>
> The gain is even higher when we use a complex filter, due to the numerous
> and spurious fetches of the same entry in trunk.
>
> The same tests, but over the network:
>
> MVBT branch:
> - Object : 2 564/s
> - One : 17 960/s
> - Sub : 23 050/s
>
> Trunk:
> - Object : 2 567/s
> - One : 12 555/s
> - Sub : 21 240/s
>
> Here, the gain is less significant, mainly because we have the client and
> the server on the same machine, and the message encoding/decoding shadows
> the gain in the backend. But still, we see some gap in the OneLevel and
> SubLevel searches (between 8% and 43%).
>
> At this point, I think we have a working server, solving a critical issue
> we have been fighting for more than one year now, and it's actually faster
> than the current trunk. It also solves some other issues.
>
> Before continuing with the Mavibot backend implementation - which will
> bring some major performance boosts, AFAICT - I would suggest we promote
> the current mvbt branch to trunk, and cut a 2.0.0-M8 release asap. I do
> think that it's a better, more stable, and faster working base for the
> upcoming versions and, more importantly, for the 2.0 GA.
>
> We should also continue the ongoing work on the other solutions, as they
> offer some extra advantages, but we may have them in a 2.1 or a later
> version.
>
> so, wdyt?
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
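For anyone wanting to reproduce the stress test described above, its shape is roughly the following. This is only a reconstruction from the description in the mail; the ldap* helpers are placeholders, not the actual test code:

    // Rough reconstruction of the stress test: N threads, each looping
    // add / search / delete on its own entry.
    public class ConcurrentSearchModifyTest
    {
        private static final int THREADS = 20;
        private static final int LOOPS = 10_000;

        public static void main( String[] args ) throws InterruptedException
        {
            Thread[] threads = new Thread[THREADS];

            for ( int n = 0; n < THREADS; n++ )
            {
                final int threadNumber = n;

                threads[n] = new Thread( () ->
                {
                    for ( int i = 0; i < LOOPS; i++ )
                    {
                        ldapAdd( "cn=test" + threadNumber );    // add cn=test<N>
                        ldapSearchSubtree( "(cn=test*)" );      // search subtree cn=test*
                        ldapDelete( "cn=test" + threadNumber ); // delete cn=test<N>
                    }
                } );

                threads[n].start();
            }

            for ( Thread t : threads )
            {
                t.join();
            }
        }

        // Placeholders standing in for real LDAP operations
        private static void ldapAdd( String rdn ) { /* ... */ }
        private static void ldapSearchSubtree( String filter ) { /* ... */ }
        private static void ldapDelete( String rdn ) { /* ... */ }
    }

On trunk this kind of loop fails more and more often as the thread count grows; on the branch it runs clean, which matches what I see while building the Mavibot-based partition.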
--
Kiran Ayyagari
http://keydap.com