this is just GREAT. I am not just appreciating the performance, but also the refactoring done to Partition and Cursors; this makes implementing a new partition easy (based on what I am experiencing right now while building a Mavibot-based partition).
thanks for the hard work, and +1 to merge back to trunk.

On Wed, Sep 19, 2012 at 8:04 PM, Emmanuel Lécharny <[email protected]> wrote:
> Last year, in September, I detected a problem when conducting concurrent
> searches and modifications [1]. This problem is a real show-stopper.
>
> The problem was that as we were fetching the entries one by one, using
> indexes, if some modification occurred and impacted the index we were using
> while the search was being processed, we might not be able to get back the
> values we were pointing to in the search. This results in an NPE, at best.
>
> We looked for ways to solve this issue, and a few of them are available:
>
> 1) The first solution would be to implement an in-memory MVCC system, with
> a write-ahead log. The idea is to keep in memory all the modified values
> until all the pending searches are completed. They can then be removed from
> memory. This solution would be cross-partition.
>
> 2) The second solution would be to use MVCC backends. As we don't remove
> any element until it is no longer used, this guarantees that we will always
> be able to fetch an element from the backend, even if it has been modified.
> We will just get back an old revision.
>
> 3) A third solution, which does not imply that we implement a global MVCC
> system or an MVCC backend, would be to serialize writes and reads. Reads
> would be concurrent, and writes could only be done when all the pending
> reads and writes are completed. Of course, if a read is received while a
> write is being processed, this read will have to wait for the write to
> complete. In order to keep performance good, we need to speed up the reads
> so that writes are not delayed forever: this is done by computing the set
> of candidates immediately, before fetching the entries one by one later.
> If an entry has been removed while we fetch it, then this entry will simply
> be ignored.
>
> Those three solutions have pros and cons. We will analyze them in a later
> paragraph.
>
> I wanted to play around with the 2nd idea, and as I needed an in-memory
> MVCC BTree for that, I wrote a prototype BTree for that purpose (Mavibot [2]).
>
> While I thought Mavibot was ready (more or less), I created a new branch
> (apacheds-mvbt) to experiment with the new backend. I had to do some
> cleanup in order to be able to use the Mavibot implementation:
>
> o Use UUID instead of Long for Entry identifiers
> o Simplify the cursor implementation and the use of generics [3]
>
> After a few days, I found that some of the technical choices we had made
> were problematic, and I went a bit further:
>
> 1) The Cursor interface supposes you can move forward and backward. Sadly,
> when you have a search using a filter with an 'OR' connector, there is no
> way you can move backward. Not that it's really a problem in standard LDAP
> usage, but it would defeat the VLV control.
>
> 2) Not fetching entries immediately would potentially save time, if the
> entry is not a valid candidate against the filter we use. However, using
> indexes to check whether an entry is valid means we go down a BTree, which
> is also a costly operation. If we have a filter with N indexed attributes,
> we will go down N BTrees for each candidate, instead of fetching the entry
> once and simply validating it against the filters using comparators.
>
> 3) We were using generics extensively, in a way that made it extremely
> confusing to know what is what (having 3 parameters for an IndexCursor
> does not help at all).
>
> 4) We were using a reverse table for every index, even when it was useless.
>
> 5) The Index initialization was problematic.
>
> 6) Once we are using UUID instead of Long, there is no reason to use the
> entryUUID index anymore; we can do a full scan using the MasterTable
> directly.
>
> 7) The Cursor hierarchy was not fully consistent.
>
> 8) Entries were fetched more than once for complex filters, badly slowing
> down the processing of such searches.
>
> 9) Some evaluators could be removed, sparing some CPU, when the number of
> candidates they selected was 0.
>
> 10) Some Or and And cursors could be removed when they had only one
> candidate.
>
> All those items have been fixed in the mvbt branch.
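Going back to solution 3 for a moment: the two-phase approach ("compute the candidate set first, fetch the entries later, ignore the ones that have disappeared") can be sketched in a few lines of Java. This is only an illustration of the idea; the class and method names below are mine, not the actual classes in the branch:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.UUID;

    // A minimal sketch of the two-phase search, not the actual branch code.
    public class TwoPhaseSearchSketch
    {
        /**
         * Phase 1: walk the indexes once and collect every candidate UUID.
         * This step is quick, so a pending write is not blocked for long.
         */
        public List<UUID> gatherCandidates( /* filter, indexes... */ )
        {
            List<UUID> candidates = new ArrayList<>();
            // ... evaluate the filter against the indexes and collect UUIDs ...
            return candidates;
        }

        /**
         * Phase 2: fetch the entries one by one from the master table,
         * simply skipping any entry deleted since the set was built.
         */
        public <E> List<E> fetchEntries( Map<UUID, E> masterTable, List<UUID> candidates )
        {
            List<E> entries = new ArrayList<>();

            for ( UUID uuid : candidates )
            {
                E entry = masterTable.get( uuid );

                if ( entry != null )   // removed in the meantime? just ignore it
                {
                    entries.add( entry );
                }
            }

            return entries;
        }
    }

A nice side effect, as the next part of the mail points out, is that once the candidate list is materialized, moving backward through the results is as easy as moving forward.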
> Then, before implementing Mavibot, I found that we could implement the 3rd
> strategy easily. The idea is to allow concurrent searches, while writes
> would be serialized.
>
> The problem was that with the way we were using cursors, we might browse a
> BTree for a very long time - especially when processing a PagedSearch - and
> we can't expect the writes to wait until such a search is completed. Another
> solution was to gather all the candidates *before* fetching any entry, and
> to store their associated UUIDs in a list of candidates.
>
> This is no different from how we were already conducting searches with an
> OR in the filter: it also gathers all the candidate UUIDs up to the end of
> the search.
>
> Doing so offers many advantages:
> - we can exit the search faster, with all the candidates already selected
> (even for a paged search), and the writes can be applied more frequently
> - we speed up searches where more than one filter can be used to select the
> valid candidates
> - now that the set has been created, we can move forward *and* backward,
> something we can't do with the current implementation when we use an OR/AND
> connector in the filter.
>
> There are some cons, though:
> - gathering candidate UUIDs may eat memory
> - in some cases, it may be slower than the current implementation, just
> because the number of selected candidates may be way higher than the number
> of entries we will return (we pay the cost of extra entry fetches in this
> case)
>
> But even then, we already have to keep a set of candidate UUIDs when doing
> a search with an OR/AND connector in the current implementation, and it's
> not proven that fetching an entry is more costly than fetching a <key/value>
> of a given index - not to mention that the math may change a lot if we have
> more than one node in the filter.
>
> Now, in order to fix the concurrency issue we found last year, we have to
> add a couple of ReadWriteLocks:
>
> o one in the OperationManager, to forbid writes from being executed while
> some searches hold a ReadLock, and to forbid a search from being executed
> while a write holds the WriteLock
> o another one in the AbstractBtreePartition, to protect the master table
> from concurrent access, as the searches will directly fetch the entries
> from the MasterTable - and the RdnIndex - even when some writes are
> updating them.
>
> The current branch implements this algorithm. It works, and it's fast.
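For readers who want to see what this locking scheme looks like in code: the OperationManager-level lock maps directly onto java.util.concurrent.locks.ReentrantReadWriteLock. Below is a minimal sketch of the pattern only; the class and method names are illustrative, not the real ApacheDS API:

    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Minimal sketch: concurrent reads, exclusive writes.
    // The class and its methods are placeholders, not the ApacheDS API.
    public class SerializedAccessManager
    {
        // Fair mode, so a steady stream of searches cannot starve a pending write
        private final ReadWriteLock lock = new ReentrantReadWriteLock( true );

        public void search( Object searchRequest )
        {
            lock.readLock().lock();   // many searches may hold this at once
            try
            {
                // ... compute the candidate set, then fetch the entries ...
            }
            finally
            {
                lock.readLock().unlock();
            }
        }

        public void modify( Object modifyRequest )
        {
            lock.writeLock().lock();  // waits until all pending reads complete
            try
            {
                // ... apply the modification ...
            }
            finally
            {
                lock.writeLock().unlock();
            }
        }
    }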
> I have tested the concurrent modifications and searches that demonstrated
> the issue last year: I was able to run 10 000 loops over 20 threads, each
> one doing:
> - add cn=test<N>
> - search subtree cn=test*
> - delete cn=test<N>
> where N is the thread number.
>
> This test was failing badly on trunk (the more threads we have, the more
> errors we get), and is now running just fine.
>
> (The test was failing because while doing the search, we were fetching
> entries from the backend while some deletions were being done at the same
> time.)
>
> One more thing: performance. The branch is actually faster than the
> current trunk. Searches were conducted using the SearchPerfIT test.
>
> Here are the performance numbers I get when using the embedded LDAP server:
>
> 1.5.0 (FTR):
> - Object Search -> 32 993/s
> - OneLevel Search -> 156 935/s
> - SubLevel Search -> 137 225/s
>
> Trunk:
> - Object Search -> 57 191/s
> - OneLevel Search -> 76 250/s
> - SubLevel Search -> 77 200/s
>
> MVBT branch:
> - Object Search -> 76 540/s (+33% vs trunk)
> - OneLevel Search -> 109 985/s (+44% vs trunk)
> - SubLevel Search -> 134 750/s (+74% vs trunk)
>
> We get back one entry for the Object search, 5 for the OneLevel search and
> 10 for the SubLevel search. What we measure is the number of entries we get
> back per second.
>
> The 1.5.0 version is actually faster for OneLevel and SubLevel searches
> because we were storing the DN in the entry - rebuilding the DN is a costly
> operation.
>
> The gain is even higher when we use a complex filter, due to the numerous
> and spurious fetches of the same entry in trunk.
>
> The same tests, but over the network:
>
> MVBT branch:
> - Object : 2 564/s
> - One : 17 960/s
> - Sub : 23 050/s
>
> Trunk:
> - Object : 2 567/s
> - One : 12 555/s
> - Sub : 21 240/s
>
> Here, the gain is less significant, mainly because we have the client and
> the server on the same machine, and the message encoding/decoding shadows
> the gain in the backend. But still, we see some gap in the OneLevel and
> SubLevel searches (between 8% and 43%).
>
> At this point, I think we have a working server, solving a critical issue
> we have been fighting for more than one year now, and it's actually faster
> than the current trunk. It also solves some other issues.
>
> Before continuing with the Mavibot backend implementation - which will
> bring some major performance boosts, AFAICT - I would suggest we promote
> the current mvbt branch to trunk, and cut a 2.0.0-M8 release asap. I do
> think that it's a better, more stable, and faster working base for the
> upcoming versions and, more importantly, for the 2.0 GA.
>
> We should also continue the ongoing work on the other solutions, as they
> offer some extra advantages, but we may have them in a 2.1 or a later
> version.
>
> so, wdyt?
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
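For anyone wanting to reproduce the stress test described above, its shape is roughly the following. This is only a reconstruction from the description in the mail; the ldap* helpers are placeholders, not the actual test code:

    // Rough reconstruction of the stress test: N threads, each looping
    // add / search / delete on its own entry.
    public class ConcurrentSearchModifyTest
    {
        private static final int THREADS = 20;
        private static final int LOOPS = 10_000;

        public static void main( String[] args ) throws InterruptedException
        {
            Thread[] threads = new Thread[THREADS];

            for ( int n = 0; n < THREADS; n++ )
            {
                final int threadNumber = n;

                threads[n] = new Thread( () ->
                {
                    for ( int i = 0; i < LOOPS; i++ )
                    {
                        ldapAdd( "cn=test" + threadNumber );    // add cn=test<N>
                        ldapSearchSubtree( "(cn=test*)" );      // search subtree cn=test*
                        ldapDelete( "cn=test" + threadNumber ); // delete cn=test<N>
                    }
                } );

                threads[n].start();
            }

            for ( Thread t : threads )
            {
                t.join();
            }
        }

        // Placeholders standing in for real LDAP operations
        private static void ldapAdd( String rdn ) { /* ... */ }
        private static void ldapSearchSubtree( String filter ) { /* ... */ }
        private static void ldapDelete( String rdn ) { /* ... */ }
    }

On trunk this kind of loop fails more and more often as the thread count grows; on the branch it runs clean, which matches what I see while building the Mavibot-based partition.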
--
Kiran Ayyagari
http://keydap.com