Last year, in September, I detected a problem when conducting concurrent searches and modifications [1]. This problem is a real show-stopper.

The problem was that, as we fetch the entries one by one using indexes, if some modification impacts the index we are using while the search is being processed, we may not be able to get back the values the search is pointing to. This results in an NPE, at best.
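To make the failure mode concrete, here is a heavily simplified, hypothetical sketch (none of these names are the real ApacheDS classes) of the race: the index hands us a candidate ID, a concurrent write removes the entry, and the subsequent fetch returns null that we then dereference.

import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical, simplified stand-ins for the real structures: an index that
// yields candidate IDs, and a master table mapping an ID to its entry.
public class ConcurrentSearchRace {
    static List<UUID> indexCandidates;     // IDs selected through an index
    static Map<UUID, String> masterTable;  // ID -> entry (here just its DN)

    static void search() {
        for (UUID id : indexCandidates) {
            // A concurrent delete can remove the entry between the index
            // lookup and this fetch, so get() may return null...
            String entry = masterTable.get(id);

            // ...and the dereference below then throws a NullPointerException.
            System.out.println(entry.toUpperCase());
        }
    }
}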

We looked for ways to solve this issue, and a few of them are available.
1) The first solution would be to implement an in-memory MVCC system, with a write-ahead log. The idea is to keep in memory all the modified values until all the pending searches are completed; they can then be removed from memory. This solution would be cross-partition.

2) The second solution would be to use MVCC backends. As we don't remove any element until it's not used anymore, we have a guarantee that we will always be able to fetch an element from the backend, even if it has been modified: we will just get back an old revision (a toy sketch of this principle follows the list).

3) A third solution, which does not imply that we implement a global MVCC system or an MVCC backend, would be to serialize writes and reads. Reads would be concurrent, and writes could only be done when all the pending reads and writes are completed. Of course, if a read is received while a write is being processed, this read will have to wait for the write completion. In order to keep performance good, we need to speed up the reads so that writes are not delayed forever: this is done by computing the set of candidates immediately, and fetching the entries one by one later. If an entry has been removed while we fetch it, then this entry will simply be ignored.
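To give a feel for what the 2nd option buys us, here is a toy in-memory sketch of the MVCC principle (my own illustration, not Mavibot's actual design): writers create new revisions instead of overwriting values, and a reader that pinned a revision at the start of a search always gets back a consistent, possibly older, value.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy MVCC map, for illustration only: old revisions are kept around, so a
// reader pinned on a past revision never sees a value vanish under it.
public class MvccMapSketch<K, V> {
    private final AtomicLong revision = new AtomicLong(0);
    // For each key, all its revisions, ordered by revision number
    private final ConcurrentHashMap<K, ConcurrentSkipListMap<Long, V>> store = new ConcurrentHashMap<>();

    /** A write creates a new revision instead of overwriting the old value. */
    public void put(K key, V value) {
        long rev = revision.incrementAndGet();
        store.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>()).put(rev, value);
    }

    /** A reader captures the current revision once, at the start of the search. */
    public long beginRead() {
        return revision.get();
    }

    /** Fetch the value as it was at the pinned revision: possibly old, but consistent. */
    public V get(K key, long pinnedRevision) {
        ConcurrentSkipListMap<Long, V> revisions = store.get(key);
        if (revisions == null) {
            return null; // the key never existed at all
        }
        java.util.Map.Entry<Long, V> entry = revisions.floorEntry(pinnedRevision);
        return entry == null ? null : entry.getValue();
    }
}

In a real MVCC backend, the old revisions would of course be reclaimed once no pending search references them anymore.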

Those three solutions have pros and cons. We will analyze them in a later paragraph.

I wanted to play around with the 2nd idea, and as I needed an in-memory MVCC BTree for that, I wrote a prototype BTree for this purpose (Mavibot [2]).

As I thought Mavibot was ready (more or less), I created a new branch (apacheds-mvbt) to experiment with the new backend. I had to do some cleanup in order to be able to use the Mavibot implementation:

o Use UUID instead of Long for Entry's identifiers
o Simplify the cursor implementation and the use of generics [3]


After a few days, I found that some technical choices we had made were problematic, and I went a bit further:

1) The Cursor interface supposes you can move forward and backward. Sadly, when you have a search using a filter with an 'OR' connector, there is no way you can move backward. Not that it's really a problem in standard usage of LDAP, but it would defeat the VLV control.

2) Not fetching entries immediately would potentially save time if the entry is not a valid candidate against the filter we use. However, using indexes to check whether an entry is valid means we go down a BTree, which is also a costly operation. If we have a filter with N indexed attributes, we will go down N BTrees for each candidate, instead of fetching the entry once and simply validating it against the filter using comparators (see the sketch after this list).

3) We were using generics extensively, in a way that made it extremely confusing to know what is what (having 3 type parameters for an IndexCursor does not help at all)

4) We were using a reverse table for every index, even when it was useless.

5) The Index initialization was problematic

6) Once we are using UUID instead of Long, there is no reason to use the entryUUID index anymore: we can do a full scan using the MasterTable directly

7) The Cursor hierarchy was not fully consistent

8) Entries were fetched more than once for complex filters, badly slowing down the processing of such searches

9) Some evaluators could be removed, sparing some CPU, when the number of candidates they were selecting was 0

10) Some Or and And cursors could be removed when they had only one candidate
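To illustrate item 2, here is a simplified, hypothetical sketch (none of these names are the actual ApacheDS classes) of validating candidates by fetching each entry once and evaluating an AND filter in memory, instead of descending one BTree per indexed attribute for every candidate.

import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical illustration of item 2: fetch the entry once, then validate it
// against the filter with plain in-memory comparisons.
public class CandidateValidationSketch {
    // Candidate ID -> entry, where an entry is just attribute -> values here
    static Map<UUID, Map<String, Set<String>>> masterTable;

    // AND filter with N attribute/value clauses, e.g. (&(cn=test)(sn=doe))
    static boolean matches(Map<String, Set<String>> entry, Map<String, String> andFilter) {
        for (Map.Entry<String, String> clause : andFilter.entrySet()) {
            Set<String> values = entry.get(clause.getKey());
            // One in-memory comparison per clause, no extra BTree descent
            if (values == null || !values.contains(clause.getValue())) {
                return false;
            }
        }
        return true;
    }

    static int countValid(List<UUID> candidates, Map<String, String> andFilter) {
        int valid = 0;
        for (UUID id : candidates) {
            Map<String, Set<String>> entry = masterTable.get(id); // fetched once
            if (entry != null && matches(entry, andFilter)) {
                valid++;
            }
        }
        return valid;
    }
}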

All those items have been fixed in the mvbt-branch.

Then, before implementing the Mavibot backend, I found that we could implement the 3rd strategy easily. The idea is to allow concurrent searches, but to serialize the writes.

The problem was that, with the way we were using cursors, we might browse a BTree for a very long time - especially when processing a PagedSearch - and we can't expect the writes to wait until such a search is completed. The solution was to gather all the candidates *before* fetching any entry, and to store their associated UUIDs in a list of candidates.

This is not any different from how we were already conducting searches with an OR in the filter: that also gathers all the candidate UUIDs up to the end of the search.
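A minimal sketch of this two-phase search (hypothetical names, not the actual cursor classes): first freeze the candidate UUIDs, then fetch the entries, silently skipping anything a concurrent write removed in between.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical two-phase search: phase 1 gathers the candidate UUIDs while
// the indexes are browsed, phase 2 fetches the entries one by one and simply
// ignores anything that a write removed in the meantime.
public class TwoPhaseSearchSketch {
    static Map<UUID, String> masterTable; // ID -> entry (just a DN here)

    static List<String> search(Iterable<UUID> indexCandidates) {
        // Phase 1: collect the candidate set up front (index-only, no entry fetch)
        List<UUID> candidates = new ArrayList<>();
        for (UUID id : indexCandidates) {
            candidates.add(id);
        }

        // Phase 2: fetch lazily; a removed entry is just skipped, no NPE
        List<String> results = new ArrayList<>();
        for (UUID id : candidates) {
            String entry = masterTable.get(id);
            if (entry != null) {
                results.add(entry);
            }
        }
        return results;
    }
}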

Doing so offers many advantages:
- we can exit the search faster, with all the candidates already selected (even for a paged search), and the writes can be applied more frequently
- we speed up searches where more than one filter term can be used to select the valid candidates
- now that the set has been created, we can move forward *and* backward, something we can't do with the current implementation when we use an OR/AND connector in the filter.

There are some cons though:
- gathering candidate UUIDs may eat memory
- in some cases, it may be slower than the current implementation, just because the number of selected candidates may be way higher than the number of entries we will return (we pay the cost of extra entry fetches in this case)

But even then, we already have to keep a set of candidate UUIDs when doing a search with an OR/AND connector in the current implementation, and it's not proven that fetching an entry is more costly than fetching a <key, value> pair from a given index - not to mention that the math may change a lot if we have more than one node in the filter.

Now, in order to fix the concurrency issue we found last year, we have to add a couple of ReadWriteLocks:
o one in the OperationManager, to forbid writes from being executed while some searches hold a ReadLock, and to forbid a search from being executed while a write holds the WriteLock
o another one in the AbstractBtreePartition, to protect the master table from concurrent access, as the searches will directly fetch the entries from the MasterTable - and the RdnIndex -, even while some writes are updating them.
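As a rough illustration of the OperationManager-level lock (names are mine, not the actual code), the pattern is the standard ReentrantReadWriteLock one: many concurrent searches share the read lock, while a write takes the write lock and therefore waits for pending searches and excludes new ones until it completes.

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the operation-level lock: searches are concurrent, writes are
// serialized against both searches and other writes.
public class OperationLockSketch {
    private final ReadWriteLock opLock = new ReentrantReadWriteLock();

    public void search(Runnable doSearch) {
        opLock.readLock().lock();
        try {
            doSearch.run();   // candidate gathering happens under the read lock
        } finally {
            opLock.readLock().unlock();
        }
    }

    public void write(Runnable doWrite) {
        opLock.writeLock().lock();
        try {
            doWrite.run();    // add/modify/delete, serialized
        } finally {
            opLock.writeLock().unlock();
        }
    }
}

The AbstractBtreePartition lock would follow the same pattern, just scoped to the MasterTable (and RdnIndex) accesses.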

The current branch implements this algorithm. It works, and it's fast. I have tested the concurrent modifications and searches that demonstrated the issue last year: I was able to run 10 000 loops over 20 threads, each one doing:
- add cn=test<N>
- search subtree cn=test*
- delete cn=test<N>
where N is the thread number.
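For reference, one thread of that stress test looks roughly like this (a sketch written against the current Apache Directory LDAP API; the package names, host, port, credentials and base DN are my assumptions, not the actual test code from the branch):

import org.apache.directory.api.ldap.model.entry.DefaultEntry;
import org.apache.directory.api.ldap.model.message.SearchScope;
import org.apache.directory.ldap.client.api.LdapConnection;
import org.apache.directory.ldap.client.api.LdapNetworkConnection;

// Sketch of one thread of the stress test: add an entry, run a subtree
// search, delete the entry, and loop 10 000 times.
public class ConcurrentStressThread implements Runnable {
    private final int threadNumber;

    public ConcurrentStressThread(int threadNumber) {
        this.threadNumber = threadNumber;
    }

    @Override
    public void run() {
        String dn = "cn=test" + threadNumber + ",ou=system";
        LdapConnection connection = new LdapNetworkConnection("localhost", 10389);

        try {
            connection.bind("uid=admin,ou=system", "secret");

            for (int i = 0; i < 10000; i++) {
                connection.add(new DefaultEntry(dn,
                    "objectClass: person",
                    "cn: test" + threadNumber,
                    "sn: test" + threadNumber));

                // Subtree search crossing the entries the other threads are
                // adding and deleting at the same time
                connection.search("ou=system", "(cn=test*)", SearchScope.SUBTREE, "*").close();

                connection.delete(dn);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                connection.close();
            } catch (Exception e) {
                // ignore on shutdown
            }
        }
    }
}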

This test was badly failing on trunk (the more threads we have, the more errors we get), and is now running just fine.

(The test was failing because, while doing the search, we were fetching entries from the backend while some deletions were being done at the same time.)

One more thing: performance. The branch is actually faster than the current trunk. The searches were conducted using the SearchPerfIT test.

Here are the performances I get when using the embedded LDAP server :

1.5.0 (FTR) :
- Object   Search ->  32 993/s
- OneLevel Search -> 156 935/s
- SubLevel Search -> 137 225/s

Trunk :
- Object   Search ->  57 191/s
- OneLevel Search ->  76 250/s
- SubLevel Search ->  77 200/s

MVBT branch :
- Object   Search ->  76 540/s (+33%/trunk)
- OneLevel Search -> 109 985/s (+44%/trunk)
- SubLevel Search -> 134 750/s (+74%/trunk)

We get back one entry for the Object search, 5 for the OneLevel search and 10 for the SubLevel search. What we measure is the number of entries we get back per second.

The 1.5.0 version is actually faster for OneLevel and SubLevel searches because we were storing the DN in the entry - rebuilding the DN is a costly operation.

The gain is even higher when we use a complex filter, due to the numerous and spurious fetches of the same entry in trunk.

The same tests, but over the network:
MVBT branch :
Object :  2 564/s
One    : 17 960/s
Sub    : 23 050/s

Trunk :
Object :  2 567/s
One    : 12 555/s
Sub    : 21 240/s

Here, the gain is less significant, mainly because we have the client and the server on the same machine, and the message encoding/decoding overshadows the gain in the backend. But still, we see some gap in the OneLevel and SubLevel searches (between 8% and 43%).

At this point, I think we have a working server, solving a critical issue we have been fighting for more than a year now, and it's actually faster than the current trunk. It also solves some other issues.

Before continuing with the Mavibot backend implementation - which will bring some major performance boosts, AFAICT - I would suggest we promote the current mvbt branch to trunk, and cut a 2.0.0-M8 release asap. I do think that it's a better, more stable, and faster working base for the upcoming versions and, more importantly, for the 2.0 GA.

We should also continue the ongoing work on the other solutions, as they offer some extra advantages, but we may get them in a 2.1 or a later version.

So, wdyt?

--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com
