Last year, in September, I detected a problem when conducting concurrent searches and modifications [1]. This problem is a real show-stopper.

The problem was that, as we fetch the entries one by one using indexes, if some modification impacts the index we are using while the search is being processed, we may not be able to get back the values the search is pointing to. This results in an NPE, at best.
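To make the failure mode concrete, here is a heavily simplified, hypothetical sketch (none of these names are the real ApacheDS classes) of the race: the index hands us a candidate ID, a concurrent write removes the entry, and the subsequent fetch returns null that we then dereference.

import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical, simplified stand-ins for the real structures: an index that
// yields candidate IDs, and a master table mapping an ID to its entry.
public class ConcurrentSearchRace {
    static List<UUID> indexCandidates;     // IDs selected through an index
    static Map<UUID, String> masterTable;  // ID -> entry (here just its DN)

    static void search() {
        for (UUID id : indexCandidates) {
            // A concurrent delete can remove the entry between the index
            // lookup and this fetch, so get() may return null...
            String entry = masterTable.get(id);

            // ...and the dereference below then throws a NullPointerException.
            System.out.println(entry.toUpperCase());
        }
    }
}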

We looked for ways to solve this issue, and a few of them are available.
1) The first solution would be to implement an in-memory MVCC system, with a write-ahead log. The idea is to keep in memory all the modified values until all the pending searches are completed; they can then be removed from memory. This solution would be cross-partition.

2) The second solution would be to use MVCC backends. As we don't remove any element until it's not used anymore, we have a guarantee that we will always be able to fetch an element from the backend, even if it has been modified: we will just get back an old revision (a toy sketch of this principle follows the list).

3) A third solution, which does not imply that we implement a global MVCC system or an MVCC backend, would be to serialize writes and reads. Reads would be concurrent, and writes could only be done when all the pending reads and writes are completed. Of course, if a read is received while a write is being processed, this read will have to wait for the write completion. In order to keep performance good, we need to speed up the reads so that writes are not delayed forever: this is done by computing the set of candidates immediately, and fetching the entries one by one later. If an entry has been removed while we fetch it, then this entry will simply be ignored.
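To give a feel for what the 2nd option buys us, here is a toy in-memory sketch of the MVCC principle (my own illustration, not Mavibot's actual design): writers create new revisions instead of overwriting values, and a reader that pinned a revision at the start of a search always gets back a consistent, possibly older, value.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy MVCC map, for illustration only: old revisions are kept around, so a
// reader pinned on a past revision never sees a value vanish under it.
public class MvccMapSketch<K, V> {
    private final AtomicLong revision = new AtomicLong(0);
    // For each key, all its revisions, ordered by revision number
    private final ConcurrentHashMap<K, ConcurrentSkipListMap<Long, V>> store = new ConcurrentHashMap<>();

    /** A write creates a new revision instead of overwriting the old value. */
    public void put(K key, V value) {
        long rev = revision.incrementAndGet();
        store.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>()).put(rev, value);
    }

    /** A reader captures the current revision once, at the start of the search. */
    public long beginRead() {
        return revision.get();
    }

    /** Fetch the value as it was at the pinned revision: possibly old, but consistent. */
    public V get(K key, long pinnedRevision) {
        ConcurrentSkipListMap<Long, V> revisions = store.get(key);
        if (revisions == null) {
            return null; // the key never existed at all
        }
        java.util.Map.Entry<Long, V> entry = revisions.floorEntry(pinnedRevision);
        return entry == null ? null : entry.getValue();
    }
}

In a real MVCC backend, the old revisions would of course be reclaimed once no pending search references them anymore.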

Those three solutions have pros and cons. We will analyze them in a later paragraph.

I wanted to play around with the 2nd idea, and as I needed an in-memory MVCC BTree for that, I wrote a prototype BTree for this purpose (Mavibot [2]).

As I thought Mavibot was ready (more or less), I created a new branch (apacheds-mvbt) to experiment with the new backend. I had to do some cleanup in order to be able to use the Mavibot implementation:

o Use UUID instead of Long for Entry's identifiers
o Simplify the cursor implementation and the use of generics [3]


After a few days, I found that some technical choices we had made were problematic, and I went a bit further:

1) The Cursor interface supposes you can move forward and backward. Sadly, when you have a search using a filter with an 'OR' connector, there is no way you can move backward. Not that it's really a problem in standard usage of LDAP, but it would defeat the VLV control.

2) Not fetching entries immediately would potentially save time if the entry is not a valid candidate against the filter we use. However, using indexes to check whether an entry is valid means we go down a BTree, which is also a costly operation. If we have a filter with N indexed attributes, we will go down N BTrees for each candidate, instead of fetching the entry once and simply validating it against the filter using comparators (see the sketch after this list).

3) We were using generics extensively, in a way that made it extremely confusing to know what is what (having 3 type parameters for an IndexCursor does not help at all)

4) We were using a reverse table for every index, even when it was useless.

5) The Index initialization was problematic

6) Once we are using UUID instead of Long, there is no reason to use the entryUUID index anymore: we can do a full scan using the MasterTable directly

7) The Cursor hierarchy was not fully consistent

8) Entries were fetched more than once for complex filters, badly slowing down the processing of such searches

9) Some evaluators could be removed, sparing some CPU, when the number of candidates they were selecting was 0

10) Some Or and And cursors could be removed when they had only one candidate
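To illustrate item 2, here is a simplified, hypothetical sketch (none of these names are the actual ApacheDS classes) of validating candidates by fetching each entry once and evaluating an AND filter in memory, instead of descending one BTree per indexed attribute for every candidate.

import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical illustration of item 2: fetch the entry once, then validate it
// against the filter with plain in-memory comparisons.
public class CandidateValidationSketch {
    // Candidate ID -> entry, where an entry is just attribute -> values here
    static Map<UUID, Map<String, Set<String>>> masterTable;

    // AND filter with N attribute/value clauses, e.g. (&(cn=test)(sn=doe))
    static boolean matches(Map<String, Set<String>> entry, Map<String, String> andFilter) {
        for (Map.Entry<String, String> clause : andFilter.entrySet()) {
            Set<String> values = entry.get(clause.getKey());
            // One in-memory comparison per clause, no extra BTree descent
            if (values == null || !values.contains(clause.getValue())) {
                return false;
            }
        }
        return true;
    }

    static int countValid(List<UUID> candidates, Map<String, String> andFilter) {
        int valid = 0;
        for (UUID id : candidates) {
            Map<String, Set<String>> entry = masterTable.get(id); // fetched once
            if (entry != null && matches(entry, andFilter)) {
                valid++;
            }
        }
        return valid;
    }
}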

All those items have been fixed in the mvbt-branch.

Then, before implementing the Mavibot backend, I found that we could implement the 3rd strategy easily. The idea is to allow concurrent searches, but to serialize the writes.

The problem was that, with the way we were using cursors, we might browse a BTree for a very long time - especially when processing a PagedSearch - and we can't expect the writes to wait until such a search is completed. The solution was to gather all the candidates *before* fetching any entry, and to store their associated UUIDs in a list of candidates.

This is not any different from how we were already conducting searches with an OR in the filter: that also gathers all the candidate UUIDs up to the end of the search.
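A minimal sketch of this two-phase search (hypothetical names, not the actual cursor classes): first freeze the candidate UUIDs, then fetch the entries, silently skipping anything a concurrent write removed in between.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical two-phase search: phase 1 gathers the candidate UUIDs while
// the indexes are browsed, phase 2 fetches the entries one by one and simply
// ignores anything that a write removed in the meantime.
public class TwoPhaseSearchSketch {
    static Map<UUID, String> masterTable; // ID -> entry (just a DN here)

    static List<String> search(Iterable<UUID> indexCandidates) {
        // Phase 1: collect the candidate set up front (index-only, no entry fetch)
        List<UUID> candidates = new ArrayList<>();
        for (UUID id : indexCandidates) {
            candidates.add(id);
        }

        // Phase 2: fetch lazily; a removed entry is just skipped, no NPE
        List<String> results = new ArrayList<>();
        for (UUID id : candidates) {
            String entry = masterTable.get(id);
            if (entry != null) {
                results.add(entry);
            }
        }
        return results;
    }
}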

Doing so offers many advantages:
- we can exit the search faster, with all the candidates already selected (even for a paged search), and the writes can be applied more frequently
- we speed up searches where more than one filter term can be used to select the valid candidates
- now that the set has been created, we can move forward *and* backward, something we can't do with the current implementation when we use an OR/AND connector in the filter.

There are some cons though:
- gathering candidate UUIDs may eat memory
- in some cases, it may be slower than the current implementation, just because the number of selected candidates may be way higher than the number of entries we will return (we pay the cost of extra entry fetches in this case)

But even then, we already have to keep a set of candidate UUIDs when doing a search with an OR/AND connector in the current implementation, and it's not proven that fetching an entry is more costly than fetching a <key, value> pair from a given index - not to mention that the math may change a lot if we have more than one node in the filter.

Now, in order to fix the concurrency issue we found last year, we have to add a couple of ReadWriteLocks:
o one in the OperationManager, to forbid writes from being executed while some searches hold a ReadLock, and to forbid a search from being executed while a write holds the WriteLock
o another one in the AbstractBtreePartition, to protect the master table from concurrent access, as the searches will directly fetch the entries from the MasterTable - and the RdnIndex -, even while some writes are updating them.
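As a rough illustration of the OperationManager-level lock (names are mine, not the actual code), the pattern is the standard ReentrantReadWriteLock one: many concurrent searches share the read lock, while a write takes the write lock and therefore waits for pending searches and excludes new ones until it completes.

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the operation-level lock: searches are concurrent, writes are
// serialized against both searches and other writes.
public class OperationLockSketch {
    private final ReadWriteLock opLock = new ReentrantReadWriteLock();

    public void search(Runnable doSearch) {
        opLock.readLock().lock();
        try {
            doSearch.run();   // candidate gathering happens under the read lock
        } finally {
            opLock.readLock().unlock();
        }
    }

    public void write(Runnable doWrite) {
        opLock.writeLock().lock();
        try {
            doWrite.run();    // add/modify/delete, serialized
        } finally {
            opLock.writeLock().unlock();
        }
    }
}

The AbstractBtreePartition lock would follow the same pattern, just scoped to the MasterTable (and RdnIndex) accesses.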

The current branch implements this algorithm. It works, and it's fast. I have tested the concurrent modifications and searches that demonstrated the issue last year: I was able to run 10 000 loops over 20 threads, each one doing:
- add cn=test<N>
- search subtree cn=test*
- delete cn=test<N>
where N is the thread number.
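For reference, one thread of that stress test looks roughly like this (a sketch written against the current Apache Directory LDAP API; the package names, host, port, credentials and base DN are my assumptions, not the actual test code from the branch):

import org.apache.directory.api.ldap.model.entry.DefaultEntry;
import org.apache.directory.api.ldap.model.message.SearchScope;
import org.apache.directory.ldap.client.api.LdapConnection;
import org.apache.directory.ldap.client.api.LdapNetworkConnection;

// Sketch of one thread of the stress test: add an entry, run a subtree
// search, delete the entry, and loop 10 000 times.
public class ConcurrentStressThread implements Runnable {
    private final int threadNumber;

    public ConcurrentStressThread(int threadNumber) {
        this.threadNumber = threadNumber;
    }

    @Override
    public void run() {
        String dn = "cn=test" + threadNumber + ",ou=system";
        LdapConnection connection = new LdapNetworkConnection("localhost", 10389);

        try {
            connection.bind("uid=admin,ou=system", "secret");

            for (int i = 0; i < 10000; i++) {
                connection.add(new DefaultEntry(dn,
                    "objectClass: person",
                    "cn: test" + threadNumber,
                    "sn: test" + threadNumber));

                // Subtree search crossing the entries the other threads are
                // adding and deleting at the same time
                connection.search("ou=system", "(cn=test*)", SearchScope.SUBTREE, "*").close();

                connection.delete(dn);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                connection.close();
            } catch (Exception e) {
                // ignore on shutdown
            }
        }
    }
}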

This test was badly failing on trunk (the more threads we have, the more errors we get), and is now running just fine.

(The test was failing because, while doing the search, we were fetching entries from the backend while some deletions were being done at the same time.)

One more thing: performance. The branch is actually faster than the current trunk. The searches were conducted using the SearchPerfIT test.

Here are the performances I get when using the embedded LDAP server :

1.5.0 (FTR) :
- Object   Search ->  32 993/s
- OneLevel Search -> 156 935/s
- SubLevel Search -> 137 225/s

Trunk :
- Object   Search ->  57 191/s
- OneLevel Search ->  76 250/s
- SubLevel Search ->  77 200/s

MVBT branch :
- Object   Search ->  76 540/s (+33%/trunk)
- OneLevel Search -> 109 985/s (+44%/trunk)
- SubLevel Search -> 134 750/s (+74%/trunk)

We get back one entry for the Object search, 5 for the OneLevel search and 10 for the SubLevel search. What we measure is the number of entries we get back per second.

The 1.5.0 version is actually faster for OneLevel and SubLevel searches because we were storing the DN in the entry - rebuilding the DN is a costly operation.

The gain is even higher when we use a complex filter, due to the numerous and spurious fetches of the same entry in trunk.

The same tests, but over the network:
MVBT branch :
Object :  2 564/s
One    : 17 960/s
Sub    : 23 050/s

Trunk :
Object :  2 567/s
One    : 12 555/s
Sub    : 21 240/s

Here, the gain is less significant, mainly because we have the client and the server on the same machine, and the message encoding/decoding overshadows the gain in the backend. But still, we see some gap in the OneLevel and SubLevel searches (between 8% and 43%).

At this point, I think we have a working server, solving a critical issue we have been fighting for more than a year now, and it's actually faster than the current trunk. It also solves some other issues.

Before continuing with the Mavibot backend implementation - which will bring some major performance boosts, AFAICT - I would suggest we promote the current mvbt branch to trunk, and cut a 2.0.0-M8 release asap. I do think that it's a better, more stable, and faster working base for the upcoming versions and, more importantly, for the 2.0 GA.

We should also continue the ongoing work on the other solutions, as they offer some extra advantages, but we may get them in a 2.1 or a later version.

So, wdyt?

--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com
