Thanks a lot for the detailed and insightful explanation. I can't absorb it
all because I'm not familiar with the code, but it will serve as very good
material whenever I need to look into the LDAP internals. The details make me
believe it's very necessary to have a strong, mature, industry-proven backend
for the LDAP server, because the LDAP side is already complex enough. We
can't combine the LDAP logic with the storage engine; they need to be
separated, and developed and tested separately. Mavibot looks like it is
going in this direction, which sounds good to me. What concerns me is that,
as we lack the resources to develop it, it may still take some time to become
mature and robust. But if we leverage an existing engine, we can focus on the
LDAP side, work on some advanced features, move on a little faster and have
releases like 2.x, 3.x and so on. SQLite is C, yes, but it is supported on
many platforms and Java can use it via JNI; it's a library, so it can be
embedded in an application. You may dislike JNI, but only a few APIs need to
be wrapped for this usage, and there are already wonderful wrappers for Java.
As with snappy-java, the JNI layer along with the native library can be
bundled within a jar file and distributed exactly like a Maven module. One
thing I'm not sure about is how well LDAP entries fit the SQL table model,
but I guess there could be quite some investigation in this direction (a
rough sketch of one possible mapping is below). The benefits would be:
saving us a lot of development and debugging time, robustness and high
performance, transaction support, and easy querying. Some thoughts in case
any of them help. Thanks.
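
To make the table-model question concrete, here is one hypothetical mapping,
purely for illustration: an entry table keyed by UUID plus an attribute-value
table that SQL indexes can be built on. The table and column names are made
up, and the snippet assumes the org.xerial sqlite-jdbc wrapper is on the
classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class LdapSqlSketch {
        public static void main(String[] args) throws Exception {
            // sqlite-jdbc (org.xerial) bundles the native library in the
            // jar, the snappy-java style of distribution mentioned above.
            try (Connection con =
                     DriverManager.getConnection("jdbc:sqlite:partition.db");
                 Statement st = con.createStatement()) {
                // One row per entry, keyed by its UUID; the parent column
                // carries the hierarchy, as the RDN index does.
                st.execute("CREATE TABLE IF NOT EXISTS entry ("
                    + "id TEXT PRIMARY KEY, "    // entry UUID
                    + "parent_id TEXT, "         // parent entry UUID
                    + "rdn TEXT)");              // e.g. 'cn=jSmith'
                // One row per attribute value; a SQL index on (attr, val)
                // plays the role of an LDAP attribute index.
                st.execute("CREATE TABLE IF NOT EXISTS attr_value ("
                    + "entry_id TEXT REFERENCES entry(id), "
                    + "attr TEXT, "              // e.g. 'objectClass'
                    + "val TEXT)");              // e.g. 'person'
                st.execute("CREATE INDEX IF NOT EXISTS idx_attr "
                    + "ON attr_value(attr, val)");
            }
        }
    }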

Regards,
Kai

-----Original Message-----
From: Emmanuel Lécharny [mailto:[email protected]] 
Sent: Monday, January 25, 2016 1:32 AM
To: Apache Directory Developers List <[email protected]>
Subject: Re: JdbmPartition repair

On 24/01/16 16:47, Zheng, Kai wrote:
> Thanks Emmanuel for the sync and sharing. The approach looks pretty good,
> and is often seen in many mature products. Is the repair process triggered
> when corruption is found while the server is running, or when restarting
> with a specific option? Or both? If the repair logic is not easy to
> integrate, maybe a repair tool like the one Kiran worked out would be good
> to use? Or, during startup, check the data/indexes and, if they are not
> fine, have the Java process launch the repair tool to do the fixing? Just
> some thoughts, in case some are useful.

The corruption happens in some rare cases, and it's mostly due to concurrent
updates. Let me explain what happens in detail, and sorry if it's a bit
lengthy, it has to be.

We store entries in what we call the MasterTable. Entries are serialized, and
each one of them has an associated ID (actually, we use random UUIDs). So the
master table contains tuples of <UUID, Entry>.
Each index refers to this MasterTable using the entry UUID. Typically, let's
say an entry has an ObjectClass person; then the ObjectClass index will have
a tuple <ObjectClass, Set<UUID>> where the set contains the UUIDs of all the
entries that have the 'person' ObjectClass.
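
In Java terms, the shapes involved look roughly like this (a minimal
in-memory sketch; the real tables are persistent B-trees, and the class and
field names here are made up):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.UUID;

    public class MasterTableSketch {
        // Hypothetical stand-in for a serialized entry.
        static class Entry { /* attributes... */ }

        // The MasterTable: <UUID, Entry> tuples.
        final Map<UUID, Entry> masterTable = new HashMap<>();

        // An attribute index, e.g. the ObjectClass index:
        // <value, Set<UUID>> tuples pointing back into the MasterTable.
        final Map<String, Set<UUID>> objectClassIndex = new HashMap<>();

        void indexObjectClass(String objectClass, UUID entryId) {
            objectClassIndex
                .computeIfAbsent(objectClass, k -> new HashSet<>())
                .add(entryId);
        }
    }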

We also have one special index, the Rdn index. This one is more complex,
because it is used for two things: referring to an entry from an RDN, and
also keeping track of the hierarchy. If we have an entry whose DN is
ou=users,dc=example,dc=com, where dc=example,dc=com is the partition's root,
then the RDN index will contain two tuples for the full DN: one for the
entry itself, and one for the suffix. Actually, we don't store tuples like
<Rdn, ID>, but a more complex structure, the ParentIdAndRdn.
The reason is that we may have many entries using the same RDN. For instance:

entry 1 : cn=jSmith,ou=users,dc=example,dc=com
entry 2 : cn=jSmith,ou=administrators,dc=example,dc=com

Whether this jSmith is one person or two is irrelevant. The point is that we
can't associate the RDN cn=jSmith with one single entry, so what we store is
a tuple <entryId1, <cn=jSmith, parent entry ID>> for the first entry, and
<entryId2, <cn=jSmith, parent entry ID>> for the second entry. The difference
is that we use the parent's ID of each entry as a discriminant. For those two
entries, and their respective parents, we will have the following tuples
stored in the reverse RDN index:

<root ID, <null, dc=example,dc=com>>    : for the root entry 'dc=example,dc=com'
<users ID, <rootId, ou=users>>          : for the 'ou=users,dc=example,dc=com' entry
<admin ID, <rootId, ou=administrators>> : for the 'ou=administrators,dc=example,dc=com' entry
<entry1 ID, <users ID, cn=jSmith>>      : for the 'cn=jSmith,ou=users,dc=example,dc=com' entry
<entry2 ID, <admin ID, cn=jSmith>>      : for the 'cn=jSmith,ou=administrators,dc=example,dc=com' entry

We also have a forward index, which contains <ParentIdAndRdn, ID> tuples, like:

<<null, dc=example,dc=com>, root ID>
<<rootId, ou=users>, users ID>
<<rootId, ou=administrators>, admin ID>
<<users ID, cn=jSmith>, entry1 ID>
<<admin ID, cn=jSmith>, entry2 ID>

That is a second index we need to update for each added entry.
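
A minimal sketch of the composite key and the two indexes (the class and
field names are approximations for illustration, not the real ApacheDS
types):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Objects;
    import java.util.UUID;

    // The composite key stored in the RDN index: the entry's RDN plus the
    // parent's ID, which discriminates between entries sharing an RDN.
    class ParentIdAndRdn {
        final UUID parentId;   // null for the partition root
        final String rdn;      // e.g. "cn=jSmith"

        ParentIdAndRdn(UUID parentId, String rdn) {
            this.parentId = parentId;
            this.rdn = rdn;
        }

        @Override public boolean equals(Object o) {
            return o instanceof ParentIdAndRdn
                && Objects.equals(parentId, ((ParentIdAndRdn) o).parentId)
                && rdn.equals(((ParentIdAndRdn) o).rdn);
        }

        @Override public int hashCode() {
            return Objects.hash(parentId, rdn);
        }
    }

    class RdnIndexSketch {
        // reverse index: entry ID -> <parent ID, RDN>
        final Map<UUID, ParentIdAndRdn> reverse = new HashMap<>();
        // forward index: <parent ID, RDN> -> entry ID
        final Map<ParentIdAndRdn, UUID> forward = new HashMap<>();
    }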

Why do we have the reverse index? You have already realized that we don't
store the DN in the entry, nor do we store the full DN in the RDN index.
That means we have to recreate this DN using the RDN index. For entry 2,
assuming we have its ID, we must use the reverse index to do so. With the
entry2 ID, we pull the <entry2 ID, <admin ID, cn=jSmith>> tuple, which gives
us the entry's parent, the admin ID. We can now fetch the admin's parent from
the <admin ID, <rootId, ou=administrators>> tuple with another fetch from the
index, and finally, with the <root ID, <null, dc=example,dc=com>> tuple, we
get the last RDN. We can now reconstruct the full DN.
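
In code, that walk up the reverse index looks roughly like this (a helper
that could sit next to the RdnIndexSketch above; a simplification, not the
real implementation):

    // Rebuild an entry's DN by walking the reverse RDN index up to the
    // root: one index access per RDN.
    static String buildDn(RdnIndexSketch index, UUID entryId) {
        StringBuilder dn = new StringBuilder();
        UUID current = entryId;
        while (current != null) {
            ParentIdAndRdn node = index.reverse.get(current);
            if (dn.length() > 0) {
                dn.append(',');
            }
            dn.append(node.rdn);      // "cn=jSmith", then "ou=administrators"...
            current = node.parentId;  // null once the root has been appended
        }
        return dn.toString();
    }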

Why do we have the forward index then? Because we need to be able to fetch
an entry from a DN when we are processing a search (because the BaseDN is
provided). We process such a search by reading the DN, splitting it into
RDNs, and for each RDN, from right to left, we search for the entry ID. Let's
say we search for cn=jSmith,ou=administrators,dc=example,dc=com: we first
look in the RDN forward index with the <null, dc=example,dc=com> key, which
gives us the root ID; then we look for <root ID, ou=administrators> in the
index, and get back the admin ID. Last but not least, we look for <admin ID,
cn=jSmith>, and we get what we were looking for: the entry2 ID.
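
The corresponding descent, again on the sketch above, with the RDNs already
split and ordered from right to left (error handling omitted):

    // Resolve a DN to an entry ID by walking the forward RDN index,
    // suffix first, leaf RDN last.
    static UUID lookup(RdnIndexSketch index, String... rdnsRightToLeft) {
        UUID current = null;   // the partition root has a null parent
        for (String rdn : rdnsRightToLeft) {
            current = index.forward.get(new ParentIdAndRdn(current, rdn));
            if (current == null) {
                return null;   // no such entry
            }
        }
        return current;
    }

For the example above, lookup(index, "dc=example,dc=com",
"ou=administrators", "cn=jSmith") returns the entry2 ID after exactly three
forward index accesses, matching the count below.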

As one can see, there are 3 accesses to the Rdn index every time we do a
search with such a DN, and 3 more accesses when we construct the DN for the
found entry (if the DN contained 5 RDNs, we would do 5 accesses plus 5 more).

That also means we update the RdnIndex many times, times 2 (as we have a
reverse and a forward index). Many times? You may object that we only have
to add the RDN, as the parent's RDNs always exist, and you would be right.
Except that we also update the counters in all the parents. The idea is that
each entry keeps track of the number of its direct children and the count of
all its descendants. This is critical for searches that involve a SUBTREE or
ONELEVEL scope, as it helps keep the search on the best possible index,
depending on the number of candidates we have.
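
The counter maintenance is what multiplies the writes: an add touches every
ancestor, along these lines (a hypothetical sketch; in reality the counters
are persisted with the RDN index entries themselves):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;

    class SubtreeCounters {
        final Map<UUID, ParentIdAndRdn> reverse = new HashMap<>();
        final Map<UUID, Integer> nbChildren = new HashMap<>();
        final Map<UUID, Integer> nbDescendants = new HashMap<>();

        // Adding one entry bumps a counter in every ancestor, which is why
        // an add ends up costing DN.size() * 2 RDN index updates overall.
        void onAdd(UUID parentId) {
            nbChildren.merge(parentId, 1, Integer::sum);
            UUID current = parentId;
            while (current != null) {
                nbDescendants.merge(current, 1, Integer::sum);
                ParentIdAndRdn node = reverse.get(current);
                current = (node == null) ? null : node.parentId;
            }
        }
    }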

So bottom line, every update operation (add, delete, move) will update the RDN 
index DN.size() * 2 times (the Rename operation will just update the Rdn index 
twice).


That being said, all those operations have to be done as an atomic operation:
i.e. either all the updates are done, or none of them is. If we imagine that
a crash occurs in the middle of the RDN index update, we will never be able
to reconstruct the full DN, as the RdnIndex will not be complete.

We also have another constraint: we can't allow concurrent updates, as each
update *will* touch the same Rdn index, and each has to see a fully
consistent database. We serialize the updates for this reason (i.e. a write
can only start once any other write is done).

Let's go a bit deeper: we can't do a search while a write is being processed
either, because during a write we are reshuffling the indexes, and the
searches *are* using those same indexes.

To be clear, we prioritize reads over writes, but when a write arrives, we
put it on hold until all the pending reads are completed, and all the newer
reads or writes are also put on hold. Then we execute the write, and then we
can process the operations that were on hold.
Reads can be processed concurrently, though.
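
This policy is close to what a fair read-write lock provides; a minimal
sketch with java.util.concurrent, as an analogy rather than the server's
actual locking code:

    import java.util.concurrent.locks.ReentrantReadWriteLock;
    import java.util.function.Supplier;

    class PartitionLockSketch {
        // Fair mode: a waiting writer blocks newer readers, so it only
        // has to wait for the reads that were already in flight.
        private final ReentrantReadWriteLock lock =
            new ReentrantReadWriteLock(true);

        <T> T read(Supplier<T> search) {
            lock.readLock().lock();      // many reads can run concurrently
            try { return search.get(); }
            finally { lock.readLock().unlock(); }
        }

        void write(Runnable update) {
            lock.writeLock().lock();     // writes are fully serialized
            try { update.run(); }
            finally { lock.writeLock().unlock(); }
        }
    }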


Now, we also update many other indexes: presence, alias, oneAlias,
objectClass, entryCSN, entryUUID. Those are the mandatory indexes. We also
have all the user indexes (generally a few, but there can be tens of them).
A crash or brutal shutdown in the middle of an update on any of those
indexes can leave the database in a dire state.


Bottom line, Mavibot being an MVCC database, we can't corrupt it even if we
brutally stop the server, because the intermediate state will never be
visible. We always see a fully consistent state, whatever happens.
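
The MVCC idea can be sketched in a few lines: a writer builds a new version
on the side and publishes it with a single atomic swap, so readers only ever
see a complete version. This is a conceptual illustration, not Mavibot's
actual design:

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.UnaryOperator;

    class MvccSketch<DB> {
        // Readers always dereference a complete, immutable version.
        private final AtomicReference<DB> current;

        MvccSketch(DB initial) {
            current = new AtomicReference<>(initial);
        }

        DB snapshot() {
            return current.get();  // a consistent view, whatever writers do
        }

        // Writes are assumed serialized, as in the server. A crash before
        // set() leaves the published version untouched, so no intermediate
        // state is ever visible.
        void commit(UnaryOperator<DB> update) {
            DB next = update.apply(current.get());
            current.set(next);     // the single atomic publish point
        }
    }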

Now, back to your question: can we detect a data corruption at startup? Not
easily. It would be costly, as we would have to read the whole database and
check all the indexes. It is better to let the user run the repair tool as a
workaround.
>
> I'm not very sure about rewriting JDBM, though I know there are plenty of
> reasons to do so, as with most software rewrites. But if we have to start
> anew, implementing something like a B+ tree that needs transaction support,
> I'm wondering if we could do it by leveraging an already industry-proven
> backend, because developing such a backend may take a long time and quite
> some resources. I'm wondering whether SQLite could serve the purpose well
> or not, or how it could be wrapped or adapted for the usage here. Again,
> just a quick thought, in case it's somewhat useful.

We have thought about our options. SQLite is a C based database. It does not
fit us, because we want to offer an embeddable LDAP server, so we need a
Java based backend.

To be frank, the very first time we discussed getting rid of JDBM was back
in 2006 (yes, 10 years ago...) during the Austin Apache conference. Back
then, CouchDB was *exactly* what we needed, except that it was Erlang based.
Then we got stuck in more urgent fixes. In 2009, during the LDAP conference,
I had proof that our implementation was borked, and I added very heavy
synchronisation to allow the LDAP server to accept concurrent - sort of -
reads and writes. In 2011, we started modifying JDBM so that it would
include some MVCC mechanisms. We used a write-ahead log solution, which
proved to be very brittle.
Typically, if the API user forgot to close a cursor, the server would end up
stuck in a deadlock. Finally, we decided to go with Mavibot back in 2012, as
it was, first, the right solution for us, and second, it would also be much
faster, as we would not have to deal with locks when reading.

The first Mavibot version works. We tested it in the server. The second one,
the one we are currently working on, has to support transactions.
The big difference is that with transaction support, we can get rid of the
locks we have put in the server, making the updates faster.

All in all, it's just a matter of how much time we can dedicate to working
on Mavibot.

