On 24/01/16 16:47, Zheng, Kai wrote:
> Thanks Emmanuel for the sync and sharing. The approach looks pretty good, and 
> is often seen in many mature products. The repair process is triggered when 
> corruption is found while the server is running, or while restarting with a 
> specific option? Or the both? If the repairing stuff is not easy to 
> integrate, maybe a repair tool like the one Kiran worked out sounds good to 
> use? Or during startup time checking the data/index, if not fine then Java 
> system launching the tool process for the fixing? Just some thoughts, in case 
> some useful.

The corruption happens in some rare cases, and it's mostly due to
concurrent updates. Let me explain what happens in detail, and sorry if
it's a bit lengthy, it has to be.

We store entries in what we call the MasterTable. Entries are
serialized, and each one of them has an associated ID (actually, we are
using random UUIDs). So the master table contains tuples of <UUID,
Entry>.
Each index refers to this MasterTable using the entry UUID. Typically,
let's say an entry has an ObjectClass person; then the ObjectClass index
will have a tuple <ObjectClass, Set<UUID>> where the set contains the
UUIDs of all entries that have the 'person' ObjectClass.
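
To make this concrete, here is a minimal Java sketch of those two
structures (the class and field names are mine, not the actual ApacheDS
ones, and the entries are reduced to opaque byte arrays):

    import java.util.*;

    // Illustrative sketch of the master table and one index.
    class MasterTableSketch {
        // The master table: tuples of <UUID, Entry> (entries serialized)
        Map<UUID, byte[]> masterTable = new HashMap<>();

        // The ObjectClass index: tuples of <ObjectClass, Set<UUID>>
        Map<String, Set<UUID>> objectClassIndex = new HashMap<>();

        void add(UUID id, byte[] serializedEntry, String... objectClasses) {
            masterTable.put(id, serializedEntry);

            for (String oc : objectClasses) {
                // every index refers back to the entry by its UUID
                objectClassIndex.computeIfAbsent(oc, k -> new HashSet<>()).add(id);
            }
        }
    }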

We also have one special index, the Rdn index. This one is more complex,
because it is used for two things: referring to an entry from an RDN, and
also keeping track of the hierarchy. If we have an entry whose DN is
ou=users,dc=example,dc=com, where dc=example,dc=com is the partition's
root, then the RDN index will contain two tuples for the full DN: one
for the entry itself, and one for the suffix. Actually, we don't store
tuples like <Rdn, ID>, but a more complex structure, the ParentIdAndRdn.
The reason is that we may have many entries using the same RDN. For
instance:

entry 1 : cn=jSmith,ou=users,dc=example,dc=com
entry 2 : cn=jSmith,ou=administrators,dc=example,dc=com

Whether this jSmith is one person or two is irrelevant. The point is that
we can't associate the RDN cn=jSmith with one single entry, so what we
store is a tuple <entryId1, <cn=jSmith, parent entry ID>> for the first
entry, and <entryId2, <cn=jSmith, parent entry ID>> for the second
entry. The difference is that we use the parent's ID of each entry as a
discriminant. For those two entries, and their respective parents, we
will have the following tuples stored in the reverse RDN index:

<root ID, <null, dc=example,dc=com>>     : for the root entry 'dc=example,dc=com'
<users ID, <root ID, ou=users>>          : for the 'ou=users,dc=example,dc=com' entry
<admin ID, <root ID, ou=administrators>> : for the 'ou=administrators,dc=example,dc=com' entry
<entry1 ID, <users ID, cn=jSmith>>       : for the 'cn=jSmith,ou=users,dc=example,dc=com' entry
<entry2 ID, <admin ID, cn=jSmith>>       : for the 'cn=jSmith,ou=administrators,dc=example,dc=com' entry

We also have a forward index, which contains <ParentIdAndRdn, ID>
tuples, like:

<<null, dc=example,dc=com>, root ID>
<<root ID, ou=users>, users ID>
<<root ID, ou=administrators>, admin ID>
<<users ID, cn=jSmith>, entry1 ID>
<<admin ID, cn=jSmith>, entry2 ID>

That is a second index we need to update for each added entry.
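
Here is a rough Java sketch of the ParentIdAndRdn key and the two RDN
indexes, just to mirror the tuples above (the real ApacheDS classes are
different; I'm using a Java record for brevity):

    import java.util.*;

    // The composite key: the parent's ID plus the entry's own RDN.
    record ParentIdAndRdn(UUID parentId, String rdn) {}

    // Illustrative sketch of the two RDN indexes, not the real classes.
    class RdnIndexSketch {
        // Reverse index: <entry ID, <parent ID, RDN>>
        Map<UUID, ParentIdAndRdn> reverse = new HashMap<>();

        // Forward index: <<parent ID, RDN>, entry ID>
        Map<ParentIdAndRdn, UUID> forward = new HashMap<>();

        void add(UUID entryId, UUID parentId, String rdn) {
            ParentIdAndRdn key = new ParentIdAndRdn(parentId, rdn);
            reverse.put(entryId, key);
            forward.put(key, entryId);
        }
    }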

Why do we have the reverse index? You have already realized that we don't
store the DN in the entry, nor do we store the full DN in the RDN index.
That means we have to recreate this DN using the RDN index. For entry 2,
assuming we have its ID, we must use the reverse index to do so. With
the entry2 ID, we pull the <entry2 ID, <admin ID, cn=jSmith>> tuple,
which gives us the entry's parent, admin ID. We can now fetch the admin's
parent from the <admin ID, <root ID, ou=administrators>> tuple with
another fetch from the index, and finally, with the <root ID, <null,
dc=example,dc=com>> tuple, we get the last RDNs. We can now reconstruct
the full DN.
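
In code, that reverse walk could look like this (a sketch built on the
RdnIndexSketch above, not the actual implementation):

    // Rebuild a DN from an entry ID by walking the reverse index
    // up to the partition root (which has a null parent ID).
    String buildDn(UUID entryId, Map<UUID, ParentIdAndRdn> reverse) {
        StringBuilder dn = new StringBuilder();
        UUID current = entryId;

        while (current != null) {
            ParentIdAndRdn node = reverse.get(current);

            if (dn.length() > 0) {
                dn.append(',');
            }

            dn.append(node.rdn());
            current = node.parentId();  // null once we reach the root
        }

        // e.g. "cn=jSmith,ou=administrators,dc=example,dc=com"
        return dn.toString();
    }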

Why do we have the forward index then? Because we need to be able to
fetch an entry from a DN when we are processing a search (because the
BaseDN is provided). We process such a search by reading the DN,
splitting it into RDNs, and for each RDN, from right to left, we search
for the entry ID. Let's say we search for
cn=jSmith,ou=administrators,dc=example,dc=com: we first look in the RDN
forward index with the <null, dc=example,dc=com> key, which gives us the
root ID; then we look for <root ID, ou=administrators> in the index, and
get back the admin ID. Last but not least, we look for <admin ID,
cn=jSmith> and we get what we were looking for, the entry2 ID.
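
The forward lookup is the mirror image of the reverse walk (again a
sketch on top of the structures above):

    // Resolve a DN to an entry ID with the forward index, walking
    // the RDNs from right to left (suffix first).
    UUID lookup(String[] rdnsRightToLeft, Map<ParentIdAndRdn, UUID> forward) {
        UUID current = null;  // the suffix RDN has a null parent

        for (String rdn : rdnsRightToLeft) {
            current = forward.get(new ParentIdAndRdn(current, rdn));

            if (current == null) {
                return null;  // no such entry
            }
        }

        return current;
    }

    // e.g. lookup(new String[] { "dc=example,dc=com",
    //     "ou=administrators", "cn=jSmith" }, forward) -> entry2 ID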

As one can see, there are 3 accesses to the Rdn index every time we do a
search, and 3 more accesses when we construct the DN for the found entry
(for such a DN; if the DN size is 5, then we will do 5 accesses plus 5
accesses).

That also means we update the RdnIndex many times, times 2 (as we have a
reverse and a forward index). Many times? You may object that we only have
to add the RDN, as the parent's RDN always exists, and you would be
right. Except that we also update the count in all the parents. The idea
is that each entry keeps track of the number of its direct children and
the count of all their descendants. This is critical for searches that
involve SUBTREE or ONELEVEL, as it may help keep the search using the
best possible index, depending on the number of candidates we have.

So bottom line, every update operation (add, delete, move) will update
the RDN index DN.size() * 2 times (the Rename operation will just update
the Rdn index twice).
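
A hedged sketch of that count maintenance on an add, assuming the counts
live in maps keyed by entry ID (my assumption for the illustration, not
necessarily how ApacheDS stores them):

    // On an add: the direct parent gains one child, and every
    // ancestor up to the root gains one descendant.
    void incrementCounts(UUID parentId,
                         Map<UUID, ParentIdAndRdn> reverse,
                         Map<UUID, Integer> childCount,
                         Map<UUID, Integer> descendantCount) {
        childCount.merge(parentId, 1, Integer::sum);

        UUID current = parentId;

        while (current != null) {
            descendantCount.merge(current, 1, Integer::sum);
            current = reverse.get(current).parentId();  // walk up
        }
    }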


That being said, all those operations have to be done as an atomic
operation: i.e., either all the updates are done, or none is. If we
imagine that a crash occurs in the middle of the RDN index update, we
will never be able to reconstruct the full DN, as the RdnIndex will not
be complete.

We also have another constraint: we can't allow concurrent updates to
be done, as each update *will* update the same Rdn index, and they have
to see a fully consistent database. We serialize the updates for this
reason (i.e., you can only do a write once any other write is done).

Let's go a bit deeper: we can't do a search while a write is being
processed either, because during a write we are reshuffling the indexes,
and the searches *are* using the same indexes.

To be clear, we prioritize the reads over the writes, but when a write
arrives, we put it on hold until all the pending reads are completed,
and all the newer reads or writes are also put on hold. Then we execute
the write, and then we can process the operations that were on hold.
Reads can be processed concurrently, though.
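
That scheduling is close to what a *fair*
java.util.concurrent.locks.ReentrantReadWriteLock provides: concurrent
readers, and a waiting writer that blocks newer readers until it has
run. A sketch of the pattern, not the actual ApacheDS code:

    import java.util.concurrent.locks.ReentrantReadWriteLock;
    import java.util.function.Supplier;

    class PartitionLock {
        // fair mode: a queued writer blocks readers that arrive after it
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

        <T> T read(Supplier<T> search) {
            lock.readLock().lock();
            try {
                return search.get();  // many reads may run concurrently
            } finally {
                lock.readLock().unlock();
            }
        }

        void write(Runnable update) {
            lock.writeLock().lock();
            try {
                update.run();  // exclusive: no reads, no other writes
            } finally {
                lock.writeLock().unlock();
            }
        }
    }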


Now, we also update many other indexes: presence, alias, oneAlias,
objectClass, entryCSN, entryUUID. That's for the mandatory indexes. We
also have all the user indexes (generally a few, but there can be tens
of them). A crash or brutal shutdown in the middle of an update on any
of those indexes can leave the database in a dire state.


Bottom line, Mavibot being an MVCC database, we can't corrupt it even if
we brutally stop the server, because the intermediate state will
never be visible. We always see a fully consistent state, whatever happens.

Now, back to your question: can we detect a data corruption at startup?
Not easily. It would be costly, as we would have to read the whole
database and check all the indexes. Better to let the user run it as a
workaround.
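
For what it's worth, one way such a check could look, for the RDN
indexes alone, is to verify that the forward and reverse indexes are
mirror images of each other. Purely a sketch of the idea (reusing the
maps from the sketches above), not an existing tool:

    // Every reverse tuple <id, key> must have the matching forward
    // tuple <key, id>, and vice versa.
    boolean rdnIndexesConsistent(Map<UUID, ParentIdAndRdn> reverse,
                                 Map<ParentIdAndRdn, UUID> forward) {
        if (reverse.size() != forward.size()) {
            return false;
        }

        for (Map.Entry<UUID, ParentIdAndRdn> e : reverse.entrySet()) {
            if (!e.getKey().equals(forward.get(e.getValue()))) {
                return false;
            }
        }

        return true;
    }
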
>
> I'm not very sure to rewrite JDBM though I know there're plenty of reasons to 
> do so, as most software rewritings do. But if we have start with new, 
> implementing something like B+ tree, that needs transaction support, I'm 
> wondering if we could do it by leveraging already industry proven backend, 
> because developing such backend may take long time effort and pretty much of 
> resources. I'm wondering if Sqlite could serve the purpose well or not, or 
> how it can be wrapped or adapted for the usage here. Again just a quick 
> thought and in case somewhat useful.

We have thought about our options. SQLite is a C-based database. It
does not fit us, because we want to offer an embeddable LDAP server, so
we need a Java-based backend.

To be frank, the very first time we discussed getting rid of JDBM
was back in 2006 (yes, 10 years ago...) during the Austin Apache
conference. Back then, CouchDB was *exactly* what we needed, except
that it was Erlang based. Then we got stuck in more urgent fixes. In
2009, during the LDAP conference, I had proof that our implementation
was borked, and I added very heavy synchronisation to allow the LDAP
server to accept concurrent - sort of - reads and writes. In 2011, we
started modifying JDBM so that it would include some MVCC mechanisms. We
used a write-ahead log solution, which proved to be very brittle.
Typically, if the API user forgot to close a cursor, the server would
end up stuck in a deadlock. Finally, we decided to go with Mavibot back
in 2012, as it was, first, the right solution for us, and second, it
would also be much faster, as we would not have to deal with locks when
reading.

The first Mavibot version works. We tested it in the server. The second
one, the one we are currently working on, has to support transactions.
The big difference is that with transaction support, we can get rid of
the locks we have put in the server, making the updates faster.

All in all, it's just a matter of how much time we can dedicate to
working on Mavibot.

