Hi guys, last night I started to play with the new API, using it in the server. I started by fixing the compilation failures (100 errors, more or less), then I ran the tests on xdbm, which is the place where we process searches and updates.
Not a surprise that this is impacted a lot... <warning> long and technical mail...</warning>

We have numerous things that are changing in the way we process indexes. Indexes are <Key, Value> tables, where the key is unique. We use a lot of default indexes in ApacheDS, some of them having a reverse index (ie the value becomes the key) :

Presence           : ObjectIdentifier -> Entry UUID
RDN                : ParentIdAndRdn -> Entry UUID
RDN reverse        : Entry UUID -> ParentIdAndRdn
Alias              : DN -> Entry UUID
Alias reverse      : Entry UUID -> DN
OneAlias           : Entry UUID -> Entry UUID
SubAlias           : Entry UUID -> Entry UUID
ObjectClass        : ObjectIdentifier -> Entry UUID
EntryCSN           : CSN -> Entry UUID
AdministrativeRole : ObjectIdentifier -> Entry UUID

and of course :

Master             : Entry UUID -> Entry

(where 'ObjectIdentifier' is an OID, '2.5.4.3' for instance)

We also have as many user indexes as needed, where the key can be whatever the user defines, as long as the AttributeType defines an Equality matchingRule, ie one of : bitStringMatch, booleanMatch, caseExactIA5Match, caseExactMatch, caseIgnoreIA5Match, caseIgnoreListMatch, caseIgnoreMatch, directoryStringFirstComponentMatch, distinguishedNameMatch, generalizedTimeMatch, integerFirstComponentMatch, integerMatch, keywordMatch, numericStringMatch, objectIdentifierFirstComponentMatch, objectIdentifierMatch, octetStringMatch, presentationAddressMatch, protocolInformationMatch, telephoneNumberMatch, uniqueMemberMatch and wordMatch.

Most of the indexes will use a String for the key, but it can also be an integer, or a byte[], or anything needed to represent the AttributeType syntax.

Ok, now let's see what kind of problems we have to face and deal with. First of all, we said that the key must be unique, and in some cases many representations are acceptable for the same object in LDAP : typically, 'cn', 'CommonName' and '2.5.4.3' are all representing the same thing : the commonName attributeType.
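As a toy illustration of why those representations must collapse to a single key, here is a small sketch. The alias map and the class are made up for the example; the real server of course resolves this through its schema registries, not a hard-coded map.

```java
import java.util.Map;

// Toy sketch : every alias of an AttributeType collapses to its OID, so
// 'cn', 'CommonName' and '2.5.4.3' all land under one unique index key.
// The map and method are illustrative, not the ApacheDS schema registry.
public class OidNormalizer
{
    private static final Map<String, String> ALIASES = Map.of(
        "cn", "2.5.4.3",
        "commonname", "2.5.4.3",
        "2.5.4.3", "2.5.4.3" );

    public static String toOid( String attributeName )
    {
        String key = attributeName.trim().toLowerCase();
        return ALIASES.getOrDefault( key, key );
    }

    public static void main( String[] args )
    {
        System.out.println( toOid( "CommonName" ) ); // 2.5.4.3
        System.out.println( toOid( "cn" ) );         // 2.5.4.3
    }
}
```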
At this point, we must normalize all those representations so that the key can be retrieved and be unique.

Another use case is for values that are equivalent according to the MatchingRule. For instance a 'cn' value like 'ACME company' is equivalent to ' aCmE ComPAnY ' (ie, case is insensitive, and spurious spaces are removed). This makes it a bit problematic, because we have to guarantee that the key is unique, which means a transformation has to be done before injecting the key, and also when retrieving the key from the index.

Last, not least, LDAP allows the user to send search requests using substrings for some AttributeTypes, and those substrings must also be processed accordingly (normalized).

We can sum this up with the following transformations that have to be done :

User attribute --(normalization)--> normalized Key
User search    --(normalization)--> normalized Filter

Now, we have two possible strategies :

1) First (and this is what we currently do in the trunk), we can normalize all the values beforehand. That works splendidly, except when String Preparation comes into the picture. Although in trunk we do normalize the keys by applying the PrepareString transformation, so we are safe, except that we don't respect the String Preparation RFC for substrings, so in some use cases we may not retrieve a value when using a Substring filter. We can get this fixed by not applying the String Preparation beforehand, but only when matching the Key with the filter. We then store a normalized form of the Key in the index (Value --(normalization)--> Key) and retrieving the key from the index is all about doing :

index  --> key --(String Preparation)--> V1
filter --(normalization)--> normalized filter --(String Preparation)--> V2

and comparing V1 with V2 gives the result.

The problem with this approach is that the index implementation has no clue about the way the values must be compared (especially when the value is not a String, but a DN).
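To make that lookup path concrete, here is a rough sketch of strategy 1. The normalize() and prepare() bodies are crude stand-ins I made up : the real code would use the schema's Equality Normalizer and the RFC 4518 String Preparation steps, which are far more involved.

```java
// Sketch of strategy 1 : the index stores normalized keys, and String
// Preparation is applied on both sides only at match time. normalize()
// and prepare() are stand-ins, not the real ApacheDS implementations.
public class Strategy1Match
{
    // stand-in for a caseIgnoreMatch-like Equality normalizer
    static String normalize( String value )
    {
        return value.trim().toLowerCase().replaceAll( "\\s+", " " );
    }

    // stand-in for RFC 4518 String Preparation (here : just framing spaces)
    static String prepare( String value )
    {
        return " " + value + " ";
    }

    static boolean matches( String storedKey, String filterValue )
    {
        String v1 = prepare( storedKey );                // index --> key --(prep)--> V1
        String v2 = prepare( normalize( filterValue ) ); // filter --(norm)--(prep)--> V2
        return v1.equals( v2 );
    }

    public static void main( String[] args )
    {
        // 'ACME company' was stored under the normalized key 'acme company'
        System.out.println( matches( "acme company", " aCmE  ComPAnY " ) ); // true
    }
}
```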
That means we must provide a Comparator to the index. And this is what we do. *But*... That works well for plain searches, not for Substring searches, because the String Preparation is not the same. The way to work around this issue is by pre-normalizing/preparing any key we want to retrieve from an index. Where we were doing things like :

idx.add( value.getNormValue(), id );

we now have to do :

String normalized = attributeType.getEquality().getNormalizer().normalize( value.getValue() );
idx.add( normalized, id );

(assuming the normalizer method also takes care of String Preparation). This is kind of heavy...

2) Or we can take a very different approach : use a Value as a Key. This approach is way simpler, because Values know how to compare themselves, so we don't have to take care of normalization before injecting a Key : there is a guarantee that the key will be unique, no matter what. The big issue with this approach is that every time we are going to fetch such a Key, we will need to create a Value wrapper around it. It will also take way more room on disk, because a serialized Value contains both the user-provided value and the normalized value. That doubles the size of the key. We also have to deserialize it into an object. Last, not least, that does not solve the Substring problem : we still don't process the filter key the same way as for a full search, unless we tell the wrapper Value how the interned value must be normalized. However, the Value instance stores the normalized form beside the user-provided form, so we can imagine that we may tell the Value class how to normalize substrings. That would require a modification in the Value class, as we will have to provide a dedicated normalizer for that purpose...

Side note : Substring search processing
---------------------------------------

In the server, processing a substring search is done using a regexp.
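The machinery this side note describes can be sketched like this. All the names are made up for the example, and the regexp building deliberately skips the MatchingRule-dependent spacing discussed next.

```java
import java.util.ArrayList;
import java.util.List;

// Sketches of the substring machinery : the regexp built from a filter
// like (cn=ABC*DEF*GEH), the reversed key used for final substrings, and
// the 4-character fragment keys some servers index. Names are illustrative.
public class SubstringSketch
{
    // (cn=ABC*DEF*GEH) --> "abc.*def.*geh" (matched against the normalized value)
    static String toRegexp( String initial, List<String> any, String fin )
    {
        StringBuilder sb = new StringBuilder();

        if ( initial != null ) sb.append( initial.toLowerCase() );
        sb.append( ".*" );

        for ( String part : any )
        {
            sb.append( part.toLowerCase() ).append( ".*" );
        }

        if ( fin != null ) sb.append( fin.toLowerCase() );
        return sb.toString();
    }

    // key for the reversed index used by final substrings (cn=*DEF)
    static String reversed( String value )
    {
        return new StringBuilder( value ).reverse().toString();
    }

    // 4-character fragments, as in the fragment index other servers maintain
    static List<String> fragments( String value )
    {
        List<String> keys = new ArrayList<>();

        for ( int i = 0; i + 4 <= value.length(); i++ )
        {
            keys.add( value.substring( i, i + 4 ) );
        }

        return keys;
    }

    public static void main( String[] args )
    {
        System.out.println( toRegexp( "ABC", List.of( "DEF" ), "GEH" ) ); // abc.*def.*geh
        System.out.println( reversed( "John Smith" ) );                   // htimS nhoJ
        System.out.println( fragments( "John Smith" ) );
    }
}
```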
ie, if we consider a filter like (cn=ABC*DEF*GEH), this will create a regexp like 'abc.*def.*geh'. This works in trunk; it won't work in the value branch because there are missing spaces at the beginning and at the end of the regexp. This is very easy to fix : we just have to add one space on both sides of the String (of course, this all depends on the MatchingRule : a telephoneNumber value will not require any extra spaces, for instance).

Searching for a value using an initial substring like 'cn=ABC*' uses the standard index, as it's an ordered BTree.

Searching for a value using a final substring like 'cn=*DEF' will require a reversed index (ie, we compute the reversed string for each value, and create a BTree with the result : 'John Smith' will become 'htimS nhoJ'). That works just fine.

For an inner substring like '*abc*', we don't have a solution and we do a full scan. Other LDAP servers compute a special index using fragments of Strings. For instance, the 'John Smith' value will be injected into a BTree using the following keys : 'John', 'ohn ', 'hn S', 'n Sm', ' Smi', 'Smit' and 'mith'. That covers all the use cases, except that any search using a substring like 'Jo*' will have to do a full scan, because there is nothing that tells us that 'Jo' is associated with an entry.

At this point, there is a balance to find, and we decided that we wanted to favor searches using initial and final substrings. It's faster for most of those searches, plus the addition of values requires way fewer database updates, as we don't need a dedicated index with many values having to be injected.

Conclusion
----------

At this point, and considering the pros and cons of both options, I think that keeping the indexes as they are is probably better. That means we will have to modify the way we update and search data in the AbstractBTreePartition class. The good thing is that there is only one place where all the changes must take place.
The bad thing is that a lot of the Partition instance tests will have to be changed... This is what I'm working on atm, and I have 132 out of 153 tests working in XDBM. Most of the failures are due to some missing modifications in many operations (I only tested the changes for the 'add' operation).

I may come up with some better idea, or you may propose some feedback and better ideas; I would really welcome them. In the meantime, I will move forward in the branch and keep going with updates and feedback.

thanks !