Hi guys, last night I started to play with the new API, using it in the server. I started by fixing the compilation failures (100 errors, more or less), then I ran the tests on xdbm, which is the place where we process searches and updates.
Not a surprise that this is impacted a lot... <warning> long and technical mail...</warning>

We have numerous things that are changing in the way we process indexes. Indexes are <Key, Value> tables, where the key is unique. We use a lot of default indexes in ApacheDS, some of them having a reverse index (ie the value becomes the key) :

Presence           : ObjectIdentifier -> Entry UUID
RDN                : ParentIdAndRdn -> Entry UUID
RDN reverse        : Entry UUID -> ParentIdAndRdn
Alias              : DN -> Entry UUID
Alias reverse      : Entry UUID -> DN
OneAlias           : Entry UUID -> Entry UUID
SubAlias           : Entry UUID -> Entry UUID
ObjectClass        : ObjectIdentifier -> Entry UUID
EntryCSN           : CSN -> Entry UUID
AdministrativeRole : ObjectIdentifier -> Entry UUID

and of course :

Master             : Entry UUID -> Entry

(where 'ObjectIdentifier' is an OID, '2.5.4.3' for instance)

We also have as many user indexes as needed, where the key can be whatever the user defines, as long as the AttributeType defines an Equality matchingRule, ie one of : bitStringMatch, booleanMatch, caseExactIA5Match, caseExactMatch, caseIgnoreIA5Match, caseIgnoreListMatch, caseIgnoreMatch, directoryStringFirstComponentMatch, distinguishedNameMatch, generalizedTimeMatch, integerFirstComponentMatch, integerMatch, keywordMatch, numericStringMatch, objectIdentifierFirstComponentMatch, objectIdentifierMatch, octetStringMatch, presentationAddressMatch, protocolInformationMatch, telephoneNumberMatch, uniqueMemberMatch and wordMatch.

Most of the indexes will use a String for the key, but it can also be an integer, or a byte[], or anything needed to represent the AttributeType syntax.

Ok, now let's see what kind of problems we have to face and deal with. First of all, we said that the key must be unique, and in some cases many representations are acceptable for the same object in LDAP : typically, 'cn', 'CommonName' and '2.5.4.3' are all representing the same thing : the commonName attributeType.
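As a toy illustration of why those representations must collapse to a single key, here is a small sketch. The alias map and the class are made up for the example; the real server of course resolves this through its schema registries, not a hard-coded map.

```java
import java.util.Map;

// Toy sketch : every alias of an AttributeType collapses to its OID, so
// 'cn', 'CommonName' and '2.5.4.3' all land under one unique index key.
// The map and method are illustrative, not the ApacheDS schema registry.
public class OidNormalizer
{
    private static final Map<String, String> ALIASES = Map.of(
        "cn", "2.5.4.3",
        "commonname", "2.5.4.3",
        "2.5.4.3", "2.5.4.3" );

    public static String toOid( String attributeName )
    {
        String key = attributeName.trim().toLowerCase();
        return ALIASES.getOrDefault( key, key );
    }

    public static void main( String[] args )
    {
        System.out.println( toOid( "CommonName" ) ); // 2.5.4.3
        System.out.println( toOid( "cn" ) );         // 2.5.4.3
    }
}
```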
At this point, we must normalize all those representations so that the key can be retrieved and be unique.

Another use case is for values that are equivalent according to the MatchingRule. For instance a 'cn' value like 'ACME company' is equivalent to ' aCmE ComPAnY ' (ie, case is insensitive, and spurious spaces are removed). This makes it a bit problematic, because we have to guarantee that the key is unique, which means a transformation has to be done before injecting the key, and also when retrieving the key from the index.

Last, not least, LDAP allows the user to send search requests using substrings for some AttributeTypes, and those substrings must also be processed accordingly (normalized).

We can sum this up with the following transformations that have to be done :

User attribute --(normalization)--> normalized Key
User search    --(normalization)--> normalized Filter

Now, we have two possible strategies :

1) First (and this is what we currently do in the trunk), we can normalize all the values beforehand. That works splendidly, except when String Preparation comes into the picture. Although in trunk we do normalize the keys by applying the PrepareString transformation, so we are safe, except that we don't respect the String Preparation RFC for substrings, so in some use cases we may not retrieve a value when using a Substring filter. We can get this fixed by not applying the String Preparation beforehand, but only when matching the Key with the filter. We then store a normalized form of the Key in the index (Value --(normalization)--> Key) and retrieving the key from the index is all about doing :

index  --> key --(String Preparation)--> V1
filter --(normalization)--> normalized filter --(String Preparation)--> V2

and comparing V1 with V2 gives the result.

The problem with this approach is that the index implementation has no clue about the way the values must be compared (especially when the value is not a String, but a DN).
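To make that lookup path concrete, here is a rough sketch of strategy 1. The normalize() and prepare() bodies are crude stand-ins I made up : the real code would use the schema's Equality Normalizer and the RFC 4518 String Preparation steps, which are far more involved.

```java
// Sketch of strategy 1 : the index stores normalized keys, and String
// Preparation is applied on both sides only at match time. normalize()
// and prepare() are stand-ins, not the real ApacheDS implementations.
public class Strategy1Match
{
    // stand-in for a caseIgnoreMatch-like Equality normalizer
    static String normalize( String value )
    {
        return value.trim().toLowerCase().replaceAll( "\\s+", " " );
    }

    // stand-in for RFC 4518 String Preparation (here : just framing spaces)
    static String prepare( String value )
    {
        return " " + value + " ";
    }

    static boolean matches( String storedKey, String filterValue )
    {
        String v1 = prepare( storedKey );                // index --> key --(prep)--> V1
        String v2 = prepare( normalize( filterValue ) ); // filter --(norm)--(prep)--> V2
        return v1.equals( v2 );
    }

    public static void main( String[] args )
    {
        // 'ACME company' was stored under the normalized key 'acme company'
        System.out.println( matches( "acme company", " aCmE  ComPAnY " ) ); // true
    }
}
```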
That means we must provide a Comparator to the index. And this is what we do. *But*... That works well for plain searches, not for Substring searches, because the String Preparation is not the same. The way to work around this issue is by pre-normalizing/preparing any key we want to retrieve from an index. Where we were doing things like :

idx.add( value.getNormValue(), id );

we now have to do :

String normalized = attributeType.getEquality().getNormalizer().normalize( value.getValue() );
idx.add( normalized, id );

(assuming the normalizer method also takes care of String Preparation). This is kind of heavy...

2) Or we can take a very different approach : use a Value as a Key. This approach is way simpler, because Values know how to compare themselves, so we don't have to take care of normalization before injecting a Key : there is a guarantee that the key will be unique, no matter what. The big issue with this approach is that every time we are going to fetch such a Key, we will need to create a Value wrapper around it. It will also take way more room on disk, because a serialized Value contains both the user-provided value and the normalized value. That doubles the size of the key. We also have to deserialize it into an object. Last, not least, that does not solve the Substring problem : we still don't process the filter key the same way as for a full search, unless we tell the wrapper Value how the interned value must be normalized. However, the Value instance stores the normalized form beside the user-provided form, so we can imagine that we may tell the Value class how to normalize substrings. That would require a modification in the Value class, as we will have to provide a dedicated normalizer for that purpose...

Side note : Substring search processing
---------------------------------------

In the server, processing a substring search is done using a regexp.
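The machinery this side note describes can be sketched like this. All the names are made up for the example, and the regexp building deliberately skips the MatchingRule-dependent spacing discussed next.

```java
import java.util.ArrayList;
import java.util.List;

// Sketches of the substring machinery : the regexp built from a filter
// like (cn=ABC*DEF*GEH), the reversed key used for final substrings, and
// the 4-character fragment keys some servers index. Names are illustrative.
public class SubstringSketch
{
    // (cn=ABC*DEF*GEH) --> "abc.*def.*geh" (matched against the normalized value)
    static String toRegexp( String initial, List<String> any, String fin )
    {
        StringBuilder sb = new StringBuilder();

        if ( initial != null ) sb.append( initial.toLowerCase() );
        sb.append( ".*" );

        for ( String part : any )
        {
            sb.append( part.toLowerCase() ).append( ".*" );
        }

        if ( fin != null ) sb.append( fin.toLowerCase() );
        return sb.toString();
    }

    // key for the reversed index used by final substrings (cn=*DEF)
    static String reversed( String value )
    {
        return new StringBuilder( value ).reverse().toString();
    }

    // 4-character fragments, as in the fragment index other servers maintain
    static List<String> fragments( String value )
    {
        List<String> keys = new ArrayList<>();

        for ( int i = 0; i + 4 <= value.length(); i++ )
        {
            keys.add( value.substring( i, i + 4 ) );
        }

        return keys;
    }

    public static void main( String[] args )
    {
        System.out.println( toRegexp( "ABC", List.of( "DEF" ), "GEH" ) ); // abc.*def.*geh
        System.out.println( reversed( "John Smith" ) );                   // htimS nhoJ
        System.out.println( fragments( "John Smith" ) );
    }
}
```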
ie, if we consider a filter like (cn=ABC*DEF*GEH), this will create a regexp like 'abc.*def.*geh'. This works in trunk; it won't work in the value branch because there are missing spaces at the beginning and at the end of the regexp. This is very easy to fix : we just have to add one space on both sides of the String (of course, this all depends on the MatchingRule : a telephoneNumber value will not require any extra spaces, for instance).

Searching for a value using an initial substring like 'cn=ABC*' uses the standard index, as it's an ordered BTree.

Searching for a value using a final substring like 'cn=*DEF' will require a reversed index (ie, we compute the reversed string for each value, and create a BTree with the result : 'John Smith' will become 'htimS nhoJ'). That works just fine.

For an inner substring like '*abc*', we don't have a solution and we do a full scan. Other LDAP servers compute a special index using fragments of Strings. For instance, the 'John Smith' value will be injected into a BTree using the following keys : 'John', 'ohn ', 'hn S', 'n Sm', ' Smi', 'Smit' and 'mith'. That covers all the use cases, except that any search using a substring like 'Jo*' will have to do a full scan, because there is nothing that tells us that 'Jo' is associated with an entry.

At this point, there is a balance to find, and we decided that we wanted to favor searches using initial and final substrings. It's faster for most of those searches, plus the addition of values requires way fewer database updates, as we don't need a dedicated index with many values having to be injected.

Conclusion
----------

At this point, and considering the pros and cons of both options, I think that keeping the indexes as they are is probably better. That means we will have to modify the way we update and search data in the AbstractBTreePartition class. The good thing is that there is only one place where all the changes must take place.
The bad thing is that a lot of the Partition instance tests will have to be changed... This is what I'm working on atm, and I have 132 out of 153 tests working in XDBM. Most of the failures are due to some missing modifications in many operations (I only tested the changes for the 'add' operation).

I may come up with some better idea, or you may propose some feedback and better ideas; I would really welcome them. In the meantime, I will move forward in the branch and keep going with updates and feedback.

thanks !