Hi Emmanuel,

On Sun, May 9, 2010 at 6:42 PM, Emmanuel Lecharny <elecha...@gmail.com>wrote:
...

As soon as you consider that a delete operation will need to update all the
> indexes the deleted entry uses, you see that you can benefit from having such
> a reverse index: you don't have to grab the entry from the backend, as you
> already have its ID, which is all you need. Thus, a delete operation
> is just a matter of:
> - get the entry ID
> - for each index, if we have an <Entry ID> stored, then grab the associated
> values, and for each value, delete them from the forward index
> - delete the entry from the MT
>
>
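The delete flow quoted above could be sketched with plain in-memory maps standing in for the JDBM-backed forward and reverse tables (all class and method names here are invented for illustration, not the actual server classes):

```java
import java.util.*;

// Hypothetical in-memory stand-ins for the forward and reverse tables of
// one index; illustrative only, not the real XDBM/JDBM structures.
class ReverseIndexDeleteSketch {
    // forward table: attribute value -> IDs of entries carrying that value
    static final Map<String, Set<Long>> forward = new HashMap<>();
    // reverse table: entry ID -> attribute values this entry contributes
    static final Map<Long, Set<String>> reverse = new HashMap<>();

    static void add(long id, String value) {
        forward.computeIfAbsent(value, v -> new HashSet<>()).add(id);
        reverse.computeIfAbsent(id, i -> new HashSet<>()).add(value);
    }

    // Delete an entry from this index knowing only its ID: the reverse
    // table yields the values to clean up, so the entry itself is never
    // read back from the master table.
    static void drop(long id) {
        Set<String> values = reverse.remove(id);
        if (values == null) {
            return;
        }
        for (String value : values) {
            Set<Long> ids = forward.get(value);
            if (ids != null) {
                ids.remove(id);
                if (ids.isEmpty()) {
                    forward.remove(value);
                }
            }
        }
    }
}
```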
Yep, this is one benefit of the reverse index. There are other uses for the
reverse index. To start, let's list the various index methods leveraging
the reverse index:


    boolean reverse( ID id ) throws Exception;


    boolean reverse( ID id, K attrVal ) throws Exception;


    boolean reverseGreaterOrEq( ID id ) throws Exception;


    boolean reverseGreaterOrEq( ID id, K attrVal ) throws Exception;


    boolean reverseLessOrEq( ID id ) throws Exception;


    boolean reverseLessOrEq( ID id, K attrVal ) throws Exception;


These various reverse lookups are used all over the XDBM search code base
by the various expression Evaluators and Cursors. They are used to evaluate
assertion expressions in filters during search. Without reverse indices,
entries must be accessed and deserialized from the master table for every
potential candidate before full filter evaluation takes place. This could be
massively expensive and would cause cache turnover on most searches. The
performance would degrade considerably, and the whole search engine would
have to be altered, since things like alias dereferencing, which relies
heavily on reverse indices, would need to be handled differently.
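To give a rough feel for how an Evaluator uses these methods, here is a toy sketch; the interface below is a simplified stand-in for the real XDBM Index interface, and MapIndex is invented for the example:

```java
import java.util.*;

// Simplified stand-in for the XDBM Index interface; only the two reverse
// methods relevant here are sketched.
interface SketchIndex<K> {
    boolean reverse(long id) throws Exception;
    boolean reverse(long id, K attrVal) throws Exception;
}

// Toy in-memory index: a map plays the role of the reverse table.
class MapIndex implements SketchIndex<String> {
    final Map<Long, Set<String>> rev = new HashMap<>();

    void put(long id, String value) {
        rev.computeIfAbsent(id, i -> new HashSet<>()).add(value);
    }

    // Is this entry present in the index at all?
    public boolean reverse(long id) {
        return rev.containsKey(id);
    }

    // Does this entry carry this exact value? This is the check an
    // equality Evaluator performs to assert (attr=value) for a candidate
    // ID without touching the master table.
    public boolean reverse(long id, String attrVal) {
        return rev.getOrDefault(id, Collections.emptySet()).contains(attrVal);
    }
}
```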

Certainly we can have improvements in the algorithm; it is by no means
perfect. However, please realize that this search engine design and db
structure has suited our needs for over 8 years with minor tweaks here and
there. I'm fairly confident that the removal of reverse tables in indices
will be disastrous, but in no way would I say we should not give it a try.
But please experiment in another branch. This is by far one of the most
critical aspects of the LDAP server.

... SNIP

>
> Now, it's not enough to say 'kill the reverse index!'. There is one more
> reason why we want to have this reverse index; the question is: does it
> bring a lot of benefit?
>
> So why do we need this reverse index? We discussed it with Stefan,
> and here is what it is used for: suppose you have a search request with a
> filter like (&(ObjectClass=XXX)(cn=YYY)).
>
> The search engine will evaluate the number of entries each of those filter
> nodes will return. Suppose it's N for the OC filter, and M for the cn
> filter. Let's say that N > M.
> Now, we will loop on the M entries to check if they fit the OC filter.
>
>
Yep, looping on M reduces the search space.


> How do we do that? The first approach would be to grab the entry and
> check in memory whether the filter (ObjectClass=XXX) matches the entry. If
> not, we ditch the entry; otherwise, it's a valid candidate.
>
>
Right, but you'll have to access and deserialize all candidates matching
(CN=YYY) even when the OC does not match XXX. This can be a massive number.

The reverse table in the OC index is used to determine whether the entry in
fact matches the (ObjectClass=XXX) assertion without requiring a master
table access along with its deserialization overhead. Note that the Cursor
that walks the CN index is already returning candidates matching the
(cn=YYY) expression. So the reverse lookup is needed to be able to assert
the other part of the filter, which btw can be arbitrarily deep, even
though in this example it's a simple OC assertion.
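To make that walk concrete, here is a sketch with plain maps standing in for the cn index's forward table and the objectClass index's reverse table (names and structure invented for the example):

```java
import java.util.*;

// Sketch of evaluating (&(objectClass=XXX)(cn=YYY)): walk the smaller cn
// candidate set, assert the objectClass branch through the reverse table,
// and only matching IDs would ever be fetched from the master table.
class AndFilterSketch {
    // forward table of the cn index: value -> candidate entry IDs
    static final Map<String, Set<Long>> cnForward = new HashMap<>();
    // reverse table of the objectClass index: entry ID -> objectClass values
    static final Map<Long, Set<String>> ocReverse = new HashMap<>();

    static List<Long> search(String cn, String oc) {
        List<Long> matches = new ArrayList<>();
        for (long id : cnForward.getOrDefault(cn, Collections.emptySet())) {
            // The reverse lookup replaces a master table access plus
            // deserialization for every non-matching candidate.
            if (ocReverse.getOrDefault(id, Collections.emptySet()).contains(oc)) {
                matches.add(id);
            }
        }
        return matches;
    }
}
```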


> The second option is to use the reverse index: we have the entry ID (we
> haven't grabbed the entry from disk yet), and we can see if the OC table
> contains a reference for this entry ID. If not, we can move to the next
> entry ID. Otherwise, we can grab the entry and return it.
>
>
From my understanding, this is how things work today.


> Obviously there is some potential for a speedup. Now, let's consider the
> cases where we benefit from not grabbing the entry.
>
> 1) We grab all the entries using the smallest set selected with the filters
> pros:
> - Fast entry filtering; it's just a question of applying the filters to the
> entry
> - It can be used very late, as we have already grabbed the entry. The entry
> selection can not only be based on the filter, but can also be combined with
> virtual attributes which have been added to the grabbed entry
> - That means we can move the search engine out of the backend, and have it
> working on top of the Interceptor chain
>
> cons:
> - We may read way more entries than necessary
>
>
Oh yes.


> 2) we just get the Entry ID and use the reverse index to select the entries
> pros:
> - we don't grab the entries unless absolutely necessary
>
> cons:
> - we may have to check in many indexes (as many as we have exprNodes in the
> filter expression, assuming each of them is indexed), which means a lot of
> O(log(N)) operations on these indexes (assuming that all those indexes are
> in memory; otherwise, if we hit the disk, the benefit is null)
> - as soon as we have one single non-indexed exprNode in the filter we have
> to check, then we will have to grab the entry anyway. The benefit is that we
> may save some disk accesses.
> - of course, we have to maintain all the reverse indexes.
> - if the exprNode is associated with an attribute not present in the entry,
> we do a lookup in the index for nothing
> - if the intersection between the two exprNodes is big (ie
> Nb(ObjectClass=XXX) inter Nb(cn=YYY)), then the gain will be low, and if
> this intersection is small, then it's likely that the smallest set has been
> already selected as the main index to use to grab entries, thus leading to a
> small number of entries to grab.
>
>
> So here is what I suggest:
> - get rid of those reverse indexes
> - or, at least, make them optional
>
> thoughts?
>
>
If you want me to be frank, and not Alex :), I think it's not a good idea at
all; however, feel free to experiment in a branch. I'm giving my opinion
because I don't want you to waste your time, but who knows, there might be a
serious gain.

Also think about this: when you do your performance tests, make sure you test
more than simple one-entry lookups, using a real-world data set with real
filters that leverage indices, because this is where these data structures
make a world of difference.

-- 
Alex Karasulu
My Blog :: http://www.jroller.com/akarasulu/
Apache Directory Server :: http://directory.apache.org
Apache MINA :: http://mina.apache.org
To set up a meeting with me: http://tungle.me/AlexKarasulu
