On 6/24/11 9:40 AM, Alex Karasulu wrote:
On Thu, Jun 23, 2011 at 7:20 PM, Emmanuel Lecharny<elecha...@gmail.com>  wrote:
Hi guys,

as I was reviewing the Table interface, and the Index hierarchy, I had a bit
of time to try to understand this part of the server I was not comfortable
with. And I saw that each index is using two tables : a forward and a
reverse table.

The forward table is obviously mandatory. It is used to retrieve the list of
entry's ID from an AT value. Fine. But what about the reverse table ?

Alex told me that it's useful when dealing with such filters :
(&(cn=A)(sn=B)).

The optimization engine will get the index that provides the smallest set of
candidate (say, CN here). We then grab from the CN forward table all the
entry's ID associated with the 'A' value. Let's call it the CN-IDs set.

Now, if we were to grab all the entries from the MasterTable to compare
their SN attribute with the 'B' value, t would be quite costly. We use the
SN's reverse table. For each entry's ID, we check if the associated value in
the SN's reverse table exists or not. If it exists, we have a valid
candidate, otherwise, we simply discard the candidate.

As we can see, we never fetch the entry unless we have a valid candidate (of
course, if the SN and CN AT are indexed).

However, I do think this reverse table is spurious. We can get the same
result by using the SN's forward table : for each entry's ID, check in the
SVn's forward table if the set of values (SN-IDs) contains the ID we pulled
from the CN-IDs set.

We can even do an SN-IDs ^ CN-IDs ( '^'<=>  intersection) to get the list of
valid candidates. No need to do N fetches in the reverse table (where N is
the number of ID in CN-IDs).

I think we can remove those reverse table, and I expect that it can speed up
the search a bit, but mainly speed up greatly any modification operation
(add, delete, modify) and even decrease the disk consumption.
I think this is possible. It would be interesting to test these two
different systems using different kinds of data. The number of
duplicate keys will obviously impact the performance: it will be a
trade off.

To test an Evaluator (i.e. sn=foo) on a candidate entry say id 37, we
need to perform a reverse lookup on the sn index. The btree key is 37
and we check if it contains a tuple with a value of 'foo'. If it does
then the Evaluator returns true if not it returns false. Now since we
know we're looking for tuple ('foo', 37) and we only have the forward
table we can lookup the 'foo' entries, and the values from duplicate
keys would be sorted. We would just need to check if 37 were in these
values.

The reverse index has no duplicate keys. The only way to get a
duplicate key in the reverse index is if the same entry (i.e. 37)
contained the same value ('foo') for the same (sn) attribute. And this
we know is not possible. So the lookups against the reverse table will
be faster.

I was thinking about something a bit different : as soon as you have grabbed the list of entry's ID from the first index, looking into the other indexes will also return a list of Entry's ID. Checking if those IDs are valid candidate can then be done in one shot : do the intersection of the two sets (they are ordered, so it's a O(n) operation) and just get the matching entries.

Compared to the current processing (ie, accessing the reverse index for *each* candidate), this will be way faster, IMO.

Note that it's true for the AND connector, but even for the OR connector, we don't have this kind of issue.
May be I'm missing something though. I have created a branch
(http://svn.apache.org/viewvc/directory/apacheds/branches/apacheds-no-reverse-index/)
to play with this idea.
As long as we're careful and don't just push this into the main branch
I'm very happy to experiment with this. Let's see really what the
costs are. Because I can sit around and theorize all day but some
benchmarks really show us what reality is about.
The main issue so far is with the RDN, oneLevel and Subtree indexes, which are using entry's ID, and are representing the tree hierarchy.

It impacts the way we delete entries, too, as we currently use the reverse index to quickly retrieve the values to remove from the forward index. However, we can just iterate on the entry's attribute to do the same thing, invoking the drop( K, Value) method instead of invoking the drop( ID ) method.

Very interesting experiment !

I'll commit what I'm coming with today.

--
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

Reply via email to