Following up on my message from last week... After doing a little bit of
profiling, it turns out that the query execution time was dominated by
lookups to the TDB node table to translate between a NodeId (wrapper around
a long) and a Node (an RDF node). These lookups happen as a result of
calling Algebra.compatible() to test for compatibility between bindings on
the LHS and RHS of a join (or, in our case, a minus, which uses
approximately the same logic).
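
For reference, the compatibility test boils down to checking that every
variable bound on both sides maps to the same RDF term; in TDB, each of
those value comparisons is what forces a NodeId-to-Node lookup. Here's a
minimal sketch of the semantics, using plain string maps to stand in for
Jena's Binding class:

    import java.util.Map;

    class Compat {
        // Two bindings are compatible when every variable bound in both
        // maps to equal values; variables bound on only one side are
        // irrelevant. In TDB, each value comparison is where the
        // NodeId -> Node lookups come from.
        static boolean compatible(Map<String, String> left,
                                  Map<String, String> right) {
            for (Map.Entry<String, String> e : left.entrySet()) {
                String other = right.get(e.getKey());
                if (other != null && !other.equals(e.getValue()))
                    return false;   // shared variable, different values
            }
            return true;
        }
    }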

To give an idea of the scale of the problem caused by the naive minus
implementation, which loops over the RHS for every row of the LHS: on a
single profiling run of an ASK query, I found that QueryIterMinus advanced
the LHS iterator 9,600 times looking for a row that was not matched on the
RHS. On average, each scan of the RHS processed 4,800 rows before finding
a compatible binding and rejecting the LHS solution, for a total of 46
million binding-compatibility tests. With two variables common to the LHS
and RHS, each test costs 4 node table lookups, for a grand total of 184
million node table lookups in this query. Thankfully 99.99% of the lookups
hit the in-memory cache (nothing was ever evicted), but there is still no
way that many lookups will perform well. The query in question took ~5
minutes to complete.
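
For anyone who hasn't looked at the code, the naive approach has roughly
the following shape (a sketch of the loop structure, not the actual
QueryIterMinus source, and ignoring MINUS's special case for bindings with
no variables in common):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class NaiveMinus {
        // For each LHS row, scan the whole RHS until a compatible
        // binding turns up: worst case |LHS| * |RHS| compatibility
        // tests, each of which costs node table lookups in TDB.
        static List<Map<String, String>> minus(List<Map<String, String>> lhs,
                                               List<Map<String, String>> rhs) {
            List<Map<String, String>> result = new ArrayList<>();
            for (Map<String, String> l : lhs) {
                boolean matched = false;
                for (Map<String, String> r : rhs) {
                    if (Compat.compatible(l, r)) {   // sketch from above
                        matched = true;
                        break;
                    }
                }
                if (!matched)
                    result.add(l);   // keep rows with no compatible RHS binding
            }
            return result;
        }
    }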

The good news in all this is that some small changes to the query logic to
reduce the number of compatibility tests yield a tremendous performance
improvement. I tested the same query with an initial QueryIterMinus
implementation that uses a fairly straightforward HashSet-based index,
pre-computed on the RHS bindings projected to the common variables (as
determined by static analysis). Queries that had previously timed out
after 15 minutes gave sub-second response times with a warm node table
cache under this new implementation.
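
Roughly, the idea is the following (a sketch under the assumption that
static analysis has shown every common variable to be bound in every row
on both sides; plain maps stand in for Binding, as in the earlier
sketches):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class HashedMinus {
        // Project a binding down to the common variables; the resulting
        // List has value-based equals()/hashCode(), so it can key a
        // HashSet.
        static List<String> project(Map<String, String> binding,
                                    List<String> commonVars) {
            List<String> key = new ArrayList<>(commonVars.size());
            for (String v : commonVars)
                key.add(binding.get(v));
            return key;
        }

        // One pass over the RHS to build the index, then one hash probe
        // per LHS row: O(|LHS| + |RHS|) instead of O(|LHS| * |RHS|).
        static List<Map<String, String>> minus(List<Map<String, String>> lhs,
                                               List<Map<String, String>> rhs,
                                               List<String> commonVars) {
            Set<List<String>> index = new HashSet<>();
            for (Map<String, String> r : rhs)
                index.add(project(r, commonVars));

            List<Map<String, String>> result = new ArrayList<>();
            for (Map<String, String> l : lhs)
                if (!index.contains(project(l, commonVars)))
                    result.add(l);
            return result;
        }
    }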

Beyond this simple change, there are further potential optimizations for
situations where multiple iterations over a result set are unavoidable (or
where a server under heavy load takes lots of node table cache misses and
has to go to disk frequently). These optimizations depend on the fact
that, for a given node table, two NodeIds are equal if and only if their
mapped Nodes are also (syntactically) equal. Stated slightly differently,
it is impossible for two Nodes that compare as equal via Object.equals()
to map to different NodeId values. I believe this to be the case based on
a cursory look at the node table source, but it should be verified by
somebody more familiar with the code.
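
For concreteness, the invariant can be stated as a checkable property.
(NodeTable and NodeId are TDB's types; getNodeForNodeId() is the
NodeId-to-Node lookup as I remember the interface, so double-check the
name.)

    // Claimed invariant, for NodeIds a and b obtained from the same
    // node table:
    //
    //     a.equals(b)  <==>  getNodeForNodeId(a).equals(getNodeForNodeId(b))
    //
    static boolean invariantHolds(NodeTable table, NodeId a, NodeId b) {
        return a.equals(b)
            == table.getNodeForNodeId(a).equals(table.getNodeForNodeId(b));
    }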

Once the equivalence between NodeId equality and Node equality is
established (at least for NodeIds from the same node table in the same
transaction), you can take advantage of it by detecting that the two
bindings being tested for compatibility are backed by NodeIds from the
same table. In that situation, you can compare NodeId values directly
instead of looking the nodes up in the table. Not only does this eliminate
node table lookups, it also simplifies the equality tests, which become
simple integer comparisons instead of string comparisons. I think such a
strategy could give a significant performance improvement for queries that
make heavy use of join operations.
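
As a sketch of what that fast path might look like (illustrative names
only: plain longs stand in for NodeId values and an opaque token
identifies the backing table; none of this is Jena's actual internal
API):

    import java.util.Map;

    class FastCompat {
        static boolean compatible(Map<String, Long> left, Object leftTable,
                                  Map<String, Long> right, Object rightTable) {
            if (leftTable != rightTable)
                return compatibleByNode(left, right);   // existing slow path
            for (Map.Entry<String, Long> e : left.entrySet()) {
                Long other = right.get(e.getKey());
                // Same table, so differing NodeIds imply differing Nodes:
                // a plain long comparison, no node table lookup at all.
                if (other != null
                        && other.longValue() != e.getValue().longValue())
                    return false;
            }
            return true;
        }

        static boolean compatibleByNode(Map<String, Long> left,
                                        Map<String, Long> right) {
            // Fallback: materialize the Nodes through their respective
            // tables and compare them, as Algebra.compatible() does today.
            throw new UnsupportedOperationException("omitted in this sketch");
        }
    }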

Regards,
Alex
