On SPARQL queries with DISTINCT + ORDER BY + LIMIT

Paolo Castagna Thu, 25 Aug 2011 14:00:24 -0700

Hi,
I was about to create a new JIRA issue (an improvement) to 'optimise' SPARQL
queries with DISTINCT + ORDER BY + LIMIT. However, while I was writing it
I convinced myself it's not really necessary. Here is why.


In JENA-89 we implemented a QueryIterTopN using a PriorityQueue to improve
the scalability of ORDER BY + LIMIT queries by avoiding a total sort.
In JENA-90 we want to reduce the amount of memory used by QueryIterDistinct
by replacing an OpDistinct with an OpReduced for DISTINCT + ORDER BY queries,
so that we no longer need to keep an in-memory data structure of all the
bindings already seen.
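
To make the JENA-89 idea concrete, here is a rough sketch of a top-N
selection with a bounded heap. It is not the actual QueryIterTopN code: the
class name TopNSketch, the method topN and the generic type T (standing in
for a binding) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not Jena code: T stands in for a binding.
class TopNSketch {
    // Returns the first n items according to 'order' without sorting the whole input.
    static <T> List<T> topN(Iterable<T> input, int n, Comparator<T> order) {
        List<T> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Reversed comparator: the head of the heap is the worst item kept so far.
        PriorityQueue<T> heap = new PriorityQueue<>(n, order.reversed());
        for (T item : input) {
            if (heap.size() < n) {
                heap.add(item);
            } else if (order.compare(item, heap.peek()) < 0) {
                // The new item beats the current worst one: swap them.
                heap.poll();
                heap.add(item);
            }
        }
        // Only the n survivors are sorted at the end.
        result.addAll(heap);
        result.sort(order);
        return result;
    }
}
```

The heap never holds more than n items, so memory is bounded by the LIMIT
rather than by the number of solutions.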

What can we do about DISTINCT + ORDER BY + LIMIT queries?

We could provide a new QueryIterTopNDistinct which adds to a PriorityQueue
if and only if a binding is not already there. So, this can be viewed as a
further improvement of JENA-89.
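
For the record, such an iterator could look roughly like the sketch below.
Nothing like this exists in the codebase: TopNDistinctSketch, topNDistinct
and the generic type T (standing in for a binding) are invented names, and
the sketch assumes that equal items also compare as equal under the ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical sketch, not Jena code: T stands in for a binding.
class TopNDistinctSketch {
    // Returns the first n distinct items according to 'order'.
    static <T> List<T> topNDistinct(Iterable<T> input, int n, Comparator<T> order) {
        List<T> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Reversed comparator: the head of the heap is the worst item kept so far.
        PriorityQueue<T> heap = new PriorityQueue<>(n, order.reversed());
        // Mirrors the heap content so the "already there" check is a hash lookup.
        Set<T> inHeap = new HashSet<>();
        for (T item : input) {
            if (inHeap.contains(item)) {
                continue; // duplicate of an item we already keep
            }
            if (heap.size() < n) {
                heap.add(item);
                inHeap.add(item);
            } else if (order.compare(item, heap.peek()) < 0) {
                // Evict the current worst item and keep the new one instead.
                inHeap.remove(heap.poll());
                heap.add(item);
                inHeap.add(item);
            }
        }
        result.addAll(heap);
        result.sort(order);
        return result;
    }
}
```

Only the items currently in the heap need to be remembered: an evicted item
was the worst one kept, and the heap only gets better afterwards, so a later
duplicate of it cannot get back in.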

However, I am no longer convinced that this is really useful or a good
idea, since we want to use QueryIterTopN (i.e. a heap) only for relatively
small N in our LIMIT N clause.

If N is large, the optimisation described in JENA-90 kicks in and the
slicing is cheap. If N is small, JENA-89 kicks in and the DISTINCT over
a small number of results is cheap.

Therefore we do not need to do anything special for DISTINCT + ORDER BY +
LIMIT.

It's better, as Andy suggested, to invest in 'clever' caching and merge
joins in TDB. There's not yet a JIRA issue for merge joins in TDB.

Paolo
