Re: On SPARQL queries with DISTINCT + ORDER BY + LIMIT

Paolo Castagna Fri, 02 Sep 2011 03:13:39 -0700

Paolo Castagna wrote:
> Hi,
> I was about to create a new JIRA issue (an improvement) to 'optimise' SPARQL
> queries with DISTINCT + ORDER BY + LIMIT. However, while I was writing it
> I convinced myself it's not really necessary. Here is why.
> 
> In JENA-89 we implemented a QueryIterTopN using a PriorityQueue to improve
> the scalability of ORDER BY + LIMIT queries avoiding a total sort.
> In JENA-90 we want to reduce the amount of memory used by QueryIterDistinct
> replacing an OpDistinct with an OpReduced for DISTINCT + ORDER BY queries
> avoiding to keep an in-memory data structure of all the already seen bindings.
> 
> What can we do about DISTINCT + ORDER BY + LIMIT queries?
> 
> We could provide a new QueryIterTopNDistinct which adds to a PriorityQueue
> if and only if a binding is not already there. So, this can be viewed as a
> further improvement of JENA-89.
> 
> However, I am not convinced anymore that this is really useful or a good
> idea, since we want to use QueryIterTopN (i.e. heap) for relatively small
> N in our LIMIT N clause.
> 
> If N is large, the optimisation described in JENA-90 kicks in and the
> slicing is cheap. If N is small, JENA-89 kicks in and the DISTINCT over
> a small number of results is cheap.
> 
> Therefore we do not need to do anything special for DISTINCT + ORDER BY +
> LIMIT.


I was wrong. :-)

Now, I better understand how the two optimizations (in JENA-89 and JENA-90)
interact each others and I think there is another small step left to do.

I described the situation in JENA-108:
https://issues.apache.org/jira/browse/JENA-108

Maybe, I could have done everything in one shot... but, since on this I am
learning while doing, a small step at the time seems the best thing.

Paolo

> It's better, as Andy suggested, to invest on 'clever' caching and merge
> joins in TDB. There's not yet a JIRA issue for merge joins in TDB.
> 
> Paolo

Re: On SPARQL queries with DISTINCT + ORDER BY + LIMIT

Reply via email to