Paolo Castagna wrote: > Hi, > I was about to create a new JIRA issue (an improvement) to 'optimise' SPARQL > queries with DISTINCT + ORDER BY + LIMIT. However, while I was writing it > I convinced myself it's not really necessary. Here is why. > > In JENA-89 we implemented a QueryIterTopN using a PriorityQueue to improve > the scalability of ORDER BY + LIMIT queries avoiding a total sort. > In JENA-90 we want to reduce the amount of memory used by QueryIterDistinct > replacing an OpDistinct with an OpReduced for DISTINCT + ORDER BY queries > avoiding to keep an in-memory data structure of all the already seen bindings. > > What can we do about DISTINCT + ORDER BY + LIMIT queries? > > We could provide a new QueryIterTopNDistinct which adds to a PriorityQueue > if and only if a binding is not already there. So, this can be viewed as a > further improvement of JENA-89. > > However, I am not convinced anymore that this is really useful or a good > idea, since we want to use QueryIterTopN (i.e. heap) for relatively small > N in our LIMIT N clause. > > If N is large, the optimisation described in JENA-90 kicks in and the > slicing is cheap. If N is small, JENA-89 kicks in and the DISTINCT over > a small number of results is cheap. > > Therefore we do not need to do anything special for DISTINCT + ORDER BY + > LIMIT.
I was wrong. :-) Now, I better understand how the two optimizations (in JENA-89 and JENA-90) interact each others and I think there is another small step left to do. I described the situation in JENA-108: https://issues.apache.org/jira/browse/JENA-108 Maybe, I could have done everything in one shot... but, since on this I am learning while doing, a small step at the time seems the best thing. Paolo > It's better, as Andy suggested, to invest on 'clever' caching and merge > joins in TDB. There's not yet a JIRA issue for merge joins in TDB. > > Paolo
