At ScaleCamp yesterday in the UK, I was listening to a talk on Xapian and the 
speaker said one of the optimizations they do when retrieving a large result 
set is that instead of managing a Priority Queue, they just allocate a large 
array to hold all of the results and then sort afterward.   Seemed like a good 
idea since you could avoid a whole slew of PQ operations at the cost of 
(possibly) some extra memory, so I thought I would see if anyone has looked at 
doing something similar here.  (Xapian is C++, I believe, so it likely has 
different perf. characteristics which may or may not matter).  

A few things come to mind:
1. In many cases, when someone is asking for a lot of results, my guess is they 
want all the results.  You often see this in people who are doing significant 
post processing on large result sets (often for machine learning)
2. My gut says there is actually some heuristic to be applied here, whereby if 
the requested number of results is some percentage of the total num. of hits, 
then do the whole results + sort, otherwise, use PQ.  Thus, this maybe would be 
useful even when doing something like, say, 500 or a 1000 docs, as opposed to 
the bigger cases I'm thinking about.  Still, don't need to prematurely optimize.
3.  I think we could almost implement this entirely w/ a Collector, except for 
the sort step, which presumably could be a callback (empty implementation in 
the base class).
4. When doing this, you don't bother to truncate the list when n < totalHits 
(although see below).

Perhaps, we could also even save having a big array for doc ids and instead 
just flip bits in a bit set (would still need the array for scores) and then 
materialize the bit set to an array at the end when the num requested is less 
than the total number of hits.

Anyone have thoughts on this?  Seems fairly trivial to crank out a Collector to 
do it, minus the post processing step, which would be relatively trivial to add.

-Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to