lucene 2.9 sorting algorithm

John Wang Wed, 14 Oct 2009 20:12:51 -0700

Hi guys:
    Looking at the 2.9 sorting algorithm, and while trying to understand
FieldComparator class, I was wondering about the following optimization: (I
am using StringOrdValComparator as an example)


Currently we have 1 instance of per segment data structure, e.g. (ords,vals
etc.), and we keep track of the reader context using the readerGen
array. We keep track of the top numHits elemnents by keeping a
reference to the bottom element. To convert from readerContext1 to
readerContext2, we need a mapping from value->ord, and the cost is a binary
search (this is can be done N*numhits times). For some instances of custom
sort extension, this lookup maybe expensive.

How about instead, we keep a N instances of segment data structure, all
compares within segment is done with ord compare. We never to across segment
compare until the end. Where a merge is done on N instances using the
Comparator. This way, we avoid the lookup from value->ord, and can keep the
simpler API: ScoreDocComparator, which is much easier to extend for custom
sorting. It is a higher memory cost, but it is an extra (N-1)*(numhits)
times the data structure, where N is the number of segments, and is not too
large.

I am probably missing some intricacies with the current sorting algorithm.
If so, please shine some light.

Thanks

-John

lucene 2.9 sorting algorithm

Reply via email to