To Sort or not to Sort

Scott Smith Thu, 16 Dec 2004 16:46:23 -0800

I'm hoping someone has an opinion (based on some experience) as to how I
might approach a design I'm doing with Lucene.


 

In my application, users search for messages with Lucene.  Typically,
they are more interested in seeing their hits in date-order than in
relevance-order.  In reading my ebook copy of "Lucene in action" (wish
I'd had that a year ago), I find that one of the features added in 1.4
was the ability to ask for hits in an order based on a field.  It also
looks like adding the field necessary to get things by date order is
straight forward.

 

But, for my browser-based application I think there is another
consideration.  Users will typically page through the messages 20-50 at
a time and often they will only look through the first few "pages" of
messages and then be done.  So, I think there are two possible designs.

 

1.      Simply use the built-in lucene sort functionality, cache the hit
list and then page through the list.  Adv: looks pretty straight
forward, I write less code.  Dis: for searches that return a large
number of hits (having a search return several hundred to a few thousand
hits is not uncommon), Lucene is sorting a lot of entries that don't
really need to be sorted (because the user will never look at them) and
sorting tends to be expensive.
2.      The other solution uses a priority heap to collect the top N (or
next N) entries.  I still have to walk the entire hit list, but keeping
entries in a priority heap means I can determine the N entries I need
with a few comparisons and minimal sorting.  I don't have to sort a
bunch of entries whose order I don't care about.  Additionally, I don't
have to have all of the entries in memory at one time.  The big
disadvantage with this is that I have to write more code.  However, it
may be worth it if the performance difference is large enough. 

 

This may be one of those questions where the only answer is "code it
both ways and do speed trials".  I was just wondering if anyone had
enough experience with either method to offer an opinion.  Are there
things the Lucene sort is doing "under the covers" that will make it's
ability to sort much faster than what I can do with the hit lists since
I still have to force the IndexSearch object to retrieve all of the
Documents in the hits list?

 

 Opinions?

 

Scott

To Sort or not to Sort

Reply via email to