On Jan 11, 2007, at 2:30 PM, jian chen wrote:

It seems to me that the impact-sorted list makes sense if you are trying to do pure vector-space-based ranking. That is what I have gathered from the research papers; they all talk about how to optimize the vector space model using this impact-sorted list approach.

That makes sense. It would work well for Lucene queries that happen to mimic pure vector queries: a simple TermQuery, or a BooleanQuery consisting entirely of "should" clauses wrapping TermQueries.
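Just to be concrete, here's roughly what I mean by those two cases, written against the BooleanQuery/TermQuery API as it stands today (the field name "content" and the terms are made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    public class PureVectorQueries {
        public static void main(String[] args) {
            // The simplest "pure vector" case: a single TermQuery.
            Query term = new TermQuery(new Term("content", "dinosaur"));

            // A BooleanQuery consisting entirely of "should" clauses
            // wrapping TermQueries -- still effectively a vector query.
            BooleanQuery vector = new BooleanQuery();
            vector.add(new TermQuery(new Term("content", "dinosaur")),
                       BooleanClause.Occur.SHOULD);
            vector.add(new TermQuery(new Term("content", "park")),
                       BooleanClause.Occur.SHOULD);

            System.out.println(term + " | " + vector);
        }
    }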

Let's explore how it would work for other varieties of BooleanQuery...

  A -B

Not so good. We need to iterate over all the doc nums for B. We could start by fetching only some of the docs for A, then iterate over all the docs for B and see whether we still have enough docs after filtering. If we don't and have to go back for more of A, we have to start over from scratch iterating over B. Either that, or we need to save set B in a BitVector just in case. Yuck.
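Here's a toy sketch -- not real Lucene code -- of what that BitVector fallback amounts to. The doc nums and impacts are invented, and a plain java.util.BitSet stands in for our BitVector:

    import java.util.BitSet;

    // Even though we only want the top of A's impact-sorted postings,
    // we must read *all* of B's postings up front before we can safely
    // filter anything.
    public class NotClauseSketch {
        public static void main(String[] args) {
            // Doc nums for A, already ordered by descending impact.
            int[] aDocs      = { 17, 4, 42, 8, 23 };
            float[] aImpacts = { 0.9f, 0.8f, 0.5f, 0.3f, 0.1f };
            // Doc nums for B -- every one of them has to be visited.
            int[] bDocs      = { 42, 8, 99, 4 };

            BitSet prohibited = new BitSet();
            for (int doc : bDocs) {       // full scan of B, no early exit
                prohibited.set(doc);
            }

            int wanted = 2;               // e.g. top-2 hits
            for (int i = 0; i < aDocs.length && wanted > 0; i++) {
                if (!prohibited.get(aDocs[i])) {
                    System.out.println("hit: doc " + aDocs[i]
                                       + " impact " + aImpacts[i]);
                    wanted--;
                }
            }
        }
    }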

  A +B

Hmm. You start with the highest-ranking docs for B. You stop when the impact of A has dropped low enough that no subsequent document can displace a current doc in the HitQueue. But what if B is a low-scoring term compared to A, e.g. 'dinosaur +park'? You'll have to go deep into the list for 'park'. And things just get really complicated and murky. I can't see how anything less than scoring all docs, like we do now, will reliably produce decent precision.
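For what it's worth, here's one plausible way to write down that stopping test as a toy sketch -- again, not Lucene code, and the impact bounds and queue score are just illustrative parameters:

    // Stop once no unseen doc could displace the weakest doc in the
    // HitQueue. Every remaining hit must contain B and may contain A,
    // so its score is bounded by the sum of the two remaining maxima.
    public class RequiredClauseSketch {
        static boolean canStop(boolean queueIsFull, float worstScoreInQueue,
                               float maxRemainingImpactA,
                               float maxRemainingImpactB) {
            return queueIsFull
                && maxRemainingImpactA + maxRemainingImpactB
                   <= worstScoreInQueue;
        }

        public static void main(String[] args) {
            // 'dinosaur +park': A ('dinosaur') still has high remaining
            // impact, so the test keeps failing and we keep digging
            // deeper into the low-scoring 'park' list.
            System.out.println(canStop(true, 1.2f, 0.9f, 0.4f)); // false
            System.out.println(canStop(true, 1.2f, 0.2f, 0.1f)); // true
        }
    }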

  +A +B

Ouch. This is just a slightly less severe version of the phrase matching conundrum. I don't see a good way to handle this at all. Basically, you need to load two BitVectors and take the intersection, then go back and score.
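A toy sketch of what that would look like (invented doc nums, with java.util.BitSet standing in for BitVector): enumerate both postings lists completely, intersect, and only then go back and score the survivors -- which throws away most of the benefit of impact ordering.

    import java.util.BitSet;

    public class ConjunctionSketch {
        public static void main(String[] args) {
            int[] aDocs = { 4, 8, 17, 23, 42 };   // made-up doc nums for A
            int[] bDocs = { 4, 9, 42, 99 };       // made-up doc nums for B

            BitSet aBits = new BitSet();
            for (int doc : aDocs) aBits.set(doc);  // full scan of A
            BitSet bBits = new BitSet();
            for (int doc : bDocs) bBits.set(doc);  // full scan of B

            aBits.and(bBits);                      // the intersection: +A +B
            for (int doc = aBits.nextSetBit(0); doc >= 0;
                 doc = aBits.nextSetBit(doc + 1)) {
                System.out.println("score doc " + doc);  // second pass
            }
        }
    }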

I think we can stop there. I don't see any way to make an impact-sorted posting list work efficiently with boolean logic. It only works with pure vector space.

The trick they used in Nutch wasn't to reorder the postings. What they did was sort the entire index so that documents were ordered by descending document boost. Say you want 10 results. If you grab the first, say, 100 hits, and sort them by score, the odds are pretty good that you'll find most of the same docs you would have found after crunching through your entire index. Grab the first 1000 hits, and the odds are even better. (That's assuming that document boost plays a dominant role in your scoring algorithm.) It's a good heuristic. However, it doesn't coexist happily with incremental indexing.
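Purely as a sketch, and not the actual Nutch code, the collection side of that heuristic looks something like this (the cutoff, doc nums, and the score() stand-in are all made up):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // The index is pre-sorted by descending document boost, so docs
    // arrive in roughly "best first" order. Stop after the first
    // CUTOFF matches, score just those, and you usually end up with
    // nearly the same top 10.
    public class BoostSortedSketch {
        static final int CUTOFF = 1000;

        public static void main(String[] args) {
            List<float[]> collected = new ArrayList<float[]>(); // {doc, score}
            // Doc nums in index order, i.e. by descending boost.
            int[] matches = { 0, 3, 7, 12, 31, 45 /* ... */ };
            for (int doc : matches) {
                collected.add(new float[] { doc, score(doc) });
                if (collected.size() >= CUTOFF) {
                    break;                       // early termination
                }
            }
            // Re-rank only the collected docs by full score, descending.
            Collections.sort(collected,
                             (a, b) -> Float.compare(b[1], a[1]));
            for (int i = 0; i < Math.min(10, collected.size()); i++) {
                System.out.println("doc " + (int) collected.get(i)[0]
                                   + " score " + collected.get(i)[1]);
            }
        }

        static float score(int doc) {
            return 1.0f / (doc + 1);  // stand-in for the real scoring formula
        }
    }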

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


