On Jan 11, 2007, at 2:30 PM, jian chen wrote:

It seems to me that the impact-sorted list makes sense if you are trying to do pure vector-space-based ranking. That is what I have gathered from the research papers; they all talk about how to optimize the vector space model using this impact-sorted list approach.

That makes sense. It would work well for Lucene queries that happen to mimic pure vector queries: a simple TermQuery, or a BooleanQuery consisting entirely of "should" clauses wrapping TermQueries.
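Just to be concrete, here's roughly what I mean by those two cases, written against the BooleanQuery/TermQuery API as it stands today (the field name "content" and the terms are made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    public class PureVectorQueries {
        public static void main(String[] args) {
            // The simplest "pure vector" case: a single TermQuery.
            Query term = new TermQuery(new Term("content", "dinosaur"));

            // A BooleanQuery consisting entirely of "should" clauses
            // wrapping TermQueries -- still effectively a vector query.
            BooleanQuery vector = new BooleanQuery();
            vector.add(new TermQuery(new Term("content", "dinosaur")),
                       BooleanClause.Occur.SHOULD);
            vector.add(new TermQuery(new Term("content", "park")),
                       BooleanClause.Occur.SHOULD);

            System.out.println(term + " | " + vector);
        }
    }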

Let's explore how it would work for other varieties of BooleanQuery...

  A -B

Not so good. We need to iterate over all the doc nums for B. We could start by fetching only some of the docs for A, then iterate over all the docs for B and see whether we still have enough docs after filtering. If we don't and have to go back for more of A, we have to start over from scratch iterating over B. Either that, or we need to save set B in a BitVector just in case. Yuck.
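Here's a toy sketch -- not real Lucene code -- of what that BitVector fallback amounts to. The doc nums and impacts are invented, and a plain java.util.BitSet stands in for our BitVector:

    import java.util.BitSet;

    // Even though we only want the top of A's impact-sorted postings,
    // we must read *all* of B's postings up front before we can safely
    // filter anything.
    public class NotClauseSketch {
        public static void main(String[] args) {
            // Doc nums for A, already ordered by descending impact.
            int[] aDocs      = { 17, 4, 42, 8, 23 };
            float[] aImpacts = { 0.9f, 0.8f, 0.5f, 0.3f, 0.1f };
            // Doc nums for B -- every one of them has to be visited.
            int[] bDocs      = { 42, 8, 99, 4 };

            BitSet prohibited = new BitSet();
            for (int doc : bDocs) {       // full scan of B, no early exit
                prohibited.set(doc);
            }

            int wanted = 2;               // e.g. top-2 hits
            for (int i = 0; i < aDocs.length && wanted > 0; i++) {
                if (!prohibited.get(aDocs[i])) {
                    System.out.println("hit: doc " + aDocs[i]
                                       + " impact " + aImpacts[i]);
                    wanted--;
                }
            }
        }
    }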

  A +B

Hmm. You start with the highest-ranking docs for B. You stop when the impact of A has dropped low enough that no subsequent document can displace a current doc in the HitQueue. But what if B is a low-scoring term compared to A, e.g. 'dinosaur +park'? You'll have to go deep into the list for 'park'. And things just get really complicated and murky. I can't see how anything less than scoring all docs, like we do now, will reliably produce decent precision.
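For what it's worth, here's one plausible way to write down that stopping test as a toy sketch -- again, not Lucene code, and the impact bounds and queue score are just illustrative parameters:

    // Stop once no unseen doc could displace the weakest doc in the
    // HitQueue. Every remaining hit must contain B and may contain A,
    // so its score is bounded by the sum of the two remaining maxima.
    public class RequiredClauseSketch {
        static boolean canStop(boolean queueIsFull, float worstScoreInQueue,
                               float maxRemainingImpactA,
                               float maxRemainingImpactB) {
            return queueIsFull
                && maxRemainingImpactA + maxRemainingImpactB
                   <= worstScoreInQueue;
        }

        public static void main(String[] args) {
            // 'dinosaur +park': A ('dinosaur') still has high remaining
            // impact, so the test keeps failing and we keep digging
            // deeper into the low-scoring 'park' list.
            System.out.println(canStop(true, 1.2f, 0.9f, 0.4f)); // false
            System.out.println(canStop(true, 1.2f, 0.2f, 0.1f)); // true
        }
    }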

  +A +B

Ouch. This is just a slightly less severe version of the phrase matching conundrum. I don't see a good way to handle this at all. Basically, you need to load two BitVectors and take the intersection, then go back and score.
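A toy sketch of what that would look like (invented doc nums, with java.util.BitSet standing in for BitVector): enumerate both postings lists completely, intersect, and only then go back and score the survivors -- which throws away most of the benefit of impact ordering.

    import java.util.BitSet;

    public class ConjunctionSketch {
        public static void main(String[] args) {
            int[] aDocs = { 4, 8, 17, 23, 42 };   // made-up doc nums for A
            int[] bDocs = { 4, 9, 42, 99 };       // made-up doc nums for B

            BitSet aBits = new BitSet();
            for (int doc : aDocs) aBits.set(doc);  // full scan of A
            BitSet bBits = new BitSet();
            for (int doc : bDocs) bBits.set(doc);  // full scan of B

            aBits.and(bBits);                      // the intersection: +A +B
            for (int doc = aBits.nextSetBit(0); doc >= 0;
                 doc = aBits.nextSetBit(doc + 1)) {
                System.out.println("score doc " + doc);  // second pass
            }
        }
    }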

I think we can stop there. I don't see any way to make an impact-sorted posting list work efficiently with boolean logic. It only works with pure vector space.

The trick they used in Nutch wasn't to reorder the postings. What they did was sort the entire index so that documents were ordered by descending document boost. Say you want 10 results. If you grab the first, say, 100 hits, and sort them by score, the odds are pretty good that you'll find most of the same docs you would have found after crunching through your entire index. Grab the first 1000 hits, and the odds are even better. (That's assuming that document boost plays a dominant role in your scoring algorithm.) It's a good heuristic. However, it doesn't coexist happily with incremental indexing.
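Purely as a sketch, and not the actual Nutch code, the collection side of that heuristic looks something like this (the cutoff, doc nums, and the score() stand-in are all made up):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // The index is pre-sorted by descending document boost, so docs
    // arrive in roughly "best first" order. Stop after the first
    // CUTOFF matches, score just those, and you usually end up with
    // nearly the same top 10.
    public class BoostSortedSketch {
        static final int CUTOFF = 1000;

        public static void main(String[] args) {
            List<float[]> collected = new ArrayList<float[]>(); // {doc, score}
            // Doc nums in index order, i.e. by descending boost.
            int[] matches = { 0, 3, 7, 12, 31, 45 /* ... */ };
            for (int doc : matches) {
                collected.add(new float[] { doc, score(doc) });
                if (collected.size() >= CUTOFF) {
                    break;                       // early termination
                }
            }
            // Re-rank only the collected docs by full score, descending.
            Collections.sort(collected,
                             (a, b) -> Float.compare(b[1], a[1]));
            for (int i = 0; i < Math.min(10, collected.size()); i++) {
                System.out.println("doc " + (int) collected.get(i)[0]
                                   + " score " + collected.get(i)[1]);
            }
        }

        static float score(int doc) {
            return 1.0f / (doc + 1);  // stand-in for the real scoring formula
        }
    }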

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


