Re: Beyond Lucene 2.0 Index Design

Marvin Humphrey Thu, 11 Jan 2007 23:25:22 -0800


On Jan 11, 2007, at 8:37 PM, Ming Lei wrote:

> But practically, the approximation (as in my original
> post) should work well enough for large corpus and
> relevancy-driven retrieval.
>
> The saving on disk access for large corpus (implies
> very long posting list) will be huge by impact-sorted
> posting list.

For 'A OR B', or a simple TermQuery, absolutely. But for anything else, I'm skeptical. How about...


   'townshend OR (the AND who)'

You'll start with a small number of docs which have both 'the' and 'who'. And then, you'll vet them to see if they contain 'townshend'; only a minuscule portion will, so you'll have to go redo... and redo... and redo...

All that redo-ing is going to be considerably less efficient than just plowing through the complete posting lists in the first place, right? That's a lot of disk seeks... not to mention all those redundant VInt reads.

And each time you redo, you'll need to build an intersection using bitvectors or whatever, which requires RAM and processor above and beyond what's needed to drop a doc-score pair into a HitCollector.
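
To put a rough number on the redo cost, here is a minimal sketch in Java. It is not code from Lucene or KinoSearch; the class PrefixRedoSketch, the method postingsReadWithRedo, and the prefix-doubling policy are all hypothetical, and it assumes each redo re-reads both impact-sorted lists from the start, which is the pessimistic case described above. It handles only a plain two-term AND and counts how many postings get read before enough docs turn up in both prefixes.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not an actual Lucene or KinoSearch scorer.
final class PrefixRedoSketch {
    // a and b are impact-sorted posting lists (doc numbers in arbitrary order,
    // no duplicates within a list). Returns the total number of postings read,
    // counting re-reads, until `wanted` docs are found in both prefixes or
    // both lists are exhausted.
    static int postingsReadWithRedo(int[] a, int[] b, int wanted, int startPrefix) {
        int read = 0;
        int prefix = Math.max(1, startPrefix); // guard: doubling 0 would never grow
        while (true) {
            int na = Math.min(prefix, a.length);
            int nb = Math.min(prefix, b.length);
            read += na + nb; // assumption: each pass rescans both prefixes from the start

            // Rebuild the intersection from scratch, as each redo would have to.
            Set<Integer> seen = new HashSet<>();
            for (int i = 0; i < na; i++) {
                seen.add(a[i]);
            }
            int hits = 0;
            for (int j = 0; j < nb; j++) {
                if (seen.contains(b[j])) {
                    hits++;
                }
            }

            boolean exhausted = (na == a.length && nb == b.length);
            if (hits >= wanted || exhausted) {
                return read;
            }
            prefix *= 2; // not enough hits: widen the prefix and redo
        }
    }
}
```

With two 8-entry lists whose only common doc sits at the tail of both, and a starting prefix of 1, this reads 30 postings across four passes, against 16 for a single pass over both complete lists. A scheme that resumes where the previous pass stopped would read fewer, but it would still have to keep the partial intersection around between passes.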

And trying to tune those heuristics so that you can tell where the proper pruning threshold lies for a given clause in the middle of a complex boolean query... the scorers are already the hardest part of Lucene to grok. Won't this add more complexity to the place where we already have more than enough? I shudder to think how difficult it would become to uncover bugs in a pruning BooleanScorer.

I really want this to work out, because I can put it into KinoSearch 0.20 if it does -- backwards compatibility is already out the window. But I just don't see how to make it happen.

Can you show us some code or pseudo-code for a BooleanScorer that would use impact-sorted posting lists?

> I also see that a flexible framework to support
> multiple indexing scheme will make Lucene a general
> search framework and test bed for innovative search
> algorithms and gain further ground in research
> community.

Yes, that's the idea. There are limits to the amount of flexibility we can offer, though. With a unified postings file, especially with one file per field, changing the amount of information in each posting is reasonably straightforward. Changing the sort order, OTOH, is not, because every component in the scoring apparatus assumes that it will be fed document numbers in ascending order.
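
As an illustration of that ascending-order assumption, here is a simplified sketch in Java of the kind of two-pointer merge a conjunction relies on. It is not Lucene's actual scorer code; the class DocOrderedConjunction and its intersect method are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Lucene's actual scorer code.
final class DocOrderedConjunction {
    // Intersects two posting lists that are each expected to be sorted by
    // ascending doc number. Each step discards the smaller doc number, which
    // is only safe if everything still to come in the other list is larger.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                hits.add(a[i]); // doc appears in both lists
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++; // a is behind; skip forward
            } else {
                j++; // b is behind; skip forward
            }
        }
        return hits;
    }
}
```

Given {1, 4, 7, 9} and {2, 4, 9, 12}, the merge finds docs 4 and 9 in one pass. Feed it the same postings in some other order, say {9, 1, 7, 4} and {4, 12, 9, 2}, and it finds nothing, because a doc number skipped as "too small" can show up later in the other list.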


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



