On Jan 11, 2007, at 8:37 PM, Ming Lei wrote:
> But practically, the approximation (as in my original
> post) should work well enough for a large corpus and
> relevancy-driven retrieval.
> The saving on disk access for a large corpus (which implies
> very long posting lists) will be huge with impact-sorted
> posting lists.
For 'A OR B', or a simple TermQuery, absolutely. But for anything
else, I'm skeptical. How about...
'townshend OR (the AND who)'
You'll start with a small number of docs which have both 'the' and
'who'. And then, you'll vet them to see if they contain
'townshend'; only a minuscule portion will, so you'll have to go
redo... and redo... and redo...
All that redo-ing is going to be considerably less efficient than
just plowing through the complete posting lists in the first place,
right? That's a lot of disk seeks... not to mention all those
redundant VInt reads.
And each time you redo, you'll need to build an intersection using
bitvectors or whatever, which requires RAM and processor above and
beyond what's needed to drop a doc-score pair into a HitCollector.
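
To put rough numbers on the redo cost, here is a minimal sketch of one
possible reading of the approach: read a prefix of each impact-sorted
list, intersect, and if too few docs survive, double the prefix and
start over. The function name, the doubling policy, and the toy data
are all assumptions made for illustration; they are not taken from the
original proposal or from Lucene.

```python
def intersect_with_redo(list_a, list_b, wanted, start_depth=2):
    """Intersect two impact-sorted posting lists by growing prefixes.

    Each list holds (doc_id, impact) pairs sorted by descending impact.
    Returns (hits, postings_read), where postings_read counts every
    posting touched, including the ones re-read on each redo.
    """
    depth = start_depth
    postings_read = 0
    max_len = max(len(list_a), len(list_b))
    while True:
        prefix_a = list_a[:depth]
        prefix_b = list_b[:depth]
        postings_read += len(prefix_a) + len(prefix_b)
        # Doc ids are not in ascending order, so build a set to intersect.
        docs_b = {doc for doc, _ in prefix_b}
        hits = [doc for doc, _ in prefix_a if doc in docs_b]
        if len(hits) >= wanted or depth >= max_len:
            return hits, postings_read
        depth *= 2  # hypothetical policy: double the prefix and redo


# Toy worst case: the high-impact docs of one term are the low-impact
# docs of the other, so no prefix intersects until both lists are read.
list_a = [(doc, 100 - doc) for doc in range(16)]      # docs 0, 1, ..., 15
list_b = [(doc, doc) for doc in reversed(range(16))]  # docs 15, 14, ..., 0

hits, postings_read = intersect_with_redo(list_a, list_b, wanted=1)
full_scan = len(list_a) + len(list_b)
print(postings_read, full_scan)
```

On this deliberately hostile data the redo loop touches 60 postings
where a single pass over both complete lists touches 32. Real lists
won't be this adversarial, but a rare term vetted against a common
pair pushes in the same direction.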
And trying to tune those heuristics so that you can tell where the
proper pruning threshold lies for a given clause in the middle of a
complex boolean query... the scorers are already the hardest part of
Lucene to grok. Won't this add more complexity to the place where we
already have more than enough? I shudder to think how difficult it
would become to uncover bugs in a pruning BooleanScorer.
I really want this to work out, because I can put it into KinoSearch
0.20 if it does -- backwards compatibility is already out the
window. But I just don't see how to make it happen.
Can you show us some code or pseudo-code for a BooleanScorer that
would use impact-sorted posting lists?
> I also see that a flexible framework to support
> multiple indexing schemes will make Lucene a general
> search framework and test bed for innovative search
> algorithms and gain further ground in the research
> community.
Yes, that's the idea. There are limits to the amount of flexibility
we can offer, though. With a unified postings file, especially with
one file per field, changing the amount of information in each
posting is reasonably straightforward. Changing the sort order,
OTOH, is not, because every component in the scoring apparatus
assumes that it will be fed document numbers in ascending order.
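
To illustrate that assumption, here is a minimal sketch of a
doc-ordered intersection of the kind an AND clause relies on. It is
not Lucene's actual scorer code; the function name and the doc numbers
are made up for the example.

```python
def intersect_doc_ordered(postings_a, postings_b):
    """Two-pointer intersection of two lists of doc numbers.

    Correct only if both lists are in ascending doc-number order:
    whichever side holds the smaller doc number is advanced, on the
    assumption that the skipped doc cannot appear later in the other list.
    """
    i = j = 0
    matches = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            matches.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return matches


# Ascending doc order: all three common docs are found.
the_docs = [1, 3, 4, 7, 9]
who_docs = [2, 3, 7, 8, 9]
print(intersect_doc_ordered(the_docs, who_docs))

# The same postings in some hypothetical impact order: the merge skips
# past docs it can never revisit and silently drops two matches.
the_by_impact = [7, 1, 9, 3, 4]
who_by_impact = [3, 9, 2, 8, 7]
print(intersect_doc_ordered(the_by_impact, who_by_impact))
```

The first call returns [3, 7, 9]; the second returns only [9]. Nothing
fails loudly, the answer is just wrong, which is why changing the sort
order reaches into every scorer.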
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/