Marvin, Several posts back on this thread, I talked about an algorithm of impact-sorted posting list for conjunctive boolean query. Your concerns on impact-sorting in boolean retrieval model is valid. But practically, the approximation (as in my original post) should work well enough for large corpus and relevancy-driven retrieval.
The saving on disk access for large corpus (implies very long posting list) will be huge by impact-sorted posting list. I also see that a flexible framework to support multiple indexing scheme will make Lucene a general search framework and test bed for innovative search algorithms and gain further ground in research community. I would happy to put my efforts on this. Michael Lei Search Labs Live Search, Microsoft Corp. > That makes sense. It would work well for Lucene > queries that happen > to mimic pure vector queries: a simple TermQuery, or > a BooleanQuery > consisting entirely of "should" clauses wrapping > TermQueries. > > Let's explore how it would work for other varieties > of BooleanQuery... > > A -B > > Not so good. We need to iterate over all the docs > nums for B. We > could start by fetching only some of the docs for A, > and then > iterating over all the docs for B and seeing whether > we still have > enough docs after filtering. If we don't, and we > have to go back, we > have to start over from scratch iterating over B. > Either that or we > need to save set B in a BitVector just in case. > Yuck. > > A +B > > Hmm. You start with the highest ranking docs for B. > You stop when > the impact of A has dropped low enough that no > subsequent document > can displace a current doc in the HitQueue. But > what if B is a low- > scoring term compared to A, e.g. 'dinosaur +park' ? > You'll have to > go deep the list for 'park'. And things just get > really complicated > and murky. I can't see how anything less than > scoring all docs like > we do now will reliably produce decent precision. > > +A +B > > Ouch. This is just a slightly less severe version > of the phrase > matching conundrum. I don't see a good way to > handle this at all. > Basically, you need to load two BitVectors and take > the intersection, > then go back and score. > > I think we can stop there. I don't see any way to > make an impact- > sorted posting list work efficiently with boolean > logic. It only > works with pure vector space. > > The trick they used in Nutch wasn't to reorder the > postings. What > they did was sort the entire index so that documents > were ordered by > descending document boost. Say you want 10 results. > If you grab the > first, say, 100 hits, and sort them by score, the > odds are pretty > good that you'll find most of the same docs you > would have found > after crunching through your entire index. Grab the > first 1000 hits, > and the odds are even better. (That's assuming that > document boost > plays a dominant role in your scoring algorithm.) > It's a good > heuristic. However, it doesn't coexist happily with > incremental > indexing. > > Marvin Humphrey > Rectangular Research > http://www.rectangular.com/ > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: > [EMAIL PROTECTED] > For additional commands, e-mail: > [EMAIL PROTECTED] > > ____________________________________________________________________________________ Bored stiff? Loosen up... Download and play hundreds of games for free on Yahoo! Games. http://games.yahoo.com/games/front --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]