Re: Beyond Lucene 2.0 Index Design

Ming Lei Thu, 11 Jan 2007 20:37:51 -0800

Marvin,
Several posts back on this thread, I talked about an
algorithm of impact-sorted posting list for
conjunctive boolean query. Your concerns on
impact-sorting in boolean retrieval model is valid.
But practically, the approximation (as in my original
post) should work well enough for large corpus and
relevancy-driven retrieval.


The saving on disk access for large corpus (implies
very long posting list) will be huge by impact-sorted
posting list.

I also see that a flexible framework to support
multiple indexing scheme will make Lucene a general
search framework and test bed for innovative search
algorithms and gain further ground in research
community.

I would happy to put my efforts on this.

Michael Lei
Search Labs
Live Search, Microsoft Corp.

> That makes sense.  It would work well for Lucene
> queries that happen  
> to mimic pure vector queries: a simple TermQuery, or
> a BooleanQuery  
> consisting entirely of "should" clauses wrapping
> TermQueries.
> 
> Let's explore how it would work for other varieties
> of BooleanQuery...
> 
>    A -B
> 
> Not so good.  We need to iterate over all the docs
> nums for B.  We  
> could start by fetching only some of the docs for A,
> and then  
> iterating over all the docs for B and seeing whether
> we still have  
> enough docs after filtering. If we don't, and we
> have to go back, we  
> have to start over from scratch iterating over B. 
> Either that or we  
> need to save set B in a BitVector just in case. 
> Yuck.
> 
>    A +B
> 
> Hmm.  You start with the highest ranking docs for B.
>  You stop when  
> the impact of A has dropped low enough that no
> subsequent document  
> can displace a current doc in the HitQueue.  But
> what if B is a low- 
> scoring term compared to A, e.g. 'dinosaur +park' ? 
> You'll have to  
> go deep the list for 'park'.  And things just get
> really complicated  
> and murky.  I can't see how anything less than
> scoring all docs like  
> we do now will reliably produce decent precision.
> 
>    +A +B
> 
> Ouch.  This is just a slightly less severe version
> of the phrase  
> matching conundrum.  I don't see a good way to
> handle this at all.   
> Basically, you need to load two BitVectors and take
> the intersection,  
> then go back and score.
> 
> I think we can stop there.  I don't see any way to
> make an impact- 
> sorted posting list work efficiently with boolean
> logic.  It only  
> works with pure vector space.
> 
> The trick they used in Nutch wasn't to reorder the
> postings.  What  
> they did was sort the entire index so that documents
> were ordered by  
> descending document boost.  Say you want 10 results.
>  If you grab the  
> first, say, 100 hits, and sort them by score, the
> odds are pretty  
> good that you'll find most of the same docs you
> would have found  
> after crunching through your entire index.  Grab the
> first 1000 hits,  
> and the odds are even better.  (That's assuming that
> document boost  
> plays a dominant role in your scoring algorithm.) 
> It's a good  
> heuristic.  However, it doesn't coexist happily with
> incremental  
> indexing.
> 
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
> 
> 
> 
>
---------------------------------------------------------------------
> To unsubscribe, e-mail:
> [EMAIL PROTECTED]
> For additional commands, e-mail:
> [EMAIL PROTECTED]
> 
> 



 
____________________________________________________________________________________
Bored stiff? Loosen up... 
Download and play hundreds of games for free on Yahoo! Games.
http://games.yahoo.com/games/front

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Beyond Lucene 2.0 Index Design

Reply via email to