RE: Beyond Lucene 2.0 Index Design

Dalton, Jeffery Fri, 12 Jan 2007 05:55:05 -0800

Lucene combines the vector space similarity model with the Boolean
model.  A Lucene query is a ranked Boolean query: documents must meet
certain Boolean criteria, and the matching set is then ranked by
similarity score.  If you didn't care about returning the "top" hits,
then I would agree that a docId-sorted list would be perfectly
applicable.  However, in a ranked query, even one where Boolean
constraints are enforced, impacts can be more efficient.  (More to
follow.)
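
To make the "ranked Boolean" idea concrete, here is a minimal Python
sketch.  The toy index, its weights, and the name ranked_and_query are
assumptions for illustration only; this is not Lucene's API or its
actual scoring formula.

```python
import heapq

# Hypothetical toy index: term -> {doc_id: per-term score contribution}.
# The weights are made-up numbers, not Lucene's similarity formula.
INDEX = {
    "lucene": {1: 1.0, 2: 0.5, 3: 0.75},
    "index": {2: 0.75, 3: 0.25, 4: 0.5},
}


def ranked_and_query(index, terms, k):
    """Return the top-k (doc_id, score) pairs among docs containing all terms."""
    postings = [index.get(term, {}) for term in terms]
    if not postings:
        return []
    # Boolean step: a document must appear in every term's posting list.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    # Ranking step: sum the per-term contributions and keep the k best.
    scored = ((sum(p[doc] for p in postings), doc) for doc in candidates)
    return [(doc, score) for score, doc in heapq.nlargest(k, scored)]


top_hits = ranked_and_query(INDEX, ["lucene", "index"], k=1)
```

Documents 1 and 4 fail the Boolean test, so only documents 2 and 3 are
scored, and only the best of those is returned.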


> -----Original Message-----
> From: Ming Lei [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, January 10, 2007 5:41 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Beyond Lucene 2.0 Index Design
> 
> I have a couple of questions about the original post of the 
> new index design:
> 
> (1) Question on the posting list
> > > f. <impact, num_docs, ([doc1, freq, <positions>],...[docN, freq, <positions>])
> What is the "impact" per posting list? I am under the 
> impression that "impact" or "frequency" is per pair of doc and term. 
> 
> And it seems that "impact" or "frequency" needs to be stored 
> for each doc on the posting list of a term. There are two 
> reasons: to efficiently stop the traversal at some point at 
> search time by looking at the "impact" value, and to get the 
> component score without recalculation at search time.
> 
> 
> (2) I wonder whether Lucene is really based upon vector-space 
> model. I am under the impression that the hits are selected 
> using boolean model and only the scoring (on the hit set) 
> uses vector space model.
> 
> If so, the effect on boolean queries is not very positive. 
> 
> For a query like "termA AND termB", I suppose the posting 
> lists of both A and B have to be fully traversed, right? The 
> partial traversal is only possible for disjunctions or 
> single-term queries. And the join of the two posting lists will 
> be more costly than on the original docID-sorted posting lists.
> 
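
The cost difference for "termA AND termB" raised above can be sketched
in Python.  This is an illustration under simple assumptions (in-memory
lists of integer docIDs, hypothetical function names), not how Lucene
implements conjunctions.

```python
def intersect_sorted(a, b):
    """Two-pointer merge join of two ascending docID lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_impact_ordered(a, b):
    """Impact-ordered lists share no common order, so one list is
    materialized as a set before the other is scanned."""
    seen = set(a)
    return sorted(doc for doc in b if doc in seen)


# Same two posting lists, once in docID order and once in a made-up
# impact order.
term_a, term_b = [2, 5, 9, 14], [5, 7, 14, 20]
term_a_by_impact, term_b_by_impact = [14, 2, 9, 5], [7, 14, 20, 5]

both_sorted = intersect_sorted(term_a, term_b)
both_impact = intersect_impact_ordered(term_a_by_impact, term_b_by_impact)
```

The merge join needs one pass and no extra memory beyond the output; the
impact-ordered variant needs a lookup structure over one whole list.
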
> (3) As to Jian's question below,
> A phrase query is a special case of a conjunctive boolean query. 
> 
> Michael
> 
> 
> --- jian chen <[EMAIL PROTECTED]> wrote:
> 
> > Hi, Jeff,
> > 
> > Also, how would you handle phrase-based queries?
> > 
> > For example, here are two posting lists:
> > 
> > TermA: X Y
> > TermB: Y X
> > 
> > I am not sure how you would return document X or Y for a search of
> > the phrase "TermA TermB". Which should come first?
> > 
> > Thanks,
> > 
> > Jian
> > 
> > On 1/9/07, Dalton, Jeffery <[EMAIL PROTECTED]>
> > wrote:
> > >
> > > I'm not sure we fully understand one another, but I'll try to
> > > explain what I am thinking.
> > >
> > > Yes, it has use after sorting.  It is used at query time for
> > > document scoring in place of the TF and length norm components
> > > (new scorers would need to be created).
> > >
> > > Using an impact based index moves most of the scoring from query
> > > time to index time (trades query time flexibility for greatly
> > > improved query search performance).  Because the field boosts,
> > > length norm, position boosts, etc... are incorporated into a
> > > single document-term-score, you can use a single field at search
> > > time.  It allows one posting list per query term instead of the
> > > current one posting list per field per query term
> > > (MultiFieldQueryParser wouldn't be necessary in most cases).  In
> > > addition to having fewer posting lists to examine, you often
> > > don't need to read to the end of long posting lists when
> > > processing with a score-at-a-time approach (see Anh/Moffat's
> > > Pruned Query Evaluation Using Pre-Computed Impacts, SIGIR 2006,
> > > for details on one potential algorithm).
> > >
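
As a rough illustration of why an impact-ordered list need not be read
to the end, here is a minimal Python sketch for the single-term case,
using the block layout of format "e" from the original post
(<impact, num_docs, (doc1,...docN)>).  The sample data and the name
top_k_single_term are hypothetical.  This is not the Anh/Moffat
algorithm, only the simplest case of stopping once k hits are in hand.

```python
# Hypothetical posting list: (impact, num_docs, doc_ids) blocks,
# stored with the highest impact first.
POSTINGS = [
    (9, 2, [17, 42]),
    (5, 3, [3, 8, 21]),
    (1, 4, [1, 2, 5, 30]),
]


def top_k_single_term(postings, k):
    """Collect the k highest-impact docs and count the blocks read."""
    hits = []
    blocks_read = 0
    for impact, num_docs, doc_ids in postings:
        if len(hits) >= k:
            break  # every remaining block has a lower impact
        blocks_read += 1
        take = min(num_docs, k - len(hits))
        hits.extend((doc_id, impact) for doc_id in doc_ids[:take])
    return hits, blocks_read


hits, blocks_read = top_k_single_term(POSTINGS, k=3)
```

With k=3 the third (longest, lowest-impact) block is never touched.  A
docId-sorted list would have to be scanned in full to find the same
three documents.
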
> > > I'm not quite sure what you mean when you mention leaving them
> > > out and re-calculating them at merge time.
> > >
> > > - Jeff
> > >
> > > > -----Original Message-----
> > > > From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
> > > > Sent: Tuesday, January 09, 2007 2:58 PM
> > > > To: java-dev@lucene.apache.org
> > > > Subject: Re: Beyond Lucene 2.0 Index Design
> > > >
> > > >
> > > > On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote:
> > > >
> > > > > e. <impact, num_docs, (doc1,...docN)>
> > > > > f. <impact, num_docs, ([doc1, freq, <positions>],...[docN, freq, <positions>])
> > > >
> > > > Does the impact have any use after it's used to sort the postings?
> > > > Can we leave it out of the index format and recalculate at merge-time?
> > > >
> > > > Marvin Humphrey
> > > > Rectangular Research
> > > > http://www.rectangular.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
