Lucene combines the vector space similarity model with the Boolean model. A Lucene query is a ranked Boolean query: documents must meet certain Boolean criteria, and the resulting list is then ranked by similarity score. If you didn't care about returning the "top" hits, I would agree that a docId-sorted list would be perfectly adequate. However, for a ranked query, even one where Boolean constraints are enforced, impacts can be more efficient. (More to follow.)
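To make the "ranked Boolean" idea concrete, here is a rough sketch in Python. It is not Lucene code: the toy index, the integer per-term scores, and the names POSTINGS, intersect and ranked_and are all made up for illustration. It only shows the two-stage shape, where the Boolean AND selects the candidate documents and a summed score then ranks them.

```python
import heapq

# Hypothetical toy index (not Lucene's format): term -> postings sorted by
# doc id; each posting is (doc_id, integer score contribution for that term).
POSTINGS = {
    "lucene": [(1, 9), (2, 2), (4, 5), (7, 7)],
    "index": [(2, 8), (4, 1), (5, 6), (7, 4)],
}


def intersect(a, b):
    """Merge two doc-id-sorted posting lists, keeping docs present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        doc_a, score_a = a[i]
        doc_b, score_b = b[j]
        if doc_a == doc_b:
            # Doc satisfies the AND; accumulate its score.
            out.append((doc_a, score_a + score_b))
            i += 1
            j += 1
        elif doc_a < doc_b:
            i += 1
        else:
            j += 1
    return out


def ranked_and(postings, terms, k):
    """Boolean AND picks the candidates; the summed score ranks them."""
    if not terms:
        return []
    lists = [postings.get(term, []) for term in terms]
    hits = lists[0]
    for other in lists[1:]:
        hits = intersect(hits, other)
    # Top k by score; ties go to the lower doc id.
    return heapq.nlargest(k, hits, key=lambda hit: (hit[1], -hit[0]))
```

With this toy data, ranked_and(POSTINGS, ["lucene", "index"], 2) returns [(7, 11), (2, 10)]: documents 2, 4 and 7 pass the AND, and only the two best-scoring ones are kept. Note that both lists are walked in full here, which is the cost that ordering postings by impact tries to avoid.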
> -----Original Message-----
> From: Ming Lei [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, January 10, 2007 5:41 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Beyond Lucene 2.0 Index Design
>
> I have a couple of questions about the original post of the
> new index design:
>
> (1) Question on the posting list
>
> > > f. <impact, num_docs, ([doc1, freq ,<positions>],...[docN, freq ,<positions>])
>
> What is the "impact" per posting list? I am under the
> impression that "impact" or "frequency" is per pair of doc and term.
>
> And it seems that "impact" or "frequency" needs to be stored
> for each doc on the posting list of a term. The reasons are
> two: to efficiently stop the traversal at some point at
> search time by looking at the "impact" value, and to get the
> component score without recalculation at search time.
>
> (2) I wonder whether Lucene is really based upon the vector-space
> model. I am under the impression that the hits are selected
> using the boolean model and only the scoring (on the hit set)
> uses the vector space model.
>
> If so, the effect on boolean queries is not very positive.
>
> For a query like "termA AND termB", I suppose the posting
> lists of both A and B have to be fully traversed, right? The
> partial traversal is only possible for disjunctions or single
> term queries. And the join of the two posting lists will be
> more costly than on the original docID-sorted posting lists.
>
> (3) As to Jian's question below,
> a phrase query is a special case of a conjunctive boolean query.
>
> Michael
>
> --- jian chen <[EMAIL PROTECTED]> wrote:
>
> > Hi, Jeff,
> >
> > Also, how to handle the phrase based queries?
> >
> > For example, here are two posting lists:
> >
> > TermA: X Y
> > TermB: Y X
> >
> > I am not sure how you would return document X or Y for a
> > search of the phrase "TermA Term B". Which should come first?
> >
> > Thanks,
> >
> > Jian
> >
> > On 1/9/07, Dalton, Jeffery <[EMAIL PROTECTED]> wrote:
> > >
> > > I'm not sure we fully understand one another, but I'll try to
> > > explain what I am thinking.
> > >
> > > Yes, it has use after sorting. It is used at query time for
> > > document scoring in place of the TF and length norm components
> > > (new scorers would need to be created).
> > >
> > > Using an impact based index moves most of the scoring from query
> > > time to index time (trades query time flexibility for greatly
> > > improved query search performance). Because the field boosts,
> > > length norm, position boosts, etc... are incorporated into a
> > > single document-term-score, you can use a single field at search
> > > time. It allows one posting list per query term instead of the
> > > current one posting list per field per query term
> > > (MultiFieldQueryParser wouldn't be necessary in most cases). In
> > > addition to having fewer posting lists to examine, you often
> > > don't need to read to the end of long posting lists when
> > > processing with a score-at-a-time approach (see Anh/Moffat's
> > > Pruned Query Evaluation Using Pre-Computed Impacts, SIGIR 2006,
> > > for details on one potential algorithm).
> > >
> > > I'm not quite sure what you mean when you mention leaving them
> > > out and re-calculating them at merge time.
> > >
> > > - Jeff
> > >
> > > > -----Original Message-----
> > > > From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
> > > > Sent: Tuesday, January 09, 2007 2:58 PM
> > > > To: java-dev@lucene.apache.org
> > > > Subject: Re: Beyond Lucene 2.0 Index Design
> > > >
> > > > On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote:
> > > >
> > > > > e. <impact, num_docs, (doc1,...docN)>
> > > > > f. <impact, num_docs, ([doc1, freq ,<positions>],...[docN, freq ,<positions>])
> > > >
> > > > Does the impact have any use after it's used to sort the postings?
> > > > Can we leave it out of the index format and recalculate at merge-time?
> > > >
> > > > Marvin Humphrey
> > > > Rectangular Research
> > > > http://www.rectangular.com/
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]