This may not be directly relevant to Lucene, but I wanted to learn how a web search engine handles something like this. Does it also "score every matching document on every query"? Or does it first pick a subset based on some static/offline ranking criteria and then do what Lucene does? Or does it find every matching document, pick a subset of those results based on a static ranking, and then score only that subset against the query terms?
I guess the assumption I am making is that it's not practical to "score every matching document on every query" at WWW scale. Maybe that assumption is wrong. I also haven't understood how search scales :(

-Antony

On Sun, Oct 23, 2011 at 10:18 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> "Why would it matter...top 5 matches" Because Lucene has to calculate
> the score of all documents in order to ensure that it returns those 5
> documents. What if the very last document scored was the most relevant?
>
> Best
> Erick
>
> On Sun, Oct 23, 2011 at 3:06 PM, sol myr <solmy...@yahoo.com> wrote:
> > Hi,
> >
> > We've noticed some Lucene performance phenomenon, and would appreciate an
> > explanation from anyone familiar with Lucene internals
> > (I know Lucene as a user, but haven't looked under its hood).
> >
> > We have a Lucene index of about 30 million records.
> > We ran 2 queries: "AND" and "OR" ("+john +doe" versus "john doe").
> > The AND query had much better performance (AND takes about 500 millis,
> > while OR takes about 2000 millis).
> >
> > We wondered whether this has anything to do with the number of potential
> > matches?
> > Our AND has only about 5000 matches (5000 documents contain *both* "john"
> > and "doe").
> > Our OR has about 8 million matches (8 million documents contain *either*
> > "john" or "doe").
> >
> > Does this explain the performance difference?
> > But why would it matter, as long as we take only the top 5 matches
> > (indexSearcher.search(query, 5))...?
> > Is there any other explanation?
> >
> > Thanks :)
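
To make Erick's point in the quoted reply concrete, here is a minimal sketch of exhaustive top-k selection. It is not Lucene's code: the class TopKSketch, the method topK and the scores array are hypothetical names made up for illustration, and the scores stand in for whatever a real scorer would compute per matching document. The heap only ever holds k entries, but the loop still has to visit every match, so the work grows with the number of matches (about 5000 for the AND query versus about 8 million for the OR query) rather than with k.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; none of these names come from Lucene.
class TopKSketch {

    // Returns the ids (array indexes) of the k best-scoring documents, best first.
    // scores[doc] is assumed to be the already computed score of matching document doc.
    static List<Integer> topK(double[] scores, int k) {
        List<Integer> best = new ArrayList<>();
        if (k <= 0) {
            return best;
        }
        // Min-heap ordered by score: the weakest of the current top k sits on top.
        PriorityQueue<Integer> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Integer d) -> scores[d]));
        // Every match is visited: any one of them, even the last, could belong in the top k.
        for (int doc = 0; doc < scores.length; doc++) {
            if (heap.size() < k) {
                heap.add(doc);
            } else if (scores[doc] > scores[heap.peek()]) {
                heap.poll();   // evict the weakest kept document
                heap.add(doc);
            }
        }
        best.addAll(heap);
        best.sort(Comparator.comparingDouble((Integer d) -> scores[d]).reversed());
        return best;
    }

    public static void main(String[] args) {
        // The last document (id 7) has the highest score, so stopping early would miss it.
        double[] scores = {0.2, 0.9, 0.1, 0.4, 0.3, 0.8, 0.5, 1.7};
        System.out.println(topK(scores, 5)); // prints [7, 1, 5, 6, 3]
    }
}
```

In this toy input the best document is the very last one scored, which is exactly the case Erick describes: asking for only 5 hits shrinks the heap, not the number of documents that must be scored.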