Re: performance question - number of documents

sol myr Mon, 24 Oct 2011 05:30:00 -0700

Hi,

Thanks for this reply.

Could I please just ask - doesn't Lucene keep the data sorted, at least 
partially (heuristically)?

E.g. if the inverted index says "the word DOE appears in documents #1, #7, #5".
Won't Lucene do some smart sorting on this list of documents? Maybe by 
frequency, first listing documents that contain many appearances of DOE?

I know ranking considers more subtle factors such as document length, "idf" to 
prioritize rare words, etc.
But if there are 8 million documents with the word DOE, and I only asked for 
the top 5, I might take a risk and limit the search to (say) the 1000 documents 
that contain the most appearances of that word, and only among them bother to 
calculate the exact ranking...
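
To make the risk concrete, here is a minimal sketch of that shortcut. It 
assumes a toy score (term frequency divided by the square root of document 
length) that is NOT Lucene's actual formula, and the class, method and field 
names are hypothetical, not Lucene APIs. It compares exact ranking against 
"keep the n most frequent, then rank":

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class PruneSketch {
    /** Toy posting: document id, frequency of the term, document length in words. */
    public static final class Posting {
        final int docId, tf, length;
        Posting(int docId, int tf, int length) {
            this.docId = docId; this.tf = tf; this.length = length;
        }
        // Assumed toy score: frequency damped by document length.
        double score() { return tf / Math.sqrt(length); }
    }

    /** Exact: score every posting and return the id of the best document. */
    public static int bestExact(List<Posting> postings) {
        return Collections.max(postings, Comparator.comparingDouble(Posting::score)).docId;
    }

    /** Shortcut: keep only the n postings with the highest tf, then score those. */
    public static int bestPruned(List<Posting> postings, int n) {
        List<Posting> byTf = new ArrayList<>(postings);
        byTf.sort(Comparator.comparingInt((Posting p) -> p.tf).reversed());
        return bestExact(byTf.subList(0, Math.min(n, byTf.size())));
    }

    /** Made-up data: documents #1, #7, #5 all contain the term. */
    public static List<Posting> sample() {
        return Arrays.asList(
            new Posting(1, 10, 10000),  // many hits, very long document
            new Posting(7, 8, 1600),    // many hits, long document
            new Posting(5, 2, 4));      // few hits, tiny document
    }

    public static void main(String[] args) {
        System.out.println("exact best:  #" + bestExact(sample()));
        System.out.println("pruned best: #" + bestPruned(sample(), 2));
    }
}
```

With this made-up data the exact ranking picks #5, while the shortcut with 
n = 2 never looks at #5 and picks #7. Whether such a shortcut is acceptable 
depends on how strongly the real score depends on factors other than 
frequency.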

That's not criticism, I'm no algorithms expert, I'm just raising the question 
and trying to learn...
Insights would be appreciated :)
Thanks again.

----- Original Message -----
From: Erick Erickson <[email protected]>
To: [email protected]; sol myr <[email protected]>
Cc: 
Sent: Sunday, October 23, 2011 7:18 PM
Subject: Re: performance question - number of documents

"Why would it matter...top 5 matches" Because Lucene has to calculate
the score of every matching document in order to ensure that it returns the
right 5 documents.
What if the very last document scored was the most relevant?
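
A minimal sketch of that point, assuming the scores of the matching
documents are simply given as an array. The class and method names are
hypothetical and this is not Lucene's collector code; it only illustrates
that keeping the top k is cheap while visiting every match is not optional:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class TopKSketch {
    /** Number of documents visited by the last call to topK. */
    public static int scored = 0;

    /** Returns the indexes of the k highest-scoring matches, best first. */
    public static List<Integer> topK(double[] scores, int k) {
        scored = 0;
        // Min-heap on score: the weakest of the current top k sits at the head.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>((a, b) -> Double.compare(scores[a], scores[b]));
        for (int doc = 0; doc < scores.length; doc++) {
            scored++;                          // every match is visited once
            heap.add(doc);
            if (heap.size() > k) heap.poll();  // evict the current weakest
        }
        List<Integer> best = new ArrayList<>();
        while (!heap.isEmpty()) best.add(0, heap.poll()); // reverse to best-first
        return best;
    }

    public static void main(String[] args) {
        // Made-up scores; the last match happens to be the most relevant.
        double[] scores = {0.2, 0.9, 0.1, 0.4, 0.3, 0.95};
        System.out.println(topK(scores, 2) + " after visiting " + scored + " matches");
    }
}
```

The heap never holds more than k + 1 entries, so asking for 5 results keeps
memory small, but the loop still runs once per match: about 5000 times for
the AND query and about 8 million times for the OR query.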

Best
Erick

On Sun, Oct 23, 2011 at 3:06 PM, sol myr <[email protected]> wrote:
> Hi,
>
> We've noticed a Lucene performance phenomenon, and would appreciate an 
> explanation from anyone familiar with Lucene internals.
>
> (I know Lucene as a user, but haven't looked under its hood).
>
> We have a Lucene index of about 30 million records.
> We ran 2 queries: "AND" and "OR" ("+john +doe" versus "john doe").
> The AND query had much better performance (AND takes about 500 millis, while 
> OR takes about 2000 millis).
>
> We wondered whether this has anything to do with the number of potential 
> matches?
> Our AND has only about 5000 matches (5000 documents contain *both* "john" and 
> "doe").
> Our OR has about 8 million matches (8 million documents contain *either* 
> "john" or "doe").
>
>
> Does this explain the performance difference?
> But why would it matter, as long as we take only the top 5 matches 
> (indexSearcher.search(query, 5))...?
> Is there any other explanation?
>
> Thanks :)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]