>> If your terms are roughly equally distributed in all N indices (e.g. random
>> doc->index/shard assignment), the relevance score will roughly match.

Agreed. I did some formal benchmarking of local IDF vs global IDF relevance ranking recently.

I measured the movement of the top ranked document in a single index's results (global IDF) vs the same document's position in results merged from 2 remote indexes with randomized doc->shard assignment (a local IDF scheme). This distance was measured for a large number of real-world queries.

Results were very promising - the distributed ranking scheme very rarely differed from that of the single large index.

----- Original Message ----
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, 2 May, 2008 1:35:04 AM
Subject: Re: Does Lucene Supports Billions of data

Right. And the typical answer to that is:

- If your terms are roughly equally distributed in all N indices (e.g. random doc->index/shard assignment), the relevance score will roughly match.
- If you have business rules for doc->index/shard distribution, then your relevance scores will not be comparable.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Toke Eskildsen <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Friday, May 2, 2008 12:13:04 AM
> Subject: Re: Does Lucene Supports Billions of data
>
> From: John Wang
> [...]
> > sub index 1: 1 billion docs
> > sub index 2: 1 billion docs
> > sub index 3: 1 billion docs
> >
> > federating search to these subindexes, you represent an index of 3
> > billion docs, and all internal doc ids are of type int.
>
> That falls under Daniel's "...unless you wrap your own framework around it".
> The problem with the solution you're describing is that it's not functionally
> equivalent to a single index of 3 billion docs.
>
> If you just create 3 independent indexes and merge the top hits from all 3,
> the ranking of the documents will be messed up. You'll need to make sure that
> the scores from the different indexes can be compared. That's tricky when the
> score depends on the frequency of the terms in the whole corpus.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
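
For anyone who wants to try the local-vs-global IDF comparison described in this thread, here is a rough sketch of the idea in Python. It is not the benchmark code and it does not use Lucene: the scoring is a simplified TF-IDF loosely modelled on the classic formula, the corpus is synthetic rather than real-world queries, and all names (`rank`, `merged_rank`, `top_hit_displacement`, `synthetic_corpus`) are made up for illustration. It finds the top hit under global IDF, then reports that document's position after merging hits from randomly assigned shards that each use only their own local IDF.

```python
import math
import random
from collections import Counter

VOCAB = [f"t{i}" for i in range(50)]  # hypothetical vocabulary


def idf(doc_freq, num_docs):
    # Smoothed IDF; an assumption for this sketch, not Lucene's exact scoring.
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


def rank(docs, query):
    """Score docs (id -> list of terms) using statistics from `docs` only.

    Returns [(score, doc_id)] sorted best first, ties broken by doc_id.
    """
    df = Counter(t for terms in docs.values() for t in set(terms))
    n = len(docs)
    hits = []
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = sum(math.sqrt(tf[t]) * idf(df[t], n) ** 2
                    for t in query if tf[t])
        if score > 0:
            hits.append((score, doc_id))
    return sorted(hits, key=lambda h: (-h[0], h[1]))


def merged_rank(shards, query):
    """Rank each shard with its own (local) IDF, then merge by raw score."""
    hits = []
    for shard in shards:
        hits.extend(rank(shard, query))
    return sorted(hits, key=lambda h: (-h[0], h[1]))


def top_hit_displacement(docs, query, num_shards, rng):
    """Position, in the merged local-IDF ranking, of the global-IDF top hit.

    0 means the top hit did not move. Returns None if nothing matches.
    """
    global_hits = rank(docs, query)
    if not global_hits:
        return None
    top_id = global_hits[0][1]
    # Random doc->shard assignment, as in the scenario discussed above.
    shards = [dict() for _ in range(num_shards)]
    for doc_id, terms in docs.items():
        shards[rng.randrange(num_shards)][doc_id] = terms
    merged_ids = [doc_id for _, doc_id in merged_rank(shards, query)]
    return merged_ids.index(top_id)


def synthetic_corpus(num_docs, rng):
    # Zipf-like term weights so that some terms are common and some are rare.
    weights = [1.0 / (i + 1) for i in range(len(VOCAB))]
    return {d: rng.choices(VOCAB, weights=weights, k=20)
            for d in range(num_docs)}


if __name__ == "__main__":
    rng = random.Random(42)
    docs = synthetic_corpus(2000, rng)
    moves = [top_hit_displacement(docs, rng.sample(VOCAB, 2), 2, rng)
             for _ in range(100)]
    moves = [m for m in moves if m is not None]
    print("top hit stayed at rank 0 in", moves.count(0), "of", len(moves))
```

With `num_shards=1` the local and global statistics are identical, so the displacement is always 0, which makes a handy sanity check. To see the "business rules" case, replace the random shard assignment with a rule that groups similar documents together and compare the displacements.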