Re: Does Lucene Supports Billions of data

mark harwood Fri, 02 May 2008 01:36:53 -0700

>> If your terms are roughly equally distributed in all N indices (e.g. random 
>> doc->index/shard assignment), the relevance score will roughly match.



Agreed. I did some formal benchmarking of local IDF vs global IDF relevance 
ranking recently.
I measured the movement of the top ranked document in a single index's results 
(global IDF) vs the same document's position in results merged from 2 remote 
indexes with randomized doc->shard assignment (a local IDF scheme). This 
distance was measured for a large number of real-world queries.
Results were very promising - the distributed ranking scheme very rarely 
differed from that of the single large index.
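
For anyone wanting to repeat this kind of measurement, here is a minimal sketch of the displacement metric described above. It is not the benchmark code itself: the function names, the (score, doc_id) hit format, and the sample hits are hypothetical, and it assumes the merge simply sorts shard hits by their raw local-IDF scores.

```python
import heapq

def merge_shard_hits(shard_hits, k):
    """Merge per-shard (score, doc_id) hits into one top-k ranking by raw score."""
    all_hits = [hit for hits in shard_hits for hit in hits]
    return [doc_id for _score, doc_id in heapq.nlargest(k, all_hits)]

def top_doc_displacement(single_index_ranking, merged_ranking):
    """Position in the merged ranking of the single index's top document.

    0 means the top document did not move. Raises ValueError if the
    document dropped out of the merged top-k entirely.
    """
    top_doc = single_index_ranking[0]
    return merged_ranking.index(top_doc)

# Hypothetical results for one query (illustrative values only).
single_index_ranking = ["d7", "d2", "d9", "d4"]   # global IDF
shard_hits = [                                    # local IDF, one list per shard
    [(2.1, "d2"), (0.5, "d4")],
    [(2.0, "d7"), (1.1, "d9")],
]

merged = merge_shard_hits(shard_hits, k=4)
displacement = top_doc_displacement(single_index_ranking, merged)
print(merged, displacement)
```

Averaging the displacement over a query log gives a single number for comparing the two ranking schemes.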

----- Original Message ----
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: [email protected]
Sent: Friday, 2 May, 2008 1:35:04 AM
Subject: Re: Does Lucene Supports Billions of data

Right.  And the typical answer to that is:

- If your terms are roughly equally distributed in all N indices (e.g. random 
doc->index/shard assignment), the relevance score will roughly match.

- If you have business rules for doc->index/shard distribution, then your 
relevance scores will not be comparable.
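
A rough illustration of the two cases above, assuming the classic Lucene idf formula, 1 + ln(numDocs / (docFreq + 1)). The toy corpus, the helper names, and the "route by term" business rule are all made up for the example.

```python
import math
import random

def idf(doc_freq, num_docs):
    # Classic Lucene-style idf (assumed here): 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def per_shard_idf(docs, term, assign, num_shards):
    """Split docs into shards via assign(doc) and return each shard's local idf."""
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        shards[assign(doc)].append(doc)
    return [idf(sum(term in d for d in shard), len(shard)) for shard in shards]

# Toy corpus: 10,000 docs, every tenth one contains the term "lucene".
docs = [{"lucene", "index"} if i % 10 == 0 else {"index"} for i in range(10000)]
global_idf = idf(sum("lucene" in d for d in docs), len(docs))

# Case 1: random doc->shard assignment.
rng = random.Random(0)
random_idfs = per_shard_idf(docs, "lucene", lambda d: rng.randrange(2), 2)

# Case 2: a hypothetical business rule that routes docs by topic.
rule_idfs = per_shard_idf(docs, "lucene", lambda d: 0 if "lucene" in d else 1, 2)

print(global_idf, random_idfs, rule_idfs)
```

With random assignment each shard's local idf stays close to the global value, so merged scores are roughly comparable. With the topical rule the two shards disagree sharply about how rare the term is.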

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Toke Eskildsen <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, May 2, 2008 12:13:04 AM
> Subject: Re: Does Lucene Supports Billions of data
> 
> From: John Wang 
> [...]
> > sub index 1: 1 billion docs
> > sub index 2: 1 billion docs
> > sub index 3: 1 billion docs
> > 
> > federating search to these subindexes, you represent an index of 3 
> > billion docs, and all internal doc ids are of type int.
> 
> That falls under Daniel's "...unless you wrap your own framework around it". 
> The problem with the solution you're describing is that it's not 
> functionally equivalent to a single index of 3 billion docs.
> 
> If you just create 3 independent indexes and merge the top hits from all 3, 
> the ranking of the documents will be messed up. You'll need to make sure 
> that the scores from the different indexes can be compared. That's tricky 
> when the score depends on the frequency of the terms in the whole corpus.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 



