Re: Does Lucene Supports Billions of data

Otis Gospodnetic Thu, 01 May 2008 17:35:40 -0700

Right.  And the typical answer to that is:

- If your terms are roughly equally distributed in all N indices (e.g. random 
doc->index/shard assignment), the relevance score will roughly match.


- If you have business rules for doc->index/shard distribution, then your 
relevance scores will not be comparable.

Otis 

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Toke Eskildsen <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Friday, May 2, 2008 12:13:04 AM
> Subject: Re: Does Lucene Supports Billions of data
> 
> From: John Wang 
> [...]
> > sub index 1: 1 billion docs
> > sub index 2: 1 billion docs
> > sub index 3: 1 billion docs
> > 
> > federating search to these subindexes, you represent an index of 3 
> > billiondocs, and all internal doc ids are of type int.
> 
> That falls under Daniel's "...unless you wrap your own framework around it". 
> The 
> problem with the solution you're describing is that it's not functionally 
> equivalent to a single index of 3 billion docs.
> 
> If you just create 3 independent indexes and merge the top hits from all 3, 
> the 
> ranking of the documents will be messed up. You'll need to make sure that the 
> scores from the different indexes can be compared. That's tricky when the 
> score 
> depends on the frequency of the terms in the whole corpus.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Does Lucene Supports Billions of data

Reply via email to