Re: Cores and ranking (search quality)

Walter Underwood Tue, 10 Mar 2015 11:40:36 -0700

On Mar 10, 2015, at 10:17 AM, johnmu...@aol.com wrote:

> If I have two cores, one with 10 docs and another with 100,000 docs, and I 
> submit two docs that are 100% identical (with the exception of the unique-ID 
> field, which is stored but not indexed), one to each core, the question is: 
> during search, will both of those docs rank near each other or not? […]
> 
> Put another way: do docs from the smaller core (the one with only 10 docs) 
> rank higher or lower than docs from the larger core (the one with 
> 100,000 docs)?


These are not quite the same question.

tf.idf ranking depends on the other documents in the collection (the idf term). 
With only 10 docs, the document frequency statistics are effectively random 
noise, so the ranking in that core is unpredictable.

Identical documents should rank identically, but whether they are higher or 
lower in the two cores depends on the rest of the docs.

idf statistics don’t settle down until at least 10K docs. You still sometimes 
see anomalies under a million documents. 
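
To make the noise concrete, here is a rough sketch of how much one extra 
matching document moves idf in a small core versus a large one. The formula 
is a textbook-style idf, 1 + ln(N / (df + 1)), which has the same general 
shape as classic tf.idf scoring; treat the exact formula, the function names, 
and the document counts as assumptions for illustration, not as what your 
Solr version computes.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Textbook-style idf (assumed formula): 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def idf_shift(num_docs: int, doc_freq: int) -> float:
    """How far idf drops when one more doc containing the term is added."""
    before = idf(num_docs, doc_freq)
    after = idf(num_docs + 1, doc_freq + 1)
    return before - after


# Hypothetical cores: the term appears in 20% of the docs in each one.
small_shift = idf_shift(10, 2)
large_shift = idf_shift(100_000, 20_000)

print(f"10-doc core:      idf moves by {small_shift:.6f}")
print(f"100,000-doc core: idf moves by {large_shift:.6f}")
```

In the 10-doc core a single added document shifts the term weight by a 
visible amount; in the 100,000-doc core the same change is lost far down in 
the decimals.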

What design decision do you need to make? We can probably answer that for you.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
