Re: SolrCloud - Score calculation

2013-06-20 Thread Learner
Thanks for your response.

So in the case of SolrCloud, SOLR/ZooKeeper takes care of managing the
indexing / searching. In that case I assume most of the shards will be of
equal size (I am just going to push the data to a leader). I assume IDF won't
be a big issue then, since the shard sizes are almost equal... Am I right?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-Score-calculation-tp4071805p4071900.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud - Score calculation

2013-06-20 Thread Jack Krupansky
Even if shards are exactly the same size, the distribution of terms may not 
be equal in each shard. But, yes, if shard size and term distribution are 
equal, then IDF should be comparable across shards, sort of.
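
To make that concrete, here is a minimal Python sketch. It assumes the
classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)). The
shard size and document frequencies are made-up numbers, and classic_idf is
a hypothetical helper for illustration, not a Solr API.

```python
import math

def classic_idf(doc_freq, num_docs):
    """IDF in the classic Lucene TF-IDF style: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Two shards of identical size (hypothetical numbers).
shard_size = 10_000_000

# The same term is rare in shard A but common in shard B.
idf_a = classic_idf(doc_freq=100, num_docs=shard_size)
idf_b = classic_idf(doc_freq=10_000, num_docs=shard_size)

print(f"shard A idf: {idf_a:.2f}")
print(f"shard B idf: {idf_b:.2f}")
```

Both shards hold the same number of documents, yet the term gets a higher
IDF in shard A, so a matching document there would tend to score higher.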


-- Jack Krupansky




Re: SolrCloud - Score calculation

2013-06-19 Thread Upayavira
The reason for the issue you are seeing is the IDF component in the
score. IDF = inverse document frequency.

The document frequency is the number of documents in the index in which a
term appears. The higher the document frequency, the more common the term
and thus the less relevant it is. The document frequency is inverted to give
a higher number for more relevant terms.

Solr does not yet support distributed IDF, so each shard computes IDF from
its own local statistics. Therefore the document frequency in a 3m shard
will be higher (as a proportion of that shard's index) compared to your 30m
shard, thus it will score lower.
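
As a rough illustration, the Python sketch below compares per-shard IDF with
an IDF computed from statistics summed over all shards. It assumes the
classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)). The document
frequencies are invented, classic_idf is a hypothetical helper, and the
"global" value only shows the idea behind distributed IDF, not how Solr
implements it.

```python
import math

def classic_idf(doc_freq, num_docs):
    """IDF in the classic Lucene TF-IDF style: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical per-shard statistics for one term: (docFreq, numDocs).
shards = {
    "shard1": (1_000, 30_000_000),
    "shard2": (1_000, 30_000_000),
    "shard3": (1_000, 3_000_000),
}

# Without distributed IDF, each shard scores with its own local statistics.
local_idf = {name: classic_idf(df, n) for name, (df, n) in shards.items()}

# A distributed IDF would conceptually use statistics summed over all shards.
total_df = sum(df for df, _ in shards.values())
total_docs = sum(n for _, n in shards.values())
global_idf = classic_idf(total_df, total_docs)

for name, idf in local_idf.items():
    print(f"{name}: local idf = {idf:.2f}")
print(f"global idf = {global_idf:.2f}")
```

With these numbers the term is proportionally more common in the 3m shard,
so its local IDF there is lower than in the 30m shards, and the global value
falls in between.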

I am not aware of a multiplier you can use to fix this. There is a
distributed IDF ticket in JIRA, maybe that is mature enough and might
help you.

Upayavira

On Thu, Jun 20, 2013, at 01:56 AM, Learner wrote:
 Hi,
 
 Sorry if it's a very basic question, but I am pretty new to SolrCloud and I
 am trying to understand the underlying mechanism for calculating relevancy.
 
 Currently we are using SOLR 3.6.X and we use shards to perform distributed
 searching. Our shards are not of equal size, hence sometimes the results are
 not as we expected.
 
 For example: Shard 1 has 30 million documents, Shard 2 has 30 million
 documents and shard 3 has just 3 million documents (push indexing via
 message queue).
 
 When we do a search using shards, documents from shard 1 and shard 2 get
 higher priority compared to documents in shard 3 (since it's smaller).
 Currently we add an index-time boost when adding documents to shard 3 so
 that the documents from shard 3 also come up (higher) in search results.
 
 Now when using SolrCloud, say for example one shard has a person name
 repeated 5 times (with different unique ids) and we have the same person
 name once more in shard 2 (with a different id). When we do a search, how
 does SOLR calculate the score? Does it do something like constant scoring
 across various shards in order to bring up the search results across
 various shards? How does the score get calculated? Do all 6 documents get
 the same score (5 from shard 1 and 1 from shard 2, if all the fields have
 the same value except for the unique id)?
 
 Thanks,
 BB
 
 
 