[solrCloud] Distributed IDF - scoring in the cloud

Thorsten Scherler Fri, 18 Feb 2011 04:08:24 -0800

Hi all,

I am working through the solrCloud examples, and one thing I am not clear
about is how scoring works in a distributed search.


I did a small test where I used the "Example A: Simple two shard
cluster" from wiki:SolrCloud and additionally ran

java -Durl=http://localhost:7574/solr/collection1/update -jar post.jar ipod_other.xml

java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar monitor2.xml

Now requesting
http://localhost:8983/solr/collection1/select?distrib=true&q=electronics&fl=score&shards=localhost:8983/solr,localhost:7574/solr
from either host returns the same result. Here each hit gets its score
from the shard that holds it, and the hits are merged into one result
document.

However, when I also add monitor2.xml to 7574, which previously did not
contain it, the score changes depending on which server I send the
request to.

The score returned by 8983 is always <float
name="score">0.09289607</float>, whether distrib is true or false.

The score returned by 7574 is always <float
name="score">0.121383816</float>, whether distrib is true or false.

So is it correct to assume that if a document is indexed in both shards,
the score that wins is the one from the host that received the request?

My client plans to distribute the current index across different shards.
For example, each "Consejería" (counseling) should be hosted in its own
shard. The critical point for the client is that a distributed search
scores documents the same way as the single big index they use right now.

As I understand it, the current solrCloud implementation makes no
attempt to harmonize scores across shards.

In my research I came across
http://markmail.org/message/bhhfwymz5y7lvoj7
"The "IDF" part of the relevancy score is the only place that
distributed search scoring won't "match up" with no distributed
scoring because the document frequency used for the term is local to
every core instead of global.  If you distribute your documents fairly
randomly to the different shards, this won't matter.

There is a patch in the works to add global idf, but I think that even
when it's committed, it will default to off because of the higher cost
associated with it."

The patch is https://issues.apache.org/jira/browse/SOLR-1632
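
To make the quoted point concrete, here is a minimal sketch of how a
shard-local idf can drift away from the idf of one combined index. It
assumes Lucene's classic TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1));
the shard names and all document counts are hypothetical and are not
taken from the test above.

```python
import math


def classic_idf(doc_freq, num_docs):
    """idf as in Lucene's classic TF-IDF similarity (assumed here)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical per-shard statistics for a single query term.
shards = {
    "shard_a": {"num_docs": 1000, "doc_freq": 10},   # term is rare on this shard
    "shard_b": {"num_docs": 1000, "doc_freq": 400},  # term is common on this shard
}

# Each shard scores with its own statistics only.
local_idf = {
    name: classic_idf(stats["doc_freq"], stats["num_docs"])
    for name, stats in shards.items()
}

# A single combined index would see the summed statistics.
total_docs = sum(stats["num_docs"] for stats in shards.values())
total_doc_freq = sum(stats["doc_freq"] for stats in shards.values())
global_idf = classic_idf(total_doc_freq, total_docs)

for name, idf in local_idf.items():
    print(f"{name}: local idf = {idf:.4f}")
print(f"combined index: global idf = {global_idf:.4f}")
```

With a skewed distribution like this one, the same term gets a different
weight on each shard, and neither matches the combined index. If the
documents were spread evenly, the local values would sit close to the
global one, which is the "fairly randomly" case in the quote.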

However, the last comment is from 26/Jul/10 and reports that the patch
failed, and a comment from Yonik gives the impression that it is not
ready to use:

"It looks like the issue is this: rewrite() doesn't work for function
queries (there is no propagation mechanism to go through value sources).
This is a problem when real queries are embedded in function queries."

Is there general interest in bringing SOLR-1632 into trunk (especially
for solrCloud)?

Or would it be better to look into something that scales the index into
HBase, so that the client does not lose the scoring?

TIA for your feedback
-- 
Thorsten Scherler <thorsten.at.apache.org>
codeBusters S.L. - web based systems
<consulting, training and solutions>
http://www.codebusters.es/
