Hi all, I am working through the SolrCloud examples, and one thing I am not clear about is how scoring works in a distributed search.
I did a small test where I used the "Example A: Simple two shard cluster" from wiki:SolrCloud and additionally added:

  java -Durl=http://localhost:7574/solr/collection1/update -jar post.jar ipod_other.xml
  java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar monitor2.xml

Now requesting

  http://localhost:8983/solr/collection1/select?distrib=true&q=electronics&fl=score&shards=localhost:8983/solr,localhost:7574/solr

from either host returns the same result. Here each hit gets its score from the shard that holds it, and the hits are merged into one result list.

However, when I also add monitor2.xml to 7574, which previously did not contain it, the scoring changes depending on the server I request.

The score returned for 8983 is always <float name="score">0.09289607</float>, regardless of distrib=true|false.
The score returned for 7574 is always <float name="score">0.121383816</float>, regardless of distrib=true|false.

So is it correct to assume that if a document is indexed in both shards, the score that wins is the one from the host that received the request?

My client plans to distribute the current index into different shards. For example, each "Consejería" (counseling) should be hosted in its own shard. The critical point for the client is that a distributed search scores documents the same way as the single big index they use right now.

As I understand the current SolrCloud implementation, nothing is done to harmonize the scores across shards. In my research I came across http://markmail.org/message/bhhfwymz5y7lvoj7:

"The "IDF" part of the relevancy score is the only place that distributed search scoring won't "match up" with no distributed scoring because the document frequency used for the term is local to every core instead of global. If you distribute your documents fairly randomly to the different shards, this won't matter. There is a patch in the works to add global idf, but I think that even when it's committed, it will default to off because of the higher cost associated with it."

The patch is https://issues.apache.org/jira/browse/SOLR-1632. However, the last comment is from 26/Jul/10 and reports that the patch failed, and a comment from Yonik gives the impression that it is not ready to use:

"It looks like the issue is this: rewrite() doesn't work for function queries (there is no propagation mechanism to go through value sources). This is a problem when real queries are embedded in function queries."

Is there general interest in bringing SOLR-1632 to trunk (especially for SolrCloud)? Or might it be better to look into something that scales the index into HBase, so that the scoring is not lost?

TIA for your feedback

--
Thorsten Scherler <thorsten.at.apache.org>
codeBusters S.L. - web based systems
<consulting, training and solutions>
http://www.codebusters.es/
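
P.S. To make the IDF point from the quoted message concrete, here is a minimal Python sketch. It assumes the classic Lucene TF-IDF formula idf = 1 + ln(numDocs / (docFreq + 1)); the function name classic_idf, the shard labels, and all document counts are made up for illustration and are not taken from my test index.

```python
import math


def classic_idf(doc_freq, num_docs):
    """IDF as in Lucene's classic TF-IDF similarity (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical per-shard statistics for the term "electronics".
# The term is common on one shard and rare on the other.
shards = {
    "shard_8983": {"num_docs": 10, "doc_freq": 8},
    "shard_7574": {"num_docs": 10, "doc_freq": 2},
}

# What each shard computes on its own (local document frequency).
local_idf = {
    name: classic_idf(stats["doc_freq"], stats["num_docs"])
    for name, stats in shards.items()
}

# What one big index would compute (global document frequency).
global_idf = classic_idf(
    sum(stats["doc_freq"] for stats in shards.values()),
    sum(stats["num_docs"] for stats in shards.values()),
)

for name, value in local_idf.items():
    print(f"{name}: local idf = {value:.4f}")
print(f"single index: global idf = {global_idf:.4f}")
```

With an uneven split like this, the same term gets a different IDF on each shard, and neither matches the single-index value. If the documents were spread evenly, the local values would move toward the global one, which is the "fairly randomly" case the quote describes.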