[ https://issues.apache.org/jira/browse/SOLR-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462105#comment-13462105 ]
Mark Miller commented on SOLR-3875:
-----------------------------------

bq. patch with proposed test & fix

+1

I applied the patch, inspected the fix, inspected the test. It looks right to me. I also ran all tests, and verified the new test fails as expected without the fix.

> Document boost does not work correctly when using multi-valued fields
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3875
>                 URL: https://issues.apache.org/jira/browse/SOLR-3875
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis, update
>    Affects Versions: 4.0-BETA
>            Reporter: Toke Eskildsen
>            Priority: Critical
>             Fix For: 4.0, 4.1, 5.0
>
>         Attachments: SOLR-3875.patch
>
>
> In Solr 4 BETA & trunk, document boosts skew the ranking for documents with
> multi value fields tremendously. A document boost of 5 combined with 15
> values in a multi value field results in scores above 1,000,000,000, while a
> boost of 0.5 results in scores below 0.001. The error is not present in Solr
> 3.6.
> Thomas Egense and I have tracked it down to a change in Solr DocumentBuilder
> committed 20110827 (@1162347) by Mike McCandless, as part of work done on
> LUCENE-2308. The problem is that Lucene multiplies the boosts of multiple
> instances of the same field when updating the index.
> The old DocumentBuilder, used in Lucene 3.6, handled this by calculating the
> score for the field (docBoost*fieldBoost) and assigning it to the first
> instance of the field, then setting the boost to 1.0f and assigning that to
> subsequent instances of the field. This effectively assigned
> docBoost*fieldBoost to the field, regardless of the number of instances.
> The updated DocumentBuilder (see
> https://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_0/solr/core/src/java/org/apache/solr/update/DocumentBuilder.java?revision=1388778&view=markup),
> used in Lucene 4 BETA & trunk, also assigns docBoost*fieldBoost to the first
> instance of the field.
> Then it sets fieldBoost = docBoost and continues to
> assign docBoost*fieldBoost to subsequent instances. Using the example
> mentioned above, the generated IndexableFields will get assigned boosts of 5,
> 5*5, 5*5... 5*5. As Lucene multiplies all the values, 15 instances of the
> same field will have a collective boost of 5*25^14.
> This can be demonstrated with the Solr tutorial example by indexing the
> sample documents and adding the document
> {code:xml}
> <add>
>   <doc boost="5">
>     <field name="id">Insane score Example. Score = 10E9 </field>
>     <field name="name">Document boost broken for multivalued fields</field>
>     <field name="manu">Thomas Egense and Toke Eskildsen</field>
>     <field name="manu_id_s">Test</field>
>     <field name="cat">bug</field>
>     <field name="features">insane_boost</field>
>     <field name="features">something else</field>
>     <field name="features">something else</field>
>     <field name="features">something else</field>
>     <field name="features">something else</field>
>     <field name="features">something else</field>
>     <field name="features">something else</field>
>     <field name="features">something else</field>
>     <field name="features">something else</field>
>     <field name="features">something else</field>
>     <field name="features">something else</field>
>     <field name="features">something else</field>
>     <field name="features">something else</field>
>     <field name="features">something else</field>
>   </doc>
> </add>
> {code}
> The _manu_ and _features_ fields get copied to _text_, and a search for
> _thomas_ matches the _text_ field with this query explanation:
> {code:xml}
> <str name="Insane score Example. Score = 10E10 ">
> 2.44373361E10 = (MATCH) weight(text:thomas in 0) [DefaultSimilarity], result of:
>   2.44373361E10 = fieldWeight in 0, product of:
>     1.0 = tf(freq=1.0), with freq of:
>       1.0 = termFreq=1.0
>     3.2512918 = idf(docFreq=3, maxDocs=38)
>     7.5161928E9 = fieldNorm(doc=0)
> </str>
> {code}
> Thomas and I are too pressed for time to attempt a proper patch at the
> moment, but we guess that a reversion to the old algorithm of assigning the
> combined boost to the first instance and 1.0f to all subsequent instances
> would work?
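
The boost arithmetic described in the quoted issue can be illustrated with a small standalone Java sketch. This is not the Solr DocumentBuilder code and does not touch any Lucene API; the class and method names (BoostSketch, oldSchemeBoosts, reportedSchemeBoosts, collectiveBoost) are hypothetical. It assumes only what the report states: that the boosts of all instances of a field are multiplied together. It does not model how Lucene encodes the result into a field norm, so the numbers will not match the fieldNorm value in the explanation above.

```java
// Illustrative sketch only: models the boost assignment described in the
// issue, not the actual Solr DocumentBuilder. All names are hypothetical.
final class BoostSketch {

    /** Old scheme per the report: combined boost on the first instance, 1.0 on the rest. */
    static double[] oldSchemeBoosts(double docBoost, double fieldBoost, int instances) {
        double[] boosts = new double[instances];
        double boost = docBoost * fieldBoost;
        for (int i = 0; i < instances; i++) {
            boosts[i] = boost;
            boost = 1.0; // later instances are neutral
        }
        return boosts;
    }

    /** Scheme the report attributes to 4.0-BETA: fieldBoost is overwritten with docBoost. */
    static double[] reportedSchemeBoosts(double docBoost, double fieldBoost, int instances) {
        double[] boosts = new double[instances];
        for (int i = 0; i < instances; i++) {
            boosts[i] = docBoost * fieldBoost;
            fieldBoost = docBoost; // every later instance now gets docBoost * docBoost
        }
        return boosts;
    }

    /** Assumption from the report: boosts of same-named field instances are multiplied. */
    static double collectiveBoost(double[] boosts) {
        double product = 1.0;
        for (double b : boosts) {
            product *= b;
        }
        return product;
    }

    public static void main(String[] args) {
        // docBoost = 5, fieldBoost = 1, 15 instances, as in the example document
        System.out.println("old scheme:      " + collectiveBoost(oldSchemeBoosts(5.0, 1.0, 15)));
        System.out.println("reported scheme: " + collectiveBoost(reportedSchemeBoosts(5.0, 1.0, 15)));
    }
}
```

Under these assumptions the old scheme yields a collective boost of 5 regardless of the number of instances, while the reported scheme yields 5*25^14 (roughly 1.9E20) for 15 instances, which is the runaway growth the issue describes.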