Toke Eskildsen created SOLR-3875:
------------------------------------

             Summary: Document boost does not work correctly when using multi-valued fields
                 Key: SOLR-3875
                 URL: https://issues.apache.org/jira/browse/SOLR-3875
             Project: Solr
          Issue Type: Bug
          Components: Schema and Analysis, update
    Affects Versions: 4.0-BETA, 4.0, 4.1, 5.0
            Reporter: Toke Eskildsen
             Fix For: 4.0, 4.1, 5.0, 4.0-BETA


In Solr 4 BETA and trunk, document boosts skew the ranking of documents with 
multi-valued fields tremendously. A document boost of 5 combined with 15 values 
in a multi-valued field results in scores above 1,000,000,000, while a boost of 
0.5 results in scores below 0.001. The error is not present in Solr 3.6.

Thomas Egense and I have tracked it down to a change in Solr's DocumentBuilder, 
committed on 2011-08-27 (r1162347) by Mike McCandless as part of the work on 
LUCENE-2308. The problem is that Lucene multiplies the boosts of multiple 
instances of the same field when updating the index.

The old DocumentBuilder, used in Solr 3.6, handled this by calculating the 
boost for the field (docBoost*fieldBoost) and assigning it to the first 
instance of the field, then setting the boost to 1.0f and assigning that to 
all subsequent instances of the field. This effectively assigned 
docBoost*fieldBoost to the field, regardless of the number of instances.
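
As a minimal sketch of the old behaviour described above (the class and method names are hypothetical and this is not the actual DocumentBuilder code), the per-instance boosts and their product could be modelled like this:

```java
// Hypothetical model of the old boost assignment; not Solr's real DocumentBuilder.
class OldBoostSketch {

    // Boost assigned to each instance of a single multi-valued field.
    static float[] instanceBoosts(float docBoost, float fieldBoost, int instances) {
        float[] boosts = new float[instances];
        float boost = docBoost * fieldBoost; // combined boost goes to the first instance
        for (int i = 0; i < instances; i++) {
            boosts[i] = boost;
            boost = 1.0f; // every subsequent instance is neutral
        }
        return boosts;
    }

    // Lucene multiplies the boosts of all instances of the same field.
    static double collectiveBoost(float[] boosts) {
        double product = 1.0;
        for (float b : boosts) {
            product *= b;
        }
        return product;
    }

    public static void main(String[] args) {
        float[] boosts = instanceBoosts(5.0f, 1.0f, 15);
        System.out.println("collective boost = " + collectiveBoost(boosts));
    }
}
```

With a document boost of 5, a field boost of 1 and 15 instances, the product stays at 5 no matter how many values the field has.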

The updated DocumentBuilder (see 
https://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_0/solr/core/src/java/org/apache/solr/update/DocumentBuilder.java?revision=1388778&view=markup),
 used in Solr 4 BETA and trunk, also assigns docBoost*fieldBoost to the first 
instance of the field. It then sets fieldBoost = docBoost and continues to 
assign docBoost*fieldBoost to subsequent instances. Using the example mentioned 
above, the generated IndexableFields are assigned boosts of 5, 5*5, 5*5, ..., 
5*5. As Lucene multiplies all the values, 15 instances of the same field will 
have a collective boost of 5*25^14.
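
The compounding can be reproduced with a small stand-alone model. This is only a sketch of the arithmetic as described above, with hypothetical class and method names; it does not call any Solr or Lucene code.

```java
// Hypothetical model of the boost assignment described for Solr 4 BETA; not the real DocumentBuilder.
class CompoundingBoostSketch {

    // Boost assigned to each instance of a single multi-valued field.
    static float[] instanceBoosts(float docBoost, float fieldBoost, int instances) {
        float[] boosts = new float[instances];
        for (int i = 0; i < instances; i++) {
            boosts[i] = docBoost * fieldBoost;
            fieldBoost = docBoost; // reported problem: later instances get docBoost * docBoost
        }
        return boosts;
    }

    // Lucene multiplies the boosts of all instances of the same field.
    static double collectiveBoost(float[] boosts) {
        double product = 1.0;
        for (float b : boosts) {
            product *= b;
        }
        return product;
    }

    public static void main(String[] args) {
        // docBoost = 5, default fieldBoost = 1, 15 instances: 5 * 25^14, roughly 1.86e20
        System.out.println(collectiveBoost(instanceBoosts(5.0f, 1.0f, 15)));
        // docBoost = 0.5 shrinks the product instead: 0.5 * 0.25^14
        System.out.println(collectiveBoost(instanceBoosts(0.5f, 1.0f, 15)));
    }
}
```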

This can be demonstrated with the Solr tutorial example by indexing the sample 
documents and adding the document 
{code:xml}
<add>
<doc boost="5">
  <field name="id">Insane score Example. Score = 10E9 </field>
  <field name="name">Document boost broken for multivalued fields</field>
  <field name="manu">Thomas Egense and Toke Eskildsen</field>
  <field name="manu_id_s">Test</field>
  <field name="cat">bug</field>
  <field name="features">insane_boost</field>
  <field name="features">something else</field>
  <field name="features">something else</field>
  <field name="features">something else</field>
  <field name="features">something else</field>
  <field name="features">something else</field>
  <field name="features">something else</field>
  <field name="features">something else</field>
  <field name="features">something else</field>
  <field name="features">something else</field>
  <field name="features">something else</field>
  <field name="features">something else</field>
  <field name="features">something else</field>
  <field name="features">something else</field>  
</doc>
</add>
{code}

The _manu_ and _features_ fields get copied to _text_, and a search for _thomas_ 
matches the _text_ field with the following query explanation:
{code:xml}
<str name="Insane score Example. Score = 10E10 ">
2.44373361E10 = (MATCH) weight(text:thomas in 0) [DefaultSimilarity], result of:
  2.44373361E10 = fieldWeight in 0, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    3.2512918 = idf(docFreq=3, maxDocs=38)
    7.5161928E9 = fieldNorm(doc=0)
</str>
{code}
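
To make the explanation easier to read: the fieldWeight is the product of tf, idf and fieldNorm, so the inflated fieldNorm alone accounts for the score. The check below simply multiplies the three numbers from the output above; the class and method names are made up for illustration, and plain double arithmetic is used rather than Lucene's own code.

```java
// Illustrative arithmetic only; the names are hypothetical and no Lucene classes are used.
class FieldWeightCheck {

    // fieldWeight as shown in the explain output: tf * idf * fieldNorm.
    static double fieldWeight(double tf, double idf, double fieldNorm) {
        return tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        double score = fieldWeight(1.0, 3.2512918, 7.5161928E9);
        // Should agree with the reported 2.44373361E10 up to rounding.
        System.out.println(score);
    }
}
```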

Thomas and I are too pressed for time to attempt a proper patch at the moment, 
but we guess that a reversion to the old algorithm of assigning the combined 
boost to the first instance and 1.0f to all subsequent instances would work?
