Frustrating differences in fieldNorm between two different versions of solr indexing the same document

Aaron Daubman Wed, 18 Jul 2012 21:10:47 -0700

Greetings,

I've been digging in to this for two days now and have come up short -
hopefully there is some simple answer I am just not seeing:


I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
identically as possible (given deprecations) and indexing the same document.

For most queries the results are very close (scoring within three
significant differences, almost identical positions in results).

However, for certain documents, the scores are very different (causing
these docs to be ranked +/- 25 positions different or more in the results)

In looking at debugQuery output, it seems like this is due to fieldNorm
values being lower for the 3.6.0 instance than the 1.4.1.

(note that for most docs, the fieldNorms are identical)

I have taken the field values for the example below and run them
through /admin/analysis.jsp on each solr instance. Even for the problematic
docs/fields, the results are almost identical. For the example below, the
t_tag values for the problematic doc:
1.4.1: 162 values
3.6.0: 164 values

note that 1/sqrt(162) = 0.07857 ~= fieldNorm for 1.4.1,
however, (1/0.0625)^2 = 256, which is no where near 164

Here is a particular example from 1.4.1:
1.6263733 = (MATCH) fieldWeight(t_tag:soul in 2066419), product of:
   3.8729835 = tf(termFreq(t_tag:soul)=15)
   5.3750753 = idf(docFreq=27619, maxDocs=2194294)
   0.078125 = fieldNorm(field=t_tag, doc=2066419)

And the same from 3.6.0:
1.3042576 = (MATCH) fieldWeight(t_tag:soul in 1977957), product of:
   3.8729835 = tf(termFreq(t_tag:soul)=15)
   5.388126 = idf(docFreq=27740, maxDocs=2232857)
   0.0625 = fieldNorm(field=t_tag, doc=1977957)


Here is the 1.4.1 config for the t_tag field and text type:
    <fieldtype name="text" class="solr.TextField"
positionIncrementGap="100">
          <analyzer>
              <tokenizer class="solr.StandardTokenizerFactory"/>
              <filter class="solr.StandardFilterFactory"/>
              <filter class="solr.ISOLatin1AccentFilterFactory"/>
              <filter class="solr.LowerCaseFilterFactory"/>
              <filter class="solr.StopFilterFactory" words="stopwords.txt"
ignoreCase="true"/>
              <filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
          </analyzer>
      </fieldtype>
<dynamicField name="t_*" type="text" indexed="true" stored="true"
required="false" multiValued="true" termVectors="true"/>


And 3.6.0 schema config for the t_tag field and text type:
        <fieldtype name="text" class="solr.TextField"
positionIncrementGap="100" autoGeneratePhraseQueries="true">
            <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.StandardFilterFactory"/>
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.StopFilterFactory"
words="stopwords.txt" ignoreCase="true"/>
                <filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
                <filter class="solr.PorterStemFilterFactory"/>
            </analyzer>
        </fieldtype>
        <field name="t_tag" type="text" indexed="true" stored="true"
required="false" multiValued="true"/>

I at first got distracted by this change between versions:
LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default. This
means that terms with a position increment gap of zero do not affect the
norms calculation by default.
However, this doesn't appear to be causing the issue as, according to
analysis.jsp there is no overlap for t_tag...

Can you point me to where these fieldNorm differences are coming from and
why they'd only be happing for a select few documents for which the content
doesn't stand out?

Thank you,
     Aaron

Frustrating differences in fieldNorm between two different versions of solr indexing the same document

Reply via email to