Greetings, I've been digging into this for two days now and have come up short. Hopefully there is some simple answer I am just not seeing:
I have a Solr 1.4.1 instance and a Solr 3.6.0 instance, both configured as identically as possible (given deprecations) and indexing the same documents. For most queries the results are very close (scores agree to within three significant digits, with almost identical positions in the results). However, for certain documents, the scores are very different, causing these docs to be ranked 25 or more positions higher or lower in the results.

Looking at the debugQuery output, it seems this is due to fieldNorm values being lower on the 3.6.0 instance than on the 1.4.1 instance. (Note that for most docs the fieldNorms are identical.)

I have taken the field values for the example below and run them through /admin/analysis.jsp on each Solr instance. Even for the problematic docs/fields, the results are almost identical. For the example below, the number of t_tag values for the problematic doc is:

  1.4.1: 162 values
  3.6.0: 164 values

Note that 1/sqrt(162) = 0.07857, which is approximately the fieldNorm reported by 1.4.1. However, (1/0.0625)^2 = 256, which is nowhere near 164.

Here is a particular example from 1.4.1:

  1.6263733 = (MATCH) fieldWeight(t_tag:soul in 2066419), product of:
    3.8729835 = tf(termFreq(t_tag:soul)=15)
    5.3750753 = idf(docFreq=27619, maxDocs=2194294)
    0.078125 = fieldNorm(field=t_tag, doc=2066419)

And the same from 3.6.0:

  1.3042576 = (MATCH) fieldWeight(t_tag:soul in 1977957), product of:
    3.8729835 = tf(termFreq(t_tag:soul)=15)
    5.388126 = idf(docFreq=27740, maxDocs=2232857)
    0.0625 = fieldNorm(field=t_tag, doc=1977957)

Here is the 1.4.1 config for the t_tag field and text type:

  <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    </analyzer>
  </fieldtype>

  <dynamicField name="t_*" type="text" indexed="true" stored="true" required="false" multiValued="true" termVectors="true"/>

And the 3.6.0 schema config for the t_tag field and text type:

  <fieldtype name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldtype>

  <field name="t_tag" type="text" indexed="true" stored="true" required="false" multiValued="true"/>

At first I got distracted by this change between versions:

  LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default. This means that terms with a position increment gap of zero do not affect the norms calculation by default.

However, this doesn't appear to be causing the issue since, according to analysis.jsp, there is no overlap for t_tag.

Can you point me to where these fieldNorm differences are coming from, and why they would only be happening for a select few documents whose content doesn't stand out?

Thank you,
Aaron
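
P.S. For anyone who wants to reproduce the arithmetic above, here is a minimal Python sketch. It assumes the default length norm of 1/sqrt(numTerms) with no index-time boosts, and that the norm is then stored in a single byte the way I understand Lucene's SmallFloat.floatToByte315 / byte315ToFloat to work (truncating rather than rounding). The function names are my own, and I have not confirmed that this is what accounts for the difference between the two instances.

```python
import math
import struct


def float_to_byte315(f: float) -> int:
    """Assumed port of SmallFloat.floatToByte315: keep only the top bits
    of the float32 representation (truncating, not rounding)."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    small = bits >> (24 - 3)
    if small <= (63 - 15) << 3:
        # Underflow: zero/negative maps to 0, tiny positives map to 1.
        return 0 if bits <= 0 else 1
    if small >= ((63 - 15) << 3) + 0x100:
        # Overflow: clamp to the largest byte value.
        return 255
    return small - ((63 - 15) << 3)


def byte315_to_float(b: int) -> float:
    """Assumed port of SmallFloat.byte315ToFloat: expand the byte back
    into a float32 bit pattern."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def stored_norm(num_terms: int) -> float:
    """Length norm 1/sqrt(num_terms) after a round trip through one byte."""
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(num_terms)))


if __name__ == "__main__":
    for n in (162, 164):
        print(n, 1.0 / math.sqrt(n), stored_norm(n))
```

Under those assumptions, 162 terms decodes to 0.078125 and 164 terms decodes to 0.0625, so the two counts would sit on either side of an encoding step at roughly (1/0.078125)^2 = 163.84 terms.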