Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document
Robert,

> So this is lossy: basically you can think of there being only 256
> possible values. So when you increased the number of terms only
> slightly by changing your analysis, this happened to bump you over the
> edge rounding you up to the next value.
>
> more information:
> http://lucene.apache.org/core/3_6_0/scoring.html
> http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

Thanks - this was extremely helpful! I had read both sources before but didn't grasp the magnitude of the lossiness until your pointer and mention of the edge case.

Just to help out anybody else who might run into this, I hacked together a little harness to demonstrate:

---
fieldLength: 160, computeNorm: 0.07905694, floatToByte315: 109, byte315ToFloat: 0.078125
fieldLength: 161, computeNorm: 0.07881104, floatToByte315: 109, byte315ToFloat: 0.078125
fieldLength: 162, computeNorm: 0.07856742, floatToByte315: 109, byte315ToFloat: 0.078125
fieldLength: 163, computeNorm: 0.07832605, floatToByte315: 109, byte315ToFloat: 0.078125
fieldLength: 164, computeNorm: 0.07808688, floatToByte315: 108, byte315ToFloat: 0.0625
fieldLength: 165, computeNorm: 0.077849895, floatToByte315: 108, byte315ToFloat: 0.0625
fieldLength: 166, computeNorm: 0.07761505, floatToByte315: 108, byte315ToFloat: 0.0625
---

So my takeaway is that these significantly varying scores are caused by:

1) a field whose length falls right on this boundary between the two analyzer chains
2) the fact that we might be searching for matches from 50+ values to a field with 150+ values, so the overall score is repeatedly impacted by the otherwise typically insignificant change in fieldNorm value

Thanks again,
Aaron
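For anyone who wants to reproduce the table above without a Lucene checkout, here is a minimal Python sketch of the same round trip. It assumes the norm is simply 1/sqrt(fieldLength) with no boost, and it re-implements the bit manipulation of Lucene's SmallFloat.floatToByte315 / byte315ToFloat as I understand it. The Python function names are made up for this sketch and are not Lucene API.

```python
import math
import struct


def float_to_byte315(f: float) -> int:
    """Encode a float into one unsigned byte (0..255), mirroring floatToByte315."""
    # Reinterpret the value as a signed 32-bit IEEE 754 float bit pattern.
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    small = bits >> 21                 # drop the low 21 bits (this is the lossy step)
    fzero = (63 - 15) << 3             # offset for the smallest representable exponent
    if small <= fzero:
        return 0 if bits <= 0 else 1   # zero/negative -> 0, tiny positive -> 1
    if small >= fzero + 0x100:
        return 255                     # overflow saturates (Java returns (byte) -1)
    return small - fzero


def byte315_to_float(b: int) -> float:
    """Decode the single byte back into a float, mirroring byte315ToFloat."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


if __name__ == "__main__":
    for field_length in range(160, 167):
        norm = 1.0 / math.sqrt(field_length)   # assumed: no index-time boost
        encoded = float_to_byte315(norm)
        decoded = byte315_to_float(encoded)
        print(f"fieldLength: {field_length}, computeNorm: {norm:.8f}, "
              f"floatToByte315: {encoded}, byte315ToFloat: {decoded}")
```

The encoder only shifts bits away, so every norm between two representable values collapses onto the lower one; 0.078125 corresponds to 1/sqrt(163.84), which is why 163 and 164 terms land on different bytes.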
Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document
On Thu, Jul 19, 2012 at 11:11 AM, Aaron Daubman wrote:

> Apologies if I didn't clearly state my goal/concern: I am not looking for
> the exact same scoring - I am looking to explain scoring differences.
> Deprecated components will eventually go away, time moves on, etc...
> etc... I would like to be able to run current code, and should be able to -
> the part that is sticking is being able to *explain* the difference in
> results.

OK: I totally missed that, sorry!

To explain why you see such a large difference: these length normalizations are computed at index time and fit inside a *single byte* by default. This is to keep RAM usage low for many documents and many fields with norms (since it's #fieldsWithNorms * #documents in bytes in RAM).

So this is lossy: basically you can think of there being only 256 possible values. So when you increased the number of terms only slightly by changing your analysis, this happened to bump you over the edge rounding you up to the next value.

more information:
http://lucene.apache.org/core/3_6_0/scoring.html
http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

By the way, if you don't like this:

1. If you can still live with a single byte, maybe plug your own Similarity class into 3.6, overriding decodeNormValue/encodeNormValue. For example, you could use a different SmallFloat configuration that has less range but more precision for your use case (if your docs are all short or whatever).
2. Otherwise, if you feel you need more than a single byte, check out 4.0-ALPHA: you aren't limited to a single byte there.

--
lucidimagination.com
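To make option 1 a bit more concrete, here is a rough Python sketch comparing the default layout (3 mantissa bits, zero exponent 15) with a 5-mantissa-bit, zero-exponent-2 layout. As far as I know SmallFloat also ships floatToByte52/byte52ToFloat for the latter, but treat that and the generic encode/decode below as my reading of the code rather than a reference; the Python names are invented for the example.

```python
import math
import struct


def float_to_byte(f: float, mantissa_bits: int, zero_exp: int) -> int:
    """Generic single-byte float encoding (unsigned result, 0..255)."""
    fzero = (63 - zero_exp) << mantissa_bits
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    small = bits >> (24 - mantissa_bits)   # more mantissa bits -> smaller shift
    if small <= fzero:
        return 0 if bits <= 0 else 1       # underflow collapses to 0 or 1
    if small >= fzero + 0x100:
        return 255                         # overflow saturates
    return small - fzero


def byte_to_float(b: int, mantissa_bits: int, zero_exp: int) -> float:
    """Inverse of float_to_byte for the same layout."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - mantissa_bits)) + ((63 - zero_exp) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def round_trip(field_length: int, mantissa_bits: int, zero_exp: int) -> float:
    """Norm a document of this length would get back at search time (no boost assumed)."""
    norm = 1.0 / math.sqrt(field_length)
    return byte_to_float(float_to_byte(norm, mantissa_bits, zero_exp),
                         mantissa_bits, zero_exp)


if __name__ == "__main__":
    for n in (162, 164):
        print(n, "default 3/15:", round_trip(n, 3, 15), " alt 5/2:", round_trip(n, 5, 2))
    # Smallest positive norm each layout can represent (byte value 1).
    print("min 3/15:", byte_to_float(1, 3, 15), " min 5/2:", byte_to_float(1, 5, 2))
```

With the 5/2 layout the step across the 163/164 boundary shrinks from 0.078125 -> 0.0625 to 0.078125 -> 0.07421875. The price is range: the smallest positive value is about 0.033, so fields longer than roughly 900 terms would all share one norm.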
Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document
Robert,

> > I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
> > identically as possible (given deprecations) and indexing the same
> > document.
>
> Why did you do this? If you want the exact same scoring, use the exact
> same analysis.
> This means specifying luceneMatchVersion = 2.9, and the exact same
> analysis components (even if deprecated).
>
> > I have taken the field values for the example below and run them
> > through /admin/analysis.jsp on each solr instance. Even for the
> > problematic docs/fields, the results are almost identical. For the
> > example below, the t_tag values for the problematic doc:
> > 1.4.1: 162 values
> > 3.6.0: 164 values
>
> This is why: you changed your analysis.

Apologies if I didn't clearly state my goal/concern: I am not looking for the exact same scoring - I am looking to explain scoring differences. Deprecated components will eventually go away, time moves on, etc... etc... I would like to be able to run current code, and should be able to - the part that is sticking is being able to *explain* the difference in results.

As you can see from my email, after running the different analysis on the input, the output does not demonstrate (in any way that I can see) why the fieldNorm values would be so different. Even with the different analysis, the results are almost identical - which *should* result in an almost identical fieldNorm???

Again, the desire is not to be the same, it is to understand the difference.

Thanks,
Aaron
Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document
On Thu, Jul 19, 2012 at 12:10 AM, Aaron Daubman wrote:

> Greetings,
>
> I've been digging in to this for two days now and have come up short -
> hopefully there is some simple answer I am just not seeing:
>
> I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
> identically as possible (given deprecations) and indexing the same document.

Why did you do this? If you want the exact same scoring, use the exact same analysis. This means specifying luceneMatchVersion = 2.9, and the exact same analysis components (even if deprecated).

> I have taken the field values for the example below and run them
> through /admin/analysis.jsp on each solr instance. Even for the problematic
> docs/fields, the results are almost identical. For the example below, the
> t_tag values for the problematic doc:
> 1.4.1: 162 values
> 3.6.0: 164 values

This is why: you changed your analysis.

--
lucidimagination.com
Frustrating differences in fieldNorm between two different versions of solr indexing the same document
Greetings,

I've been digging into this for two days now and have come up short - hopefully there is some simple answer I am just not seeing:

I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as identically as possible (given deprecations) and indexing the same document.

For most queries the results are very close (scoring within three significant digits, almost identical positions in results). However, for certain documents, the scores are very different (causing these docs to be ranked +/- 25 positions different or more in the results).

Looking at the debugQuery output, it seems like this is due to fieldNorm values being lower for the 3.6.0 instance than for the 1.4.1 instance. (Note that for most docs, the fieldNorms are identical.)

I have taken the field values for the example below and run them through /admin/analysis.jsp on each solr instance. Even for the problematic docs/fields, the results are almost identical. For the example below, the t_tag values for the problematic doc:

1.4.1: 162 values
3.6.0: 164 values

Note that 1/sqrt(162) = 0.07857 ~= fieldNorm for 1.4.1; however, (1/0.0625)^2 = 256, which is nowhere near 164.

Here is a particular example from 1.4.1:

1.6263733 = (MATCH) fieldWeight(t_tag:soul in 2066419), product of:
  3.8729835 = tf(termFreq(t_tag:soul)=15)
  5.3750753 = idf(docFreq=27619, maxDocs=2194294)
  0.078125 = fieldNorm(field=t_tag, doc=2066419)

And the same from 3.6.0:

1.3042576 = (MATCH) fieldWeight(t_tag:soul in 1977957), product of:
  3.8729835 = tf(termFreq(t_tag:soul)=15)
  5.388126 = idf(docFreq=27740, maxDocs=2232857)
  0.0625 = fieldNorm(field=t_tag, doc=1977957)

Here is the 1.4.1 config for the t_tag field and text type:

And 3.6.0 schema config for the t_tag field and text type:

I at first got distracted by this change between versions:

LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default. This means that terms with a position increment gap of zero do not affect the norms calculation by default.

However, this doesn't appear to be causing the issue as, according to analysis.jsp, there is no overlap for t_tag...

Can you point me to where these fieldNorm differences are coming from and why they'd only be happening for a select few documents for which the content doesn't stand out?

Thank you,
Aaron
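As a sanity check on the two explain outputs above, the arithmetic can be replayed in a few lines of Python. This is only a sketch: it assumes fieldWeight is the plain product tf * idf * fieldNorm with tf = sqrt(termFreq), which is what the "product of" lines show, and the function name is made up for the example.

```python
import math


def field_weight(term_freq: int, idf: float, field_norm: float) -> float:
    """fieldWeight as shown in the explain output: tf * idf * fieldNorm."""
    tf = math.sqrt(term_freq)          # 3.8729835 for termFreq=15
    return tf * idf * field_norm


# Values copied from the two debugQuery outputs.
score_141 = field_weight(15, 5.3750753, 0.078125)
score_360 = field_weight(15, 5.388126, 0.0625)

print(f"1.4.1 fieldWeight: {score_141:.7f}")
print(f"3.6.0 fieldWeight: {score_360:.7f}")

# The fieldNorm alone accounts for a 20% drop per matching term.
print(f"fieldNorm ratio: {0.0625 / 0.078125}")

# Field lengths the two stored norms would correspond to exactly.
print(f"implied length for 0.078125: {(1 / 0.078125) ** 2}")
print(f"implied length for 0.0625:   {(1 / 0.0625) ** 2}")
```

The idf values differ only slightly between the two indexes, so nearly all of the gap between 1.6263733 and 1.3042576 comes from the fieldNorm factor.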