Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document

2012-07-19 Thread Robert Muir
On Thu, Jul 19, 2012 at 12:10 AM, Aaron Daubman daub...@gmail.com wrote:
 Greetings,

 I've been digging in to this for two days now and have come up short -
 hopefully there is some simple answer I am just not seeing:

 I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
 identically as possible (given deprecations) and indexing the same document.

Why did you do this? If you want the exact same scoring, use the exact
same analysis.
This means specifying luceneMatchVersion = 2.9, and the exact same
analysis components (even if deprecated).

 I have taken the field values for the example below and run them
 through /admin/analysis.jsp on each solr instance. Even for the problematic
 docs/fields, the results are almost identical. For the example below, the
 t_tag values for the problematic doc:
 1.4.1: 162 values
 3.6.0: 164 values


This is why: you changed your analysis.

-- 
lucidimagination.com


Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document

2012-07-19 Thread Aaron Daubman
Robert,

  I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
  identically as possible (given deprecations) and indexing the same document.

 Why did you do this? If you want the exact same scoring, use the exact
 same analysis.
 This means specifying luceneMatchVersion = 2.9, and the exact same
 analysis components (even if deprecated).

  I have taken the field values for the example below and run them
  through /admin/analysis.jsp on each solr instance. Even for the problematic
  docs/fields, the results are almost identical. For the example below, the
  t_tag values for the problematic doc:
  1.4.1: 162 values
  3.6.0: 164 values
 

 This is why: you changed your analysis.


Apologies if I didn't clearly state my goal/concern: I am not looking for
the exact same scoring. I am looking to explain scoring differences.
Deprecated components will eventually go away, time moves on, etc. I would
like to be able to run current code, and should be able to. The sticking
point is being able to *explain* the difference in results.

As you can see from my email, after running the two analysis chains on the
input, the output does not demonstrate (in any way that I can see) why the
fieldNorm values would be so different. Even with the different analysis,
the results are almost identical, which *should* result in an almost
identical fieldNorm, shouldn't it?

Again, the desire is not to be the same, it is to understand the difference.

Thanks,
 Aaron


Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document

2012-07-19 Thread Robert Muir
On Thu, Jul 19, 2012 at 11:11 AM, Aaron Daubman daub...@gmail.com wrote:

 Apologies if I didn't clearly state my goal/concern: I am not looking for
 the exact same scoring. I am looking to explain scoring differences.
 Deprecated components will eventually go away, time moves on, etc. I would
 like to be able to run current code, and should be able to. The sticking
 point is being able to *explain* the difference in results.


OK: I totally missed that, sorry!

To explain why you see such a large difference:

These length normalizations are computed at index time and, by default,
each one is stored in a *single byte*. This keeps RAM usage low when there
are many documents and many fields with norms (the cost is
#fieldsWithNorms * #documents bytes of RAM).
So the encoding is lossy: basically you can think of there being only 256
possible values. When you increased the number of terms only slightly by
changing your analysis, the raw norm (which shrinks as the term count
grows) happened to cross one of those boundaries and was rounded down to
the next lower representable value.
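
To make the lossiness concrete, here is a minimal Python sketch of the one-byte round trip. It is a re-implementation based on an understanding of the bit layout used by SmallFloat.floatToByte315 and byte315ToFloat, so treat it as an approximation of the library code and not a copy of it. The Python function names are made up for illustration, and the raw norm is assumed to be 1/sqrt(numTerms) with a boost of 1.0.

```python
import math
import struct


def float_to_byte315(f):
    """Quantize a float to one byte by keeping only its top bits (assumed layout)."""
    # Reinterpret the 32-bit float as a signed int, like Float.floatToRawIntBits.
    bits = struct.unpack("<i", struct.pack("<f", f))[0]
    small = bits >> 21              # discard the 21 low mantissa bits (truncation)
    zero = (63 - 15) << 3           # offset of the smallest representable value
    if small <= zero:
        return 0 if bits <= 0 else 1    # zero/negative -> 0, underflow -> 1
    if small >= zero + 0x100:
        return 255                      # overflow saturates at the largest byte
    return small - zero


def byte315_to_float(b):
    """Expand a stored byte back into the float it stands for."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def stored_norm(num_terms):
    """Norm 1/sqrt(num_terms) after the encode/decode round trip."""
    return byte315_to_float(float_to_byte315(1.0 / math.sqrt(num_terms)))


for n in (162, 164):
    print(n, float_to_byte315(1.0 / math.sqrt(n)), stored_norm(n))
```

Run as is, this should print `162 109 0.078125` and `164 108 0.0625`: two extra terms turn into a 20% drop in the stored norm.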

more information:
http://lucene.apache.org/core/3_6_0/scoring.html
http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

By the way, if you don't like this:
1. If you can still live with a single byte, you could plug your own
Similarity class into 3.6, overriding decodeNormValue/encodeNormValue.
For example, you could use a different SmallFloat configuration that
has less range but more precision for your use case (if your docs are
all short, or whatever).
2. Otherwise, if you feel you need more than a single byte, check out
4.0-ALPHA: you aren't limited to a single byte there.
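
As a rough illustration of option 1, the Python sketch below compares the default-style encoding with one that trades range for precision. It is a re-implementation based on an understanding of SmallFloat's general bit layout, not the library code. The function names are hypothetical, and the (5, 2) parameter pair is only an example choice, not a recommendation. Wiring such an encoding into a Similarity subclass is not shown.

```python
import math
import struct


def float_to_byte(f, mantissa_bits, zero_exp):
    """Quantize a float to one byte with a configurable precision/range split."""
    bits = struct.unpack("<i", struct.pack("<f", f))[0]
    small = bits >> (24 - mantissa_bits)        # keep only the top bits
    zero = (63 - zero_exp) << mantissa_bits     # smallest representable offset
    if small <= zero:
        return 0 if bits <= 0 else 1            # underflow clamps to byte 1
    if small >= zero + 0x100:
        return 255                              # overflow clamps to byte 255
    return small - zero


def byte_to_float(b, mantissa_bits, zero_exp):
    """Inverse of float_to_byte for the same parameters."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - mantissa_bits)) + ((63 - zero_exp) << 24)
    return struct.unpack("<f", struct.pack("<i", bits))[0]


def round_trip(num_terms, mantissa_bits, zero_exp):
    """Stored value of the norm 1/sqrt(num_terms) under the given encoding."""
    norm = 1.0 / math.sqrt(num_terms)
    encoded = float_to_byte(norm, mantissa_bits, zero_exp)
    return byte_to_float(encoded, mantissa_bits, zero_exp)


# Columns: field length, (3, 15) encoding, (5, 2) encoding.
for n in (162, 164, 5000):
    print(n, round_trip(n, 3, 15), round_trip(n, 5, 2))
```

Under these assumptions the 162 vs. 164 gap shrinks from 0.078125 vs. 0.0625 to 0.078125 vs. 0.07421875, but a 5000-term field underflows and is clamped to the smallest nonzero value (0.033203125), which is the "less range" half of the trade.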

-- 
lucidimagination.com


Re: Frustrating differences in fieldNorm between two different versions of solr indexing the same document

2012-07-19 Thread Aaron Daubman
Robert,

 So this is lossy: basically you can think of there being only 256
 possible values. So when you increased the number of terms only
 slightly by changing your analysis, this happened to bump you over the
 edge, rounding you down to the next lower value.

 more information:
 http://lucene.apache.org/core/3_6_0/scoring.html
 http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html



Thanks - this was extremely helpful! I had read both sources before but
didn't grasp the magnitude of the lossiness until your pointer and mention
of the edge case.
Just to help out anybody else who might run into this, I hacked together a
little harness to demonstrate:
---
fieldLength: 160, computeNorm: 0.07905694, floatToByte315: 109, byte315ToFloat: 0.078125
fieldLength: 161, computeNorm: 0.07881104, floatToByte315: 109, byte315ToFloat: 0.078125
fieldLength: 162, computeNorm: 0.07856742, floatToByte315: 109, byte315ToFloat: 0.078125
fieldLength: 163, computeNorm: 0.07832605, floatToByte315: 109, byte315ToFloat: 0.078125
fieldLength: 164, computeNorm: 0.07808688, floatToByte315: 108, byte315ToFloat: 0.0625
fieldLength: 165, computeNorm: 0.077849895, floatToByte315: 108, byte315ToFloat: 0.0625
fieldLength: 166, computeNorm: 0.07761505, floatToByte315: 108, byte315ToFloat: 0.0625
---

So my takeaway is that these significantly different scores are caused by:
1) a field whose length sits right on one of these boundaries, so that the
two analyzer chains land on opposite sides of it
2) the fact that we might be searching for matches from 50+ values against
a field with 150+ values, so the overall score is impacted repeatedly (once
per matching term) by the otherwise typically insignificant change in
fieldNorm value
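
For anyone wanting to check whether a field length sits near such a boundary, here is a small Python sketch that scans for the range of lengths sharing one encoded byte. The encoder is a re-implementation from an understanding of floatToByte315's bit layout (an assumption, not the Lucene source), the helper name `lengths_sharing_byte` is hypothetical, and the norm is assumed to be 1/sqrt(fieldLength) with boost 1.0.

```python
import math
import struct


def float_to_byte315(f):
    """One-byte encoding of a float, keeping only its top bits (assumed layout)."""
    bits = struct.unpack("<i", struct.pack("<f", f))[0]
    small = bits >> 21
    zero = (63 - 15) << 3
    if small <= zero:
        return 0 if bits <= 0 else 1
    if small >= zero + 0x100:
        return 255
    return small - zero


def lengths_sharing_byte(target_byte, max_len=2000):
    """Return (shortest, longest) field length whose norm encodes to target_byte."""
    hits = [n for n in range(1, max_len + 1)
            if float_to_byte315(1.0 / math.sqrt(n)) == target_byte]
    return (hits[0], hits[-1]) if hits else None


print(lengths_sharing_byte(109))  # bucket containing the 162-term field
print(lengths_sharing_byte(108))  # bucket containing the 164-term field
```

Under these assumptions the two buckets come out as (114, 163) and (164, 256). That also accounts for the 256 in the original question: every length from 164 through 256 decodes to the same 0.0625.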

Thanks again,
 Aaron


Frustrating differences in fieldNorm between two different versions of solr indexing the same document

2012-07-18 Thread Aaron Daubman
Greetings,

I've been digging into this for two days now and have come up short -
hopefully there is some simple answer I am just not seeing:

I have a solr 1.4.1 instance and a solr 3.6.0 instance, both configured as
identically as possible (given deprecations) and indexing the same document.

For most queries the results are very close (scores agreeing to about three
significant digits, almost identical positions in results).

However, for certain documents, the scores are very different (causing
these docs to be ranked +/- 25 positions different or more in the results).

In looking at debugQuery output, it seems like this is due to fieldNorm
values being lower for the 3.6.0 instance than the 1.4.1.

(note that for most docs, the fieldNorms are identical)

I have taken the field values for the example below and run them
through /admin/analysis.jsp on each solr instance. Even for the problematic
docs/fields, the results are almost identical. For the example below, the
t_tag values for the problematic doc:
1.4.1: 162 values
3.6.0: 164 values

Note that 1/sqrt(162) = 0.07857 ~= fieldNorm for 1.4.1 (0.078125);
however, (1/0.0625)^2 = 256, which is nowhere near 164.

Here is a particular example from 1.4.1:
1.6263733 = (MATCH) fieldWeight(t_tag:soul in 2066419), product of:
   3.8729835 = tf(termFreq(t_tag:soul)=15)
   5.3750753 = idf(docFreq=27619, maxDocs=2194294)
   0.078125 = fieldNorm(field=t_tag, doc=2066419)

And the same from 3.6.0:
1.3042576 = (MATCH) fieldWeight(t_tag:soul in 1977957), product of:
   3.8729835 = tf(termFreq(t_tag:soul)=15)
   5.388126 = idf(docFreq=27740, maxDocs=2232857)
   0.0625 = fieldNorm(field=t_tag, doc=1977957)
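
As a quick arithmetic check on the two explain outputs above, the Python sketch below recomputes fieldWeight as tf * idf * fieldNorm, with tf = sqrt(termFreq) as the tf lines imply. The function name `field_weight` is made up for illustration; the numbers are copied from the explain output.

```python
import math


def field_weight(term_freq, idf, field_norm):
    """fieldWeight as the product shown in the explain output."""
    tf = math.sqrt(term_freq)       # sqrt(15) ~= 3.8729835
    return tf * idf * field_norm


old = field_weight(15, 5.3750753, 0.078125)   # values from the 1.4.1 explain
new = field_weight(15, 5.388126, 0.0625)      # values from the 3.6.0 explain

print(round(old, 4), round(new, 4), round(new / old, 3))
```

The products come out at about 1.6264 and 1.3043, a ratio of roughly 0.802. The fieldNorm ratio alone is 0.0625 / 0.078125 = 0.8, so nearly all of the score gap comes from fieldNorm and almost none from the slightly different idf.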


Here is the 1.4.1 config for the t_tag field and text type:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldtype>
<dynamicField name="t_*" type="text" indexed="true" stored="true"
              required="false" multiValued="true" termVectors="true"/>


And 3.6.0 schema config for the t_tag field and text type:
<fieldtype name="text" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
<field name="t_tag" type="text" indexed="true" stored="true"
       required="false" multiValued="true"/>

I at first got distracted by this change between versions:
LUCENE-2286: Enabled DefaultSimilarity.setDiscountOverlaps by default. This
means that terms with a position increment gap of zero do not affect the
norms calculation by default.
However, this doesn't appear to be causing the issue since, according to
analysis.jsp, there are no overlapping tokens for t_tag...
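
For reference, here is a minimal Python sketch of how discounting overlaps would change the length fed into the norm. It is modeled on an understanding of DefaultSimilarity.computeNorm in 3.x, not copied from it. Representing a field as a list of position increments, and the name `compute_norm`, are assumptions made for illustration.

```python
import math


def compute_norm(position_increments, discount_overlaps, boost=1.0):
    """Raw (pre-encoding) length norm for one field.

    position_increments holds one entry per token; an increment of 0 means
    the token is stacked on the previous position (e.g. an injected synonym).
    """
    length = len(position_increments)
    overlaps = sum(1 for inc in position_increments if inc == 0)
    # With discounting on, stacked tokens do not count toward the length.
    num_terms = length - overlaps if discount_overlaps else length
    return boost / math.sqrt(num_terms)


stacked = [1, 1, 0, 1]          # four tokens, one of them an overlap
plain = [1] * 164               # no overlaps, as analysis.jsp shows for t_tag

print(compute_norm(stacked, True), compute_norm(stacked, False))
print(compute_norm(plain, True) == compute_norm(plain, False))
```

With no zero-increment tokens the flag makes no difference, which fits the observation that this change does not explain the t_tag discrepancy.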

Can you point me to where these fieldNorm differences are coming from and
why they'd only be happening for a select few documents for which the
content doesn't stand out?

Thank you,
 Aaron