Thanks, but the "sweet spot" in our metadata cannot be derived from its length. I'm rather searching for a similarity class that uses more than one byte to encode its field norm. I keep on digging ;-)

________________________________________
From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Erik Hatcher [erikhatc...@mac.com]
Sent: Thursday, September 26, 2013 3:07 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] solr computation field norm problem
Nicolas - Lucene 4 still encodes norms, as described here:

<http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/similarities/DefaultSimilarity.html#encodeNormValue%28float%29>

using this function:

<http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/util/SmallFloat.html#floatToByte315%28float%29>

You might want to give SweetSpotSimilarity a try:

<http://lucene.apache.org/core/4_4_0/misc/org/apache/lucene/misc/SweetSpotSimilarity.html>

        Erik

On Sep 26, 2013, at 8:02 AM, Nicolas Franck <nicolas.fra...@ugent.be> wrote:

> I've been testing with Solr 4 (Lucene 4), which uses the new DefaultSimilarity class.
> It no longer uses the "encodeNorm" and "decodeNorm" methods that caused all the
> trouble (storing the floats as a single byte). But it doesn't change anything:
> the field norms remain the same.
>
> ________________________________________
> From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Chris Fitzpatrick [chrisfitz...@gmail.com]
> Sent: Wednesday, September 25, 2013 7:57 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] solr computation field norm problem
>
> Yeah... I think you're running into this:
>
> http://lucene.472066.n3.nabble.com/field-length-normalization-tp495308p495311.html
>
> TL;DR:
> Jay Hill says fields with 3 terms and 4 terms both score at 0.5 in the lengthNorm.
>
> On Wed, Sep 25, 2013 at 4:21 PM, Nicolas Franck <nicolas.fra...@ugent.be> wrote:
>
>> Hi there,
>>
>> I have a question about the way Lucene computes the length norm part of the
>> field norm for its documents. My documents are indexed using Solr.
>> These are the documents that were indexed (ignore "score", which is not part
>> of the document itself):
>>
>> <doc>
>>   <float name="score">1.00711</float>
>>   <str name="_id">ejn01:2560000000075596</str>
>>   <str name="title">Journal of neurology research</str>
>> </doc>
>> <doc>
>>   <float name="score">1.00711</float>
>>   <str name="_id">ejn01:954925518616</str>
>>   <str name="title">Journal of neurology</str>
>> </doc>
>>
>> The field "title" has the following type definition in schema.xml:
>>
>> <fieldType name="utf8text" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="1024"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>>             format="solr" ignoreCase="false" expand="true"
>>             tokenizerFactory="solr.WhitespaceTokenizerFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="1024"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>>             format="solr" ignoreCase="false" expand="true"
>>             tokenizerFactory="solr.WhitespaceTokenizerFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> If I use the query "journal of neurology", both documents get the same score,
>> although the second document is an exact match. Supplying a phrase query does
>> not fix the issue. I also see that the computed fieldNorm is "0.5" for both
>> documents. Does this have something to do with the loss of precision when
>> storing the length norm in one byte?
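[Editor's note: in Solr 4 a Similarity can be attached per fieldType in schema.xml, which is how the SweetSpotSimilarity suggestion made earlier in this thread would be wired up. A hedged sketch follows: solr.SweetSpotSimilarityFactory is a real factory class, but the lengthNorm* parameter names and values below are assumptions to verify against the factory documentation for your exact Solr 4.x release.]

```xml
<!-- Sketch only: attach a per-field Similarity to the "utf8text" type.
     Parameter names/values are assumptions; check the
     SweetSpotSimilarityFactory docs for your Solr 4.x release. -->
<fieldType name="utf8text" class="solr.TextField"
           positionIncrementGap="100" omitNorms="false">
  <analyzer> <!-- same index/query analyzers as above --> </analyzer>
  <similarity class="solr.SweetSpotSimilarityFactory">
    <!-- titles between 1 and 5 terms would all get the maximal
         ("sweet spot") length norm instead of 1/sqrt(numTerms) -->
    <int name="lengthNormMin">1</int>
    <int name="lengthNormMax">5</int>
    <float name="lengthNormSteepness">0.5</float>
  </similarity>
</fieldType>
```

Caveat: this changes how the length norm is computed, not how it is stored. The result still passes through the same single-byte encoding, which appears to be why the reply at the top of the thread keeps looking for a similarity class with a wider norm.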
>> These are all the supplied parameters (defaults in solrconfig.xml):
>>
>> <str name="lowercaseOperators">false</str>
>> <str name="mm">-10%</str>
>> <str name="pf">author^3 title^2</str>
>> <str name="sort">score desc</str>
>> <arr name="bq">
>>   <str>source:ser01^10</str>
>>   <str>source:ejn01^10</str>
>>   <str>(*:* -type:article)^999</str>
>> </arr>
>> <str name="echoParams">all</str>
>> <str name="df">all</str>
>> <str name="tie">0</str>
>> <str name="qf">
>>   author^15 title^10 subject^1 summary^1 library^1 location^1 publisher^1
>>   place_published^1 issn^1 isbn^1
>> </str>
>> <str name="q.alt">*:*</str>
>> <str name="ps">2</str>
>> <str name="defType">edismax</str>
>> <str name="q">journal of neurology</str>
>>
>> Looking at the computation of the score, I see no difference at all between
>> the two documents (see below).
>> Any idea why the fieldNorm is the same for both documents?
>>
>> Thanks in advance!
>>
>> Greetings,
>>
>> Nicolas
>>
>> <str name="ejn01:2560000000075596">
>> 1.0071099 = (MATCH) sum of:
>>   0.0053001107 = (MATCH) sum of:
>>     0.0017667036 = (MATCH) max of:
>>       0.0017667036 = (MATCH) weight(title:journal^10.0 in 0), product of:
>>         0.005943145 = queryWeight(title:journal^10.0), product of:
>>           10.0 = boost
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           9.996294E-4 = queryNorm
>>         0.29726744 = (MATCH) fieldWeight(title:journal in 0), product of:
>>           1.0 = tf(termFreq(title:journal)=1)
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           0.5 = fieldNorm(field=title, doc=0)
>>     0.0017667036 = (MATCH) max of:
>>       0.0017667036 = (MATCH) weight(title:of^10.0 in 0), product of:
>>         0.005943145 = queryWeight(title:of^10.0), product of:
>>           10.0 = boost
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           9.996294E-4 = queryNorm
>>         0.29726744 = (MATCH) fieldWeight(title:of in 0), product of:
>>           1.0 = tf(termFreq(title:of)=1)
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           0.5 = fieldNorm(field=title, doc=0)
>>     0.0017667036 = (MATCH) max of:
>>       0.0017667036 = (MATCH) weight(title:neurology^10.0 in 0), product of:
>>         0.005943145 = queryWeight(title:neurology^10.0), product of:
>>           10.0 = boost
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           9.996294E-4 = queryNorm
>>         0.29726744 = (MATCH) fieldWeight(title:neurology in 0), product of:
>>           1.0 = tf(termFreq(title:neurology)=1)
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           0.5 = fieldNorm(field=title, doc=0)
>>     0.0031800664 = (MATCH) max of:
>>       0.0031800664 = (MATCH) weight(title:"journal of neurology"~2^2.0 in 0), product of:
>>         0.0035658872 = queryWeight(title:"journal of neurology"~2^2.0), product of:
>>           2.0 = boost
>>           1.7836046 = idf(title: journal=2 of=2 neurology=2)
>>           9.996294E-4 = queryNorm
>>         0.8918023 = fieldWeight(title:"journal of neurology" in 0), product of:
>>           1.0 = tf(phraseFreq=1.0)
>>           1.7836046 = idf(title: journal=2 of=2 neurology=2)
>>           0.5 = fieldNorm(field=title, doc=0)
>>   0.99862975 = (MATCH) sum of:
>>     0.99862975 = (MATCH) MatchAllDocsQuery, product of:
>>       0.99862975 = queryNorm
>> </str>
>> <str name="ejn01:954925518616">
>> 1.0071099 = (MATCH) sum of:
>>   0.0053001107 = (MATCH) sum of:
>>     0.0017667036 = (MATCH) max of:
>>       0.0017667036 = (MATCH) weight(title:journal^10.0 in 1), product of:
>>         0.005943145 = queryWeight(title:journal^10.0), product of:
>>           10.0 = boost
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           9.996294E-4 = queryNorm
>>         0.29726744 = (MATCH) fieldWeight(title:journal in 1), product of:
>>           1.0 = tf(termFreq(title:journal)=1)
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           0.5 = fieldNorm(field=title, doc=1)
>>     0.0017667036 = (MATCH) max of:
>>       0.0017667036 = (MATCH) weight(title:of^10.0 in 1), product of:
>>         0.005943145 = queryWeight(title:of^10.0), product of:
>>           10.0 = boost
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           9.996294E-4 = queryNorm
>>         0.29726744 = (MATCH) fieldWeight(title:of in 1), product of:
>>           1.0 = tf(termFreq(title:of)=1)
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           0.5 = fieldNorm(field=title, doc=1)
>>     0.0017667036 = (MATCH) max of:
>>       0.0017667036 = (MATCH) weight(title:neurology^10.0 in 1), product of:
>>         0.005943145 = queryWeight(title:neurology^10.0), product of:
>>           10.0 = boost
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           9.996294E-4 = queryNorm
>>         0.29726744 = (MATCH) fieldWeight(title:neurology in 1), product of:
>>           1.0 = tf(termFreq(title:neurology)=1)
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           0.5 = fieldNorm(field=title, doc=1)
>>     0.0031800664 = (MATCH) max of:
>>       0.0031800664 = (MATCH) weight(title:"journal of neurology"~2^2.0 in 1), product of:
>>         0.0035658872 = queryWeight(title:"journal of neurology"~2^2.0), product of:
>>           2.0 = boost
>>           1.7836046 = idf(title: journal=2 of=2 neurology=2)
>>           9.996294E-4 = queryNorm
>>         0.8918023 = fieldWeight(title:"journal of neurology" in 1), product of:
>>           1.0 = tf(phraseFreq=1.0)
>>           1.7836046 = idf(title: journal=2 of=2 neurology=2)
>>           0.5 = fieldNorm(field=title, doc=1)
>>   0.99862975 = (MATCH) sum of:
>>     0.99862975 = (MATCH) MatchAllDocsQuery, product of:
>>       0.99862975 = queryNorm
>> </str>
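[Editor's note: both explain trees above report fieldNorm(field=title)=0.5 even though the titles have 3 and 4 terms, i.e. raw length norms of 1/sqrt(3) ≈ 0.577 and 1/sqrt(4) = 0.5. That is the single-byte precision loss the thread is circling around, and it can be reproduced outside Lucene. Below is a small Python port, for illustration only (Lucene itself is Java), of the SmallFloat.floatToByte315 / byte315ToFloat logic linked earlier in the thread.]

```python
# Python port of Lucene's SmallFloat.floatToByte315 / byte315ToFloat,
# which DefaultSimilarity uses to squeeze the length norm into one byte.
import math
import struct

def float_to_byte315(f):
    # Raw IEEE-754 bits of the 32-bit float, as a signed int
    # (equivalent to Java's Float.floatToRawIntBits).
    bits = struct.unpack('>i', struct.pack('>f', f))[0]
    smallfloat = bits >> (24 - 3)              # drop low mantissa bits
    if smallfloat <= (63 - 15) << 3:
        return 0 if bits <= 0 else 1           # underflow: smallest positive
    if smallfloat >= ((63 - 15) << 3) + 0x100:
        return 0xFF                            # overflow: saturate
    return smallfloat - ((63 - 15) << 3)

def byte315_to_float(b):
    if b == 0:
        return 0.0
    bits = (b & 0xFF) << (24 - 3)
    bits += (63 - 15) << 24
    return struct.unpack('>f', struct.pack('>i', bits))[0]

# DefaultSimilarity's lengthNorm is 1/sqrt(numTerms):
norm3 = 1 / math.sqrt(3)  # "Journal of neurology"          -> 0.577...
norm4 = 1 / math.sqrt(4)  # "Journal of neurology research" -> 0.5

b3, b4 = float_to_byte315(norm3), float_to_byte315(norm4)
print(b3, b4)              # -> 120 120 (same byte for both)
print(byte315_to_float(b3))  # -> 0.5
```

Because the stored byte is a tiny float with only a few mantissa bits, every raw norm in [0.5, 0.625) encodes to the same byte and decodes back to exactly 0.5, so 3-term and 4-term titles become indistinguishable after indexing.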