Thanks, but the "sweet spot" in our metadata cannot be derived from its length. I'm rather searching for a similarity class that uses more than one byte to encode its field norm. I keep on digging ;-)

________________________________________
From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Erik Hatcher [erikhatc...@mac.com]
Sent: Thursday, September 26, 2013 3:07 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] solr computation field norm problem
Nicolas - Lucene 4 still encodes norms, as described here:

<http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/similarities/DefaultSimilarity.html#encodeNormValue%28float%29>

using this function:

<http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/util/SmallFloat.html#floatToByte315%28float%29>

You might want to give SweetSpotSimilarity a try:

<http://lucene.apache.org/core/4_4_0/misc/org/apache/lucene/misc/SweetSpotSimilarity.html>

        Erik

On Sep 26, 2013, at 8:02 AM, Nicolas Franck <nicolas.fra...@ugent.be> wrote:

> I've been testing with Solr 4 (Lucene 4), which uses the new DefaultSimilarity class.
> It no longer uses the "encodeNorm" and "decodeNorm" methods that caused all the
> trouble (storing the floats as a single byte). But it doesn't change anything:
> the field norms remain the same.
>
> ________________________________________
> From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Chris Fitzpatrick [chrisfitz...@gmail.com]
> Sent: Wednesday, September 25, 2013 7:57 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] solr computation field norm problem
>
> Yeah... I think you're running into this:
>
> http://lucene.472066.n3.nabble.com/field-length-normalization-tp495308p495311.html
>
> TL;DR:
> Jay Hill says fields with 3 terms and 4 terms both score at 0.5 in the lengthNorm.
>
> On Wed, Sep 25, 2013 at 4:21 PM, Nicolas Franck <nicolas.fra...@ugent.be> wrote:
>
>> Hi there,
>>
>> I have a question about the way Lucene computes the length norm part of the
>> field norm for its documents. My documents are indexed using Solr.
>> These are the documents that were indexed (ignore "score", which is not part
>> of the document itself):
>>
>> <doc>
>>   <float name="score">1.00711</float>
>>   <str name="_id">ejn01:2560000000075596</str>
>>   <str name="title">Journal of neurology research</str>
>> </doc>
>> <doc>
>>   <float name="score">1.00711</float>
>>   <str name="_id">ejn01:954925518616</str>
>>   <str name="title">Journal of neurology</str>
>> </doc>
>>
>> The field "title" has the following type definition in schema.xml:
>>
>> <fieldType name="utf8text" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
>>   <analyzer type="index">
>>     <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="1024"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>>             format="solr" ignoreCase="false" expand="true"
>>             tokenizerFactory="solr.WhitespaceTokenizerFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="1024"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt"
>>             format="solr" ignoreCase="false" expand="true"
>>             tokenizerFactory="solr.WhitespaceTokenizerFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> If I use the query "journal of neurology", both documents get the same score,
>> although the second document is an exact match. Supplying a phrase query does
>> not fix the issue. I also see that the computed fieldNorm is "0.5" for both
>> documents. Does this have something to do with the loss of precision when
>> storing the length norm in one byte?
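[Editor's note: in Solr 4 a Similarity can be attached per fieldType in schema.xml, which is how the SweetSpotSimilarity suggestion made earlier in this thread would be wired up. A hedged sketch follows: solr.SweetSpotSimilarityFactory is a real factory class, but the lengthNorm* parameter names and values below are assumptions to verify against the factory documentation for your exact Solr 4.x release.]

```xml
<!-- Sketch only: attach a per-field Similarity to the "utf8text" type.
     Parameter names/values are assumptions; check the
     SweetSpotSimilarityFactory docs for your Solr 4.x release. -->
<fieldType name="utf8text" class="solr.TextField"
           positionIncrementGap="100" omitNorms="false">
  <analyzer> <!-- same index/query analyzers as above --> </analyzer>
  <similarity class="solr.SweetSpotSimilarityFactory">
    <!-- titles between 1 and 5 terms would all get the maximal
         ("sweet spot") length norm instead of 1/sqrt(numTerms) -->
    <int name="lengthNormMin">1</int>
    <int name="lengthNormMax">5</int>
    <float name="lengthNormSteepness">0.5</float>
  </similarity>
</fieldType>
```

Caveat: this changes how the length norm is computed, not how it is stored. The result still passes through the same single-byte encoding, which appears to be why the reply at the top of the thread keeps looking for a similarity class with a wider norm.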
>> These are all the supplied parameters (defaults in solrconfig.xml):
>>
>> <str name="lowercaseOperators">false</str>
>> <str name="mm">-10%</str>
>> <str name="pf">author^3 title^2</str>
>> <str name="sort">score desc</str>
>> <arr name="bq">
>>   <str>source:ser01^10</str>
>>   <str>source:ejn01^10</str>
>>   <str>(*:* -type:article)^999</str>
>> </arr>
>> <str name="echoParams">all</str>
>> <str name="df">all</str>
>> <str name="tie">0</str>
>> <str name="qf">
>>   author^15 title^10 subject^1 summary^1 library^1 location^1 publisher^1
>>   place_published^1 issn^1 isbn^1
>> </str>
>> <str name="q.alt">*:*</str>
>> <str name="ps">2</str>
>> <str name="defType">edismax</str>
>> <str name="q">journal of neurology</str>
>>
>> Looking at the computation of the score, I see no difference at all between
>> the two documents (see below).
>> Any idea why the fieldNorm is the same for both documents?
>>
>> Thanks in advance!
>>
>> Greetings,
>>
>> Nicolas
>>
>> <str name="ejn01:2560000000075596">
>> 1.0071099 = (MATCH) sum of:
>>   0.0053001107 = (MATCH) sum of:
>>     0.0017667036 = (MATCH) max of:
>>       0.0017667036 = (MATCH) weight(title:journal^10.0 in 0), product of:
>>         0.005943145 = queryWeight(title:journal^10.0), product of:
>>           10.0 = boost
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           9.996294E-4 = queryNorm
>>         0.29726744 = (MATCH) fieldWeight(title:journal in 0), product of:
>>           1.0 = tf(termFreq(title:journal)=1)
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           0.5 = fieldNorm(field=title, doc=0)
>>     0.0017667036 = (MATCH) max of:
>>       0.0017667036 = (MATCH) weight(title:of^10.0 in 0), product of:
>>         0.005943145 = queryWeight(title:of^10.0), product of:
>>           10.0 = boost
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           9.996294E-4 = queryNorm
>>         0.29726744 = (MATCH) fieldWeight(title:of in 0), product of:
>>           1.0 = tf(termFreq(title:of)=1)
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           0.5 = fieldNorm(field=title, doc=0)
>>     0.0017667036 = (MATCH) max of:
>>       0.0017667036 = (MATCH) weight(title:neurology^10.0 in 0), product of:
>>         0.005943145 = queryWeight(title:neurology^10.0), product of:
>>           10.0 = boost
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           9.996294E-4 = queryNorm
>>         0.29726744 = (MATCH) fieldWeight(title:neurology in 0), product of:
>>           1.0 = tf(termFreq(title:neurology)=1)
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           0.5 = fieldNorm(field=title, doc=0)
>>     0.0031800664 = (MATCH) max of:
>>       0.0031800664 = (MATCH) weight(title:"journal of neurology"~2^2.0 in 0), product of:
>>         0.0035658872 = queryWeight(title:"journal of neurology"~2^2.0), product of:
>>           2.0 = boost
>>           1.7836046 = idf(title: journal=2 of=2 neurology=2)
>>           9.996294E-4 = queryNorm
>>         0.8918023 = fieldWeight(title:"journal of neurology" in 0), product of:
>>           1.0 = tf(phraseFreq=1.0)
>>           1.7836046 = idf(title: journal=2 of=2 neurology=2)
>>           0.5 = fieldNorm(field=title, doc=0)
>>   0.99862975 = (MATCH) sum of:
>>     0.99862975 = (MATCH) MatchAllDocsQuery, product of:
>>       0.99862975 = queryNorm
>> </str>
>> <str name="ejn01:954925518616">
>> 1.0071099 = (MATCH) sum of:
>>   0.0053001107 = (MATCH) sum of:
>>     0.0017667036 = (MATCH) max of:
>>       0.0017667036 = (MATCH) weight(title:journal^10.0 in 1), product of:
>>         0.005943145 = queryWeight(title:journal^10.0), product of:
>>           10.0 = boost
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           9.996294E-4 = queryNorm
>>         0.29726744 = (MATCH) fieldWeight(title:journal in 1), product of:
>>           1.0 = tf(termFreq(title:journal)=1)
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           0.5 = fieldNorm(field=title, doc=1)
>>     0.0017667036 = (MATCH) max of:
>>       0.0017667036 = (MATCH) weight(title:of^10.0 in 1), product of:
>>         0.005943145 = queryWeight(title:of^10.0), product of:
>>           10.0 = boost
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           9.996294E-4 = queryNorm
>>         0.29726744 = (MATCH) fieldWeight(title:of in 1), product of:
>>           1.0 = tf(termFreq(title:of)=1)
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           0.5 = fieldNorm(field=title, doc=1)
>>     0.0017667036 = (MATCH) max of:
>>       0.0017667036 = (MATCH) weight(title:neurology^10.0 in 1), product of:
>>         0.005943145 = queryWeight(title:neurology^10.0), product of:
>>           10.0 = boost
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           9.996294E-4 = queryNorm
>>         0.29726744 = (MATCH) fieldWeight(title:neurology in 1), product of:
>>           1.0 = tf(termFreq(title:neurology)=1)
>>           0.5945349 = idf(docFreq=2, maxDocs=2)
>>           0.5 = fieldNorm(field=title, doc=1)
>>     0.0031800664 = (MATCH) max of:
>>       0.0031800664 = (MATCH) weight(title:"journal of neurology"~2^2.0 in 1), product of:
>>         0.0035658872 = queryWeight(title:"journal of neurology"~2^2.0), product of:
>>           2.0 = boost
>>           1.7836046 = idf(title: journal=2 of=2 neurology=2)
>>           9.996294E-4 = queryNorm
>>         0.8918023 = fieldWeight(title:"journal of neurology" in 1), product of:
>>           1.0 = tf(phraseFreq=1.0)
>>           1.7836046 = idf(title: journal=2 of=2 neurology=2)
>>           0.5 = fieldNorm(field=title, doc=1)
>>   0.99862975 = (MATCH) sum of:
>>     0.99862975 = (MATCH) MatchAllDocsQuery, product of:
>>       0.99862975 = queryNorm
>> </str>
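[Editor's note: both explain trees above report fieldNorm(field=title)=0.5 even though the titles have 3 and 4 terms, i.e. raw length norms of 1/sqrt(3) ≈ 0.577 and 1/sqrt(4) = 0.5. That is the single-byte precision loss the thread is circling around, and it can be reproduced outside Lucene. Below is a small Python port, for illustration only (Lucene itself is Java), of the SmallFloat.floatToByte315 / byte315ToFloat logic linked earlier in the thread.]

```python
# Python port of Lucene's SmallFloat.floatToByte315 / byte315ToFloat,
# which DefaultSimilarity uses to squeeze the length norm into one byte.
import math
import struct

def float_to_byte315(f):
    # Raw IEEE-754 bits of the 32-bit float, as a signed int
    # (equivalent to Java's Float.floatToRawIntBits).
    bits = struct.unpack('>i', struct.pack('>f', f))[0]
    smallfloat = bits >> (24 - 3)              # drop low mantissa bits
    if smallfloat <= (63 - 15) << 3:
        return 0 if bits <= 0 else 1           # underflow: smallest positive
    if smallfloat >= ((63 - 15) << 3) + 0x100:
        return 0xFF                            # overflow: saturate
    return smallfloat - ((63 - 15) << 3)

def byte315_to_float(b):
    if b == 0:
        return 0.0
    bits = (b & 0xFF) << (24 - 3)
    bits += (63 - 15) << 24
    return struct.unpack('>f', struct.pack('>i', bits))[0]

# DefaultSimilarity's lengthNorm is 1/sqrt(numTerms):
norm3 = 1 / math.sqrt(3)  # "Journal of neurology"          -> 0.577...
norm4 = 1 / math.sqrt(4)  # "Journal of neurology research" -> 0.5

b3, b4 = float_to_byte315(norm3), float_to_byte315(norm4)
print(b3, b4)              # -> 120 120 (same byte for both)
print(byte315_to_float(b3))  # -> 0.5
```

Because the stored byte is a tiny float with only a few mantissa bits, every raw norm in [0.5, 0.625) encodes to the same byte and decodes back to exactly 0.5, so 3-term and 4-term titles become indistinguishable after indexing.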