Re: Doc length normalization in Lucene LM

Ahmet Arslan Fri, 22 Jul 2016 01:17:10 -0700

Hi Roy,

The normalization comes from storing the document length in a single byte
(to use less memory), which makes the stored value lossy. To use the raw
length, edit the source code to skip the encode/decode step:


/**
 * Encodes the document length in a lossless way
 */
@Override
public long computeNorm(FieldInvertState state) {
  return state.getLength() - state.getNumOverlap();
}

@Override
public float score(int doc, float freq) {
  // We have to supply something in case norms are omitted
  return ModelBase.this.score(stats, freq,
      norms == null ? 1L : norms.get(doc));
}

@Override
public Explanation explain(int doc, Explanation freq) {
  return ModelBase.this.explain(stats, doc, freq,
      norms == null ? 1L : norms.get(doc));
}
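
To see why the byte encoding matters, here is a rough standalone sketch of the round trip. It is not the Lucene source: the class name NormRoundTrip and its helpers are made up for illustration, the bit layout is my assumption of what SmallFloat.floatToByte315 does, and the boost is fixed at 1. It stores 1/sqrt(length) in one byte and then recovers an approximate length as 1/f^2.

```java
// Standalone sketch, NOT the Lucene source. All names here are hypothetical.
class NormRoundTrip {

    // Assumed equivalent of SmallFloat.floatToByte315: keeps only the top
    // bits of the IEEE float, so few values per power of two survive.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero/negative or underflow
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: largest representable value
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    // Assumed inverse of the encoding above.
    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Store 1 / sqrt(length) in a single byte (boost assumed to be 1).
    static byte encode(float length) {
        return floatToByte315(1.0f / (float) Math.sqrt(length));
    }

    // Recover an approximate length as 1 / f^2.
    static float decode(byte norm) {
        float f = byte315ToFloat(norm);
        return 1.0f / (f * f);
    }

    static float roundTrip(float length) {
        return decode(encode(length));
    }

    public static void main(String[] args) {
        // Print original length next to the length seen at scoring time.
        for (int len : new int[] {10, 84, 100, 110, 114, 1000}) {
            System.out.println(len + " -> " + roundTrip(len));
        }
    }
}
```

Under these assumptions, lengths 100 and 110 land in the same byte and both come back as roughly 113.8, while 114 falls into the next bucket. That is the quantization the computeNorm() change above avoids.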



On Thursday, July 21, 2016 6:06 PM, Dwaipayan Roy <dwaipayan....@gmail.com> wrote:



Hello,

In *SimilarityBase.java*, I can see that the length of the document is
normalized by the function *decodeNormValue()*, but I can't understand
how the normalization is done. Can you please help? Also, is there any
way to avoid this doc-length normalization and use the raw doc-length
instead (as used in LM-JM, Zhai et al., SIGIR 2001)?

Thanks..

P.S. I am using Lucene 4.10.4

