[jira] [Comment Edited] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

Christoph Goller (JIRA) Mon, 23 Oct 2017 01:18:26 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214805#comment-16214805
 ]


Christoph Goller edited comment on LUCENE-8000 at 10/23/17 8:17 AM:
--------------------------------------------------------------------

??As an additional point, advanced use cases often utilize token "stacking" for 
additional uses as well and these would have further distortions on length.??

That's exactly what we are doing. Therefore using discountOverlaps = false 
could penalize languages with a larger number of distinct word forms. I also 
prefer discountOverlaps = true. I have an intern (student) working on relevance 
tuning and benchmarks. I think we can try overriding 
{code:java}
protected float avgFieldLength(CollectionStatistics collectionStats)
{code}
 and see if it changes anything. We will also have a look at the Lucene 
benchmark module.

Thanks for your feedback.


> Document Length Normalization in BM25Similarity correct?
> --------------------------------------------------------
>
>                 Key: LUCENE-8000
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8000
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Christoph Goller
>            Priority: Minor
>
> Length of individual documents only counts the number of positions of a 
> document since discountOverlaps defaults to true.
> {code}
>   @Override
>   public final long computeNorm(FieldInvertState state) {
>     // With discountOverlaps, tokens with position increment 0 are not counted.
>     final int numTerms = discountOverlaps
>         ? state.getLength() - state.getNumOverlap()
>         : state.getLength();
>     int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
>     if (indexCreatedVersionMajor >= 7) {
>       return SmallFloat.intToByte4(numTerms);
>     } else {
>       return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
>     }
>   }
> {code}
> Measuring document length this way seems perfectly OK to me. What bothers 
> me is that the average document length is based on sumTotalTermFreq for a 
> field. As far as I understand, that sums up the totalTermFreq of all terms of 
> a field, and therefore counts every term occurrence, including those that 
> overlap.
> {code}
>   protected float avgFieldLength(CollectionStatistics collectionStats) {
>     final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
>     if (sumTotalTermFreq <= 0) {
>       return 1f;       // field does not exist, or stat is unsupported
>     } else {
>       final long docCount = collectionStats.docCount() == -1
>           ? collectionStats.maxDoc()
>           : collectionStats.docCount();
>       return (float) (sumTotalTermFreq / (double) docCount);
>     }
>   }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious 
> effect. It just means that documents that have synonyms or, in my use case, 
> different normal forms of tokens at the same position are treated as shorter 
> and therefore get higher scores than they should, and that we do not use the 
> whole spectrum of relative document length in BM25.
> I think for BM25 discountOverlaps should default to false.
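
To make the suspected mismatch concrete, here is a minimal, self-contained sketch. It does not use Lucene: the class and method names are hypothetical, the corpus is made up, and the lossy norm encoding (SmallFloat) is ignored. It only applies the BM25 length factor 1 - b + b * docLen / avgLen, once with the average computed from all tokens (as sumTotalTermFreq / docCount would) and once with the average computed from overlap-discounted lengths.

```java
// Hypothetical illustration only; none of these names exist in Lucene.
final class Bm25LengthMismatchDemo {

    /** Document length as computeNorm sees it when discountOverlaps is true. */
    static int discountedLength(int length, int numOverlap) {
        return length - numOverlap;
    }

    /** Average length in the style of sumTotalTermFreq / docCount: overlaps are counted. */
    static double avgWithOverlaps(int[] lengths) {
        long sum = 0;
        for (int len : lengths) {
            sum += len;
        }
        return sum / (double) lengths.length;
    }

    /** Average over discounted lengths, i.e. in the same unit as the per-document length. */
    static double avgDiscounted(int[] lengths, int[] numOverlaps) {
        long sum = 0;
        for (int i = 0; i < lengths.length; i++) {
            sum += discountedLength(lengths[i], numOverlaps[i]);
        }
        return sum / (double) lengths.length;
    }

    /** BM25 length factor; 1.0 means the document is treated as average length. */
    static double lengthFactor(double b, double docLen, double avgLen) {
        return 1.0 - b + b * docLen / avgLen;
    }

    public static void main(String[] args) {
        // Assumed corpus: three identical documents, each with 150 tokens,
        // 50 of which are stacked on existing positions (position increment 0).
        int[] lengths = {150, 150, 150};
        int[] numOverlaps = {50, 50, 50};
        double b = 0.75; // default b of BM25Similarity

        int docLen = discountedLength(lengths[0], numOverlaps[0]);
        System.out.println("mixed units:      "
            + lengthFactor(b, docLen, avgWithOverlaps(lengths)));
        System.out.println("consistent units: "
            + lengthFactor(b, docLen, avgDiscounted(lengths, numOverlaps)));
    }
}
```

In this made-up corpus every document has exactly average length, yet the mixed-unit factor comes out as 0.75 instead of 1.0, so every document is scored as if it were shorter than average. Whether the effect matters on real data would still have to be checked with benchmarks.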



