[ https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213255#comment-16213255 ]
Robert Muir commented on LUCENE-8000:
-------------------------------------

{quote}
Robert Muir thanks for the further explanation. That helped clarify. It does seem the effect would be minor at best. It'd be an interesting experiment at some point, though. If I ever get to trying it, I'll post back.
{quote}

Thanks Timothy! If you get the chance to run the experiment, simply override the method {{protected float avgFieldLength(CollectionStatistics collectionStats)}} to return the alternative value. For experiments it can just be a hardcoded number you computed yourself in a different way.

> Document Length Normalization in BM25Similarity correct?
> --------------------------------------------------------
>
>                 Key: LUCENE-8000
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8000
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Christoph Goller
>            Priority: Minor
>
> Length of individual documents only counts the number of positions of a
> document since discountOverlaps defaults to true.
> {code}
>   @Override
>   public final long computeNorm(FieldInvertState state) {
>     // With discountOverlaps, tokens that share a position are not counted.
>     final int numTerms = discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
>     int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
>     if (indexCreatedVersionMajor >= 7) {
>       return SmallFloat.intToByte4(numTerms);
>     } else {
>       return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
>     }
>   }
> {code}
> Measuring document length this way seems perfectly OK to me. What bothers
> me is that average document length is based on sumTotalTermFreq for a
> field. As far as I understand, that sums up totalTermFreqs for all terms
> of a field, therefore counting positions of terms including those that
> overlap.
> {code}
>   protected float avgFieldLength(CollectionStatistics collectionStats) {
>     // sumTotalTermFreq counts every term occurrence in the field.
>     final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
>     if (sumTotalTermFreq <= 0) {
>       return 1f; // field does not exist, or stat is unsupported
>     } else {
>       final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
>       return (float) (sumTotalTermFreq / (double) docCount);
>     }
>   }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious
> effect. It just means that documents that have synonyms, or in my use case
> different normal forms of tokens at the same position, are counted as
> shorter and therefore get higher scores than they should, and that we do
> not use the whole spectrum of relative document length of BM25.
> I think for BM25 discountOverlaps should default to false.
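
To make the apples-and-oranges concern in the description concrete, here is a minimal, self-contained sketch in plain Java with no Lucene dependency. It assumes the classic BM25 term-frequency formula with k1 = 1.2 and b = 0.75, and a hypothetical collection in which every document has 120 tokens, 20 of which overlap another token at the same position. The class name, method name, and numbers are illustrative only and are not taken from Lucene.

```java
class Bm25OverlapDemo {

    // Classic BM25 term-frequency component (idf omitted):
    //   freq * (k1 + 1) / (freq + k1 * (1 - b + b * docLen / avgLen))
    static double tfComponent(double freq, double docLen, double avgLen,
                              double k1, double b) {
        double norm = k1 * (1.0 - b + b * docLen / avgLen);
        return freq * (k1 + 1.0) / (freq + norm);
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        double b = 0.75;

        // Hypothetical collection: every document looks the same.
        int tokens = 120;                    // all tokens, overlaps included
        int overlaps = 20;                   // tokens sharing a position
        int positions = tokens - overlaps;   // length with overlaps discounted

        // Mixed: document length discounts overlaps, the average does not.
        double mixed = tfComponent(1, positions, tokens, k1, b);

        // Consistent: both lengths discount overlaps.
        double consistent = tfComponent(1, positions, positions, k1, b);

        System.out.println("mixed      = " + mixed);
        System.out.println("consistent = " + consistent);
    }
}
```

In this made-up collection the mixed computation treats an exactly average document as shorter than average, so its term-frequency component comes out slightly higher than in the consistent case. Whether the difference matters on real data is the open question in the issue.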