[
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211227#comment-16211227
]
Timothy M. Rodriguez commented on LUCENE-8000:
----------------------------------------------
+1 for keeping the existing default of true. It definitely struck me as weird
too, but for many indexes flipping the default would result in markedly worse
behavior. Rather than disabling discountOverlaps, maybe the better behavior
would be to make the average document length equal to the total number of
positions across the collection divided by the number of documents. That way
we'd be comparing a document's position length to the average position length.
However, I haven't looked into the feasibility or expense of doing that. If we
were able to do that, discountOverlaps could become something like
countPositions vs. countFrequencies.
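
To make the two candidate averages concrete, here is a rough sketch. It is not
Lucene code: the class and method names are hypothetical, and the per-document
arrays only stand in for the values that FieldInvertState.getLength() and
FieldInvertState.getNumOverlap() would report at index time.

```java
// Hypothetical sketch, not Lucene code: all names here are made up for illustration.
final class AvgLengthSketch {

    /** Average based on every token, overlaps included (the sumTotalTermFreq view). */
    static double avgByFrequencies(int[] lengths) {
        long sum = 0;
        for (int len : lengths) {
            sum += len;
        }
        return (double) sum / lengths.length;
    }

    /** Average based on positions only: tokens stacked on an existing position are skipped. */
    static double avgByPositions(int[] lengths, int[] overlaps) {
        long sum = 0;
        for (int i = 0; i < lengths.length; i++) {
            sum += lengths[i] - overlaps[i];
        }
        return (double) sum / lengths.length;
    }

    public static void main(String[] args) {
        int[] lengths = {10, 20, 30};  // stand-in for getLength() per document
        int[] overlaps = {2, 5, 5};    // stand-in for getNumOverlap() per document
        System.out.println(avgByFrequencies(lengths));          // prints 20.0
        System.out.println(avgByPositions(lengths, overlaps));  // prints 16.0
    }
}
```

With discountOverlaps set to true, the per-document length corresponds to the
second method, while the average is currently derived from a statistic that
corresponds to the first.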
> Document Length Normalization in BM25Similarity correct?
> --------------------------------------------------------
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Christoph Goller
> Priority: Minor
>
> Length of individual documents only counts the number of positions of a
> document since discountOverlaps defaults to true.
> {quote}
>   @Override
>   public final long computeNorm(FieldInvertState state) {
>     final int numTerms = discountOverlaps
>         ? state.getLength() - state.getNumOverlap()
>         : state.getLength();
>     int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
>     if (indexCreatedVersionMajor >= 7) {
>       return SmallFloat.intToByte4(numTerms);
>     } else {
>       return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
>     }
>   }
> {quote}
> Measuring document length this way seems perfectly OK to me. What bothers
> me is that the average document length is based on sumTotalTermFreq for a
> field. As far as I understand, that sums up the totalTermFreq of every term
> in the field, and therefore counts all term occurrences, including those
> that overlap.
> {quote}
>   protected float avgFieldLength(CollectionStatistics collectionStats) {
>     final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
>     if (sumTotalTermFreq <= 0) {
>       return 1f; // field does not exist, or stat is unsupported
>     } else {
>       final long docCount = collectionStats.docCount() == -1
>           ? collectionStats.maxDoc()
>           : collectionStats.docCount();
>       return (float) (sumTotalTermFreq / (double) docCount);
>     }
>   }
> {quote}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious
> effect. It just means that documents that have synonyms or, in our case,
> different normal forms of tokens at the same position are treated as shorter
> and therefore get higher scores than they should, and that we do not use the
> whole spectrum of relative document length in BM25.
> I think for BM25 discountOverlaps should default to false.
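
The size of the effect described in the issue can be illustrated with the
standard BM25 term-frequency normalization, tf * (k1 + 1) / (tf + k1 * (1 - b +
b * dl / avgdl)), using the usual defaults k1 = 1.2 and b = 0.75. This is only
a sketch under assumed numbers: the class name is hypothetical, the idf factor
is left out, and the example supposes a collection in which every document has
100 tokens of which 20 are overlaps.

```java
// Hypothetical sketch, not Lucene code: illustrates mixing a position-based
// document length with a frequency-based average length.
final class Bm25LengthMismatchSketch {

    /** BM25 term-frequency component without the idf factor. */
    static double tfNorm(double tf, double docLen, double avgLen, double k1, double b) {
        return tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgLen));
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        double b = 0.75;

        // Assumed numbers: 100 tokens per document, 20 of them overlaps.
        double positionLength = 80;    // length with overlaps discounted
        double avgByFrequencies = 100; // average that counts overlaps
        double avgByPositions = 80;    // average that discounts overlaps

        // Document length and average measured in different units.
        double mixed = tfNorm(1, positionLength, avgByFrequencies, k1, b);
        // Document length and average measured in the same unit.
        double consistent = tfNorm(1, positionLength, avgByPositions, k1, b);

        // Every document is "average", yet the mixed variant scores it as shorter.
        System.out.println("mixed      = " + mixed);
        System.out.println("consistent = " + consistent);
    }
}
```

In this setup the mixed variant yields a higher value than the consistent one,
even though no document is actually shorter than average.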