[
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213255#comment-16213255
]
Robert Muir commented on LUCENE-8000:
-------------------------------------
{quote}
Robert Muir thanks for the further explanation. That helped clarify. It does
seem the effect would be minor at best. It'd be an interesting experiment at
some point, though. If I ever get to trying it, I'll post back.
{quote}
Thanks Timothy! Maybe if you get the chance to do the experiment, simply
override the method {{protected float avgFieldLength(CollectionStatistics
collectionStats)}} to return the alternative value. For experiments it can just
be a hardcoded number you computed yourself in a different way.
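For illustration, here is a minimal sketch of what such an experiment would measure. It is a plain-Java model, not Lucene code: the class AvgLengthSketch, its methods, and the two toy documents are made up for this example, and it ignores the lossy norm encoding done in computeNorm. It scores the same document twice, once against an average length that counts overlapping tokens (analogous to sumTotalTermFreq / docCount) and once against an average that discounts them (the kind of hardcoded alternative mentioned above).

```java
/**
 * Illustrative model only; this is not Lucene code. Each document is an array
 * of position increments, where 0 means the token is stacked on the previous
 * position (for example a synonym).
 */
final class AvgLengthSketch {

    /** Counts tokens, skipping stacked ones when discountOverlaps is true. */
    static int docLength(int[] positionIncrements, boolean discountOverlaps) {
        int length = 0;
        for (int increment : positionIncrements) {
            if (discountOverlaps && increment == 0) {
                continue;
            }
            length++;
        }
        return length;
    }

    /** Average document length under the chosen counting rule. */
    static double averageLength(int[][] docs, boolean discountOverlaps) {
        long total = 0;
        for (int[] doc : docs) {
            total += docLength(doc, discountOverlaps);
        }
        return total / (double) docs.length;
    }

    /** Classic BM25 term-frequency component (idf omitted). */
    static double tfComponent(double freq, int docLength, double avgLength,
                              double k1, double b) {
        double norm = k1 * (1 - b + b * docLength / avgLength);
        return freq * (k1 + 1) / (freq + norm);
    }

    public static void main(String[] args) {
        int[][] docs = {
            {1, 0, 1, 0, 1, 0}, // three positions, each with a stacked synonym
            {1, 1, 1, 1}        // four positions, no overlaps
        };
        // Document length with overlaps discounted, as with discountOverlaps = true.
        int length = docLength(docs[0], true);
        // Average that counts every token, overlaps included.
        double avgCountingOverlaps = averageLength(docs, false);
        // Alternative average that discounts overlaps, consistent with the norm.
        double avgDiscounted = averageLength(docs, true);

        System.out.println("mixed:      "
            + tfComponent(1, length, avgCountingOverlaps, 1.2, 0.75));
        System.out.println("consistent: "
            + tfComponent(1, length, avgDiscounted, 1.2, 0.75));
    }
}
```

In this toy setup the "mixed" score comes out higher than the "consistent" one, which is the direction of the effect described in the issue. How large the difference is on a real index is exactly what the experiment would have to show.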
> Document Length Normalization in BM25Similarity correct?
> --------------------------------------------------------
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Christoph Goller
> Priority: Minor
>
> The length of an individual document counts only its positions, not
> overlapping tokens, since discountOverlaps defaults to true.
> {code}
>   @Override
>   public final long computeNorm(FieldInvertState state) {
>     // With discountOverlaps, tokens with position increment 0 are not counted.
>     final int numTerms = discountOverlaps
>         ? state.getLength() - state.getNumOverlap()
>         : state.getLength();
>     int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
>     if (indexCreatedVersionMajor >= 7) {
>       return SmallFloat.intToByte4(numTerms);
>     } else {
>       return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
>     }
>   }
> {code}
> Measuring document length this way seems perfectly OK to me. What bothers me
> is that average document length is based on sumTotalTermFreq for a field. As
> far as I understand, that sums up totalTermFreq over all terms of the field,
> and therefore counts term occurrences including those at overlapping
> positions.
> {code}
>   protected float avgFieldLength(CollectionStatistics collectionStats) {
>     final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
>     if (sumTotalTermFreq <= 0) {
>       return 1f; // field does not exist, or stat is unsupported
>     } else {
>       final long docCount = collectionStats.docCount() == -1
>           ? collectionStats.maxDoc()
>           : collectionStats.docCount();
>       return (float) (sumTotalTermFreq / (double) docCount);
>     }
>   }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious
> effect. It just means that documents that have synonyms, or in my use case
> different normal forms of tokens at the same position, are treated as shorter
> and therefore get higher scores than they should, and that we do not use the
> whole spectrum of relative document length of BM25.
> I think for BM25 discountOverlaps should default to false.