[
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212772#comment-16212772
]
Robert Muir commented on LUCENE-8000:
-------------------------------------
I'm not sure how intuitive it is; I guess it kind of is if you think about it
on a case-by-case basis. Some examples:
* WDF (WordDelimiterFilter) splitting up "wi-fi": if those extra tokens count
towards the doc's length, then we punish the doc because the author wrote a
hyphen (vs. writing "wi fi").
* If you have 1000 synonyms for "hamburger" and those count towards the length,
then we punish a doc because the author wrote "hamburger" (versus writing
"pizza").
Note that punishing a doc unfairly here punishes it for all queries. If I
search on "joker", why should one doc get a very low ranking for that term just
because the doc also happens to mention "hamburger" instead of "pizza"? In this
case we have skewed length normalization in such a way that it doesn't properly
reflect verbosity.
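To make the "joker" example concrete, here is a minimal Java sketch of the length-dependent part of BM25. It assumes the textbook factor tf / (tf + k1 * (1 - b + b * dl / avgdl)) with the usual defaults k1 = 1.2 and b = 0.75. The class and method names, the document lengths, and the average length of 100 are made up for illustration and are not Lucene API.

```java
// Hypothetical sketch, not Lucene code: the length-dependent factor of BM25.
class LengthNormSketch {
  static final double K1 = 1.2;  // common BM25 default
  static final double B = 0.75;  // common BM25 default

  /** Term-frequency factor tf / (tf + k1 * (1 - b + b * dl / avgdl)). */
  static double tfFactor(double freq, double docLength, double avgLength) {
    double norm = K1 * (1 - B + B * docLength / avgLength);
    return freq / (freq + norm);
  }

  public static void main(String[] args) {
    double avgLength = 100; // assumed average field length
    // Same text in both cases, "joker" occurs once; only the counted length differs.
    double discounted = tfFactor(1, 100, avgLength);   // synonyms not counted
    double inflated = tfFactor(1, 1100, avgLength);    // 1000 synonyms counted
    System.out.printf("synonyms discounted: %.4f%n", discounted);
    System.out.printf("synonyms counted:    %.4f%n", inflated);
  }
}
```

The second value comes out several times smaller than the first even though the query term and the text the author wrote are identical, which is the skew described above.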
> Document Length Normalization in BM25Similarity correct?
> --------------------------------------------------------
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Christoph Goller
> Priority: Minor
>
> Length of individual documents only counts the number of positions of a
> document since discountOverlaps defaults to true.
> {code}
>   @Override
>   public final long computeNorm(FieldInvertState state) {
>     // Overlapping tokens (position increment 0) are subtracted when discountOverlaps is true.
>     final int numTerms = discountOverlaps
>         ? state.getLength() - state.getNumOverlap()
>         : state.getLength();
>     int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
>     if (indexCreatedVersionMajor >= 7) {
>       return SmallFloat.intToByte4(numTerms);
>     } else {
>       return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
>     }
>   }
> {code}
> Measuring document length this way seems perfectly OK to me. What bothers
> me is that average document length is based on sumTotalTermFreq for a field.
> As far as I understand, that sums up the totalTermFreq of every term of the
> field, and therefore counts positions of terms including those that overlap.
> {code}
>   protected float avgFieldLength(CollectionStatistics collectionStats) {
>     final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
>     if (sumTotalTermFreq <= 0) {
>       return 1f; // field does not exist, or stat is unsupported
>     } else {
>       final long docCount = collectionStats.docCount() == -1
>           ? collectionStats.maxDoc()
>           : collectionStats.docCount();
>       return (float) (sumTotalTermFreq / (double) docCount);
>     }
>   }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious
> effect. It just means that documents that have synonyms (or, in my use case,
> different normal forms of tokens) at the same position count as shorter and
> therefore get higher scores than they should, and that we do not use the
> whole spectrum of relative document length of BM25.
> I think for BM25 discountOverlaps should default to false.
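The mismatch described in the issue can be illustrated with a simplified model (a sketch, not Lucene code). Each document is represented as an array of position increments, where an increment of 0 marks a token stacked on the previous position, such as a synonym. The helpers docLength and avgFieldLength below are hypothetical and only mimic, under that assumption, what the two quoted snippets compute.

```java
// Simplified model, not Lucene code: per-document length vs. collection average.
class OverlapStatsSketch {
  /** Per-document length in the spirit of computeNorm: optionally subtract overlaps. */
  static int docLength(int[] posIncrements, boolean discountOverlaps) {
    int length = 0;
    int numOverlap = 0;
    for (int inc : posIncrements) {
      length++;                    // every token counts towards the raw length
      if (inc == 0) numOverlap++;  // token shares a position with the previous one
    }
    return discountOverlaps ? length - numOverlap : length;
  }

  /** Average length derived from the total token count, overlaps included. */
  static double avgFieldLength(int[][] docs) {
    long sumTotalTermFreq = 0;
    for (int[] doc : docs) sumTotalTermFreq += doc.length;
    return sumTotalTermFreq / (double) docs.length;
  }

  public static void main(String[] args) {
    // Two toy documents, each with four tokens, one of which is an overlap.
    int[][] docs = { {1, 1, 0, 1}, {1, 0, 1, 1} };
    for (int[] doc : docs) {
      System.out.println("doc length (overlaps discounted): " + docLength(doc, true));
    }
    System.out.println("average field length: " + avgFieldLength(docs));
  }
}
```

In this toy collection every document has 3 positions while the average comes out as 4.0, so each document looks shorter than average whenever overlaps are discounted on one side only.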