[ https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16212772#comment-16212772 ]
Robert Muir commented on LUCENE-8000:
-------------------------------------

Not sure how intuitive it is; I guess it kind of is if you think about it on a case-by-case basis. Some examples:

* WDF (WordDelimiterFilter) splitting up "wi-fi": if those synonyms count towards the doc's length, then we punish the doc because the author wrote a hyphen (vs. writing "wi fi").
* If you have 1000 synonyms for hamburger and those count towards the length, then we punish a doc because the author wrote hamburger (versus writing "pizza").

Note that punishing a doc unfairly here punishes it for all queries. If I search on "joker", why should one doc get a very low ranking for that term just because the doc also happens to mention "hamburger" instead of "pizza"? In this case we have skewed length normalization in such a way that it doesn't properly reflect verbosity.

> Document Length Normalization in BM25Similarity correct?
> --------------------------------------------------------
>
>                 Key: LUCENE-8000
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8000
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Christoph Goller
>            Priority: Minor
>
> The length of an individual document only counts the number of positions
> in the document, since discountOverlaps defaults to true.
> {code}
>   @Override
>   public final long computeNorm(FieldInvertState state) {
>     // With discountOverlaps, tokens stacked on an existing position are not counted.
>     final int numTerms = discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
>     int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
>     if (indexCreatedVersionMajor >= 7) {
>       return SmallFloat.intToByte4(numTerms);
>     } else {
>       return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
>     }
>   }
> {code}
> Measuring document length this way seems perfectly OK to me. What bothers
> me is that the average document length is based on sumTotalTermFreq for a
> field. As far as I understand, that sums up the totalTermFreqs of all terms
> of a field, therefore counting positions of terms including those that overlap.
> {code}
>   protected float avgFieldLength(CollectionStatistics collectionStats) {
>     // sumTotalTermFreq is summed over all terms of the field.
>     final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
>     if (sumTotalTermFreq <= 0) {
>       return 1f;       // field does not exist, or stat is unsupported
>     } else {
>       final long docCount = collectionStats.docCount() == -1 ? collectionStats.maxDoc() : collectionStats.docCount();
>       return (float) (sumTotalTermFreq / (double) docCount);
>     }
>   }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious
> effect. It just means that documents that have synonyms (or, in my use case,
> different normal forms of tokens at the same position) are treated as shorter
> than the average suggests and therefore get higher scores than they should,
> and that we do not use the whole spectrum of relative document length in BM25.
> I think for BM25 discountOverlaps should default to false.
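
To make the two positions above concrete, here is a minimal, self-contained sketch in plain Java (no Lucene dependency). It assumes, as the issue description suggests, that sumTotalTermFreq counts overlapping tokens. The class name `Bm25LengthDemo`, the helper names, and the two-document corpus are hypothetical and chosen only for illustration; `tfFactor` is the textbook BM25 term-frequency factor with the idf part left out, not a copy of Lucene's scorer.

```java
public class Bm25LengthDemo {

    // Mirrors the numTerms expression in the quoted computeNorm():
    // length counts every token, numOverlap counts tokens stacked on an existing position.
    static int docLength(int length, int numOverlap, boolean discountOverlaps) {
        return discountOverlaps ? length - numOverlap : length;
    }

    // Mirrors the quoted avgFieldLength(): the sum is divided by the document count.
    static double avgFieldLength(long sumTotalTermFreq, long docCount) {
        return sumTotalTermFreq <= 0 ? 1.0 : sumTotalTermFreq / (double) docCount;
    }

    // Textbook BM25 term-frequency factor (idf omitted); illustrative only.
    static double tfFactor(double tf, double docLen, double avgLen, double k1, double b) {
        return tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgLen));
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75; // commonly used BM25 parameter values

        // Hypothetical corpus:
        //   doc A: 10 tokens, no overlaps
        //   doc B: 20 tokens, 10 of them synonyms stacked on existing positions
        int lenA = 10, overlapA = 0;
        int lenB = 20, overlapB = 10;

        // Assumption: the sum includes the overlapping tokens.
        double avg = avgFieldLength(lenA + lenB, 2);

        for (boolean discount : new boolean[] {true, false}) {
            int dlA = docLength(lenA, overlapA, discount);
            int dlB = docLength(lenB, overlapB, discount);
            System.out.printf(
                "discountOverlaps=%b avg=%.1f dlA=%d dlB=%d tfA=%.3f tfB=%.3f%n",
                discount, avg, dlA, dlB,
                tfFactor(1, dlA, avg, k1, b),
                tfFactor(1, dlB, avg, k1, b));
        }
    }
}
```

Under these assumptions, with discountOverlaps=true both documents have length 10 against an average of 15, so every document looks shorter than average (the reporter's apples-and-oranges concern). With discountOverlaps=false, doc B has length 20 versus 10 for doc A and scores lower for any term with the same frequency, purely because of its synonyms (the concern raised in the comment).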