[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16688382#comment-16688382 ]
Doug Turnbull commented on LUCENE-8563:
---------------------------------------

Thanks [~jpountz] - My feeling is if Lucene has something called "BM25 Similarity" it should match the traditional definition of BM25, and shouldn't be deprecated. But if we want to create a faster version and make it the default, I think that would be great. Or if you want to call the current one (what you call legacy) "ClassicBM25Similarity" instead of legacy... I just don't feel it should be deprecated.

As an IR person, I would be surprised if I were new to Lucene, looked up BM25, and it wasn't actually BM25...

> Remove k1+1 from the numerator of BM25Similarity
> -------------------------------------------------
>
>                 Key: LUCENE-8563
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8563
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify
> ordering. It is often omitted and I found out that the "The Probabilistic
> Relevance Framework: BM25 and Beyond" paper by Robertson (BM25's author) and
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
> numerator of the saturation function. This is the same for all
> terms, and therefore does not affect the ranking produced.
> The reason for including it was to make the final formula
> more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score
> contributions (eg. via oal.document.FeatureField) would be a bit easier to
> reason about.
> For instance a weight of 3 in FeatureField#newSaturationQuery
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%)
> rather than a term whose IDF is 3/(k1 + 1).
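
The ordering argument in the issue description can be checked with a small standalone sketch. It does not use Lucene itself: the class name `Bm25NumeratorSketch` and all method names are hypothetical, the `norm` helper assumes the usual BM25 length normalization `k1 * (1 - b + b * docLen / avgDocLen)` (the issue only writes `norm`), and `idf` assumes the `ln(1 + (N - n + 0.5) / (n + 0.5))` form. The values `k1 = 1.2` and `b = 0.75` are the common defaults, and the documents are made up for illustration.

```java
import java.util.Arrays;

class Bm25NumeratorSketch {

    // Assumed idf form: ln(1 + (N - n + 0.5) / (n + 0.5)).
    public static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Assumed length normalization; the issue only calls this "norm".
    public static double norm(double k1, double b, double docLen, double avgDocLen) {
        return k1 * (1 - b + b * docLen / avgDocLen);
    }

    // Formula quoted in the issue: boost * IDF * (k1+1) * tf / (tf + norm).
    public static double legacyScore(double boost, double idf, double k1, double tf, double norm) {
        return boost * idf * (k1 + 1) * tf / (tf + norm);
    }

    // Proposed variant: the same formula without the (k1+1) factor.
    public static double proposedScore(double boost, double idf, double tf, double norm) {
        return boost * idf * tf / (tf + norm);
    }

    // Returns document indices sorted by descending score.
    public static Integer[] ranking(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (x, y) -> Double.compare(scores[y], scores[x]));
        return order;
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75, avgDocLen = 100;

        // Two query terms: one in ~5% of documents (idf close to 3), one in 20%.
        double[] idfs = { idf(50_000, 1_000_000), idf(200_000, 1_000_000) };

        // Made-up term frequencies (per document, per term) and document lengths.
        double[][] tfs = { {1, 0}, {3, 1}, {0, 4}, {2, 2} };
        double[] docLens = { 80, 150, 100, 60 };

        double[] legacy = new double[tfs.length];
        double[] proposed = new double[tfs.length];
        for (int d = 0; d < tfs.length; d++) {
            double n = norm(k1, b, docLens[d], avgDocLen);
            for (int t = 0; t < idfs.length; t++) {
                // A multi-term score is the sum of the per-term scores.
                legacy[d] += legacyScore(1.0, idfs[t], k1, tfs[d][t], n);
                proposed[d] += proposedScore(1.0, idfs[t], tfs[d][t], n);
            }
        }

        System.out.println("idf at docFreq 5%: " + idfs[0]);
        System.out.println("legacy ranking:    " + Arrays.toString(ranking(legacy)));
        System.out.println("proposed ranking:  " + Arrays.toString(ranking(proposed)));
        System.out.println("same ordering:     "
                + Arrays.equals(ranking(legacy), ranking(proposed)));
    }
}
```

Because every per-term score is multiplied by the same (k1+1), the summed scores differ only by that constant factor, so the two rankings agree (up to floating-point rounding on near-ties). What does change is the absolute scale, which is the point of the FeatureField remark: with k1 = 1.2, a weight of 3 would otherwise compare to a term whose IDF is 3/2.2, roughly 1.36.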