[
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702966#comment-16702966
]
Jan Høydahl commented on LUCENE-8563:
-------------------------------------
I think it would be a far better approach to create a new Similarity with a
distinct name (NewBM25Similarity, CleanBM25Similarity, SimplifiedBM25Similarity
or similar) for this, so Lucene users can explicitly make an informed choice,
instead of changing the implementation of the existing class. Then this issue
would not need to touch any Solr code whatsoever.
If for some reason that is not possible, I think this is a classic example of a
use case for a luceneMatchVersion conditional in Solr. If so, please create a new
8.0 *blocker* SOLR Jira issue about completing the Solr side of things.
> Remove k1+1 from the numerator of BM25Similarity
> -------------------------------------------------
>
> Key: LUCENE-8563
> URL: https://issues.apache.org/jira/browse/LUCENE-8563
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify
> ordering. It is often omitted, and I found out that the paper "The Probabilistic
> Relevance Framework: BM25 and Beyond" by Robertson (BM25's author) and
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
> numerator of the saturation function. This is the same for all
> terms, and therefore does not affect the ranking produced.
> The reason for including it was to make the final formula
> more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side effect that I'm interested in is that integrating other score
> contributions (e.g. via oal.document.FeatureField) would be a bit easier to
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%)
> rather than a term whose IDF is 3/(k1 + 1).
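
The two claims in the quoted description (a constant (k1+1) factor leaves the
ordering unchanged, and an IDF of 3 corresponds to a docFreq of roughly 5%) can
be checked with a small standalone sketch. It does not use Lucene. The class
name, the sample tf/norm values, k1 = 1.2 and the IDF formula
ln(1 + (N - n + 0.5) / (n + 0.5)) are assumptions made for illustration and are
not taken from the issue.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical standalone sketch; it does not call any Lucene API.
public class Bm25NumeratorSketch {

    // Score as written in the issue: boost * IDF * (k1+1) * tf / (tf + norm).
    static double scoreWithK1PlusOne(double boost, double idf, double k1, double tf, double norm) {
        return boost * idf * (k1 + 1) * tf / (tf + norm);
    }

    // Proposed variant with (k1+1) dropped from the numerator.
    static double scoreWithoutK1PlusOne(double boost, double idf, double tf, double norm) {
        return boost * idf * tf / (tf + norm);
    }

    // Assumed IDF formula: ln(1 + (N - n + 0.5) / (n + 0.5)).
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Returns document indices ordered by descending score.
    static Integer[] rank(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));
        return order;
    }

    public static void main(String[] args) {
        double k1 = 1.2;    // assumed value, used only for this illustration
        double boost = 1.0;

        // A term present in 50 of 1000 documents, i.e. docFreq = 5%.
        double termIdf = idf(50, 1000);

        // Made-up per-document term frequencies and length norms.
        double[] tf = {1, 3, 2, 7};
        double[] norm = {0.9, 1.6, 1.2, 2.4};

        double[] with = new double[tf.length];
        double[] without = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            with[i] = scoreWithK1PlusOne(boost, termIdf, k1, tf[i], norm[i]);
            without[i] = scoreWithoutK1PlusOne(boost, termIdf, tf[i], norm[i]);
        }

        System.out.println("IDF at docFreq = 5%:      " + termIdf);
        System.out.println("Ranking with (k1+1):      " + Arrays.toString(rank(with)));
        System.out.println("Ranking without (k1+1):   " + Arrays.toString(rank(without)));
        System.out.println("Same ordering:            " + Arrays.equals(rank(with), rank(without)));
    }
}
```

Every score in the first variant is exactly (k1+1) times the matching score in
the second, so the two rankings agree. Without the factor, the term's
contribution saturates towards its IDF (about 3 here) rather than towards
IDF * (k1+1), which is the comparison with a FeatureField weight of 3 that the
description draws.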