[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

Doug Turnbull (JIRA) Thu, 15 Nov 2018 09:12:11 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16688382#comment-16688382
 ]


Doug Turnbull commented on LUCENE-8563:
---------------------------------------

Thanks [~jpountz] - My feeling is that if Lucene has something called "BM25 
Similarity", it should match the traditional definition of BM25 and 
shouldn't be deprecated. But if we want to create a faster version and make it 
the default, I think that would be great.

Or, if you want, call the current implementation (what you call legacy) 
"ClassicBM25Similarity" instead of legacy...

I just don't feel it should be deprecated. As an IR person, I would be 
surprised if I was new to Lucene, looked up BM25 and it wasn't actually BM25...

> Remove k1+1 from the numerator of BM25Similarity
> ------------------------------------------------
>
>                 Key: LUCENE-8563
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8563
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify 
> ordering. It is often omitted, and I found out that the paper "The Probabilistic 
> Relevance Framework: BM25 and Beyond" by Robertson (BM25's author) and 
> Zaragoza even describes adding (k1+1) to the numerator as a variant whose 
> benefit is to be more comparable with Robertson/Sparck-Jones weighting, which 
> we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
>  numerator of the saturation function. This is the same for all
>  terms, and therefore does not affect the ranking produced.
>  The reason for including it was to make the final formula
>  more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score 
> contributions (e.g. via oal.document.FeatureField) would be a bit easier to 
> reason about. For instance, a weight of 3 in FeatureField#newSaturationQuery 
> would have a similar impact to a term whose IDF is 3 (and thus docFreq ~= 5%) 
> rather than a term whose IDF is 3/(k1 + 1).
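
The two claims in the description above can be checked with a small standalone sketch. It does not use Lucene; the class and method names (Bm25NumeratorSketch, scores, ranking, docFreqFractionForIdf) and the toy documents are made up for illustration. It assumes the defaults k1 = 1.2 and b = 0.75, norm = k1 * (1 - b + b * docLen / avgDocLen), and IDF = ln(1 + (N - n + 0.5) / (n + 0.5)).

```java
import java.util.Arrays;
import java.util.Comparator;

public class Bm25NumeratorSketch {
    static final double K1 = 1.2;  // assumed default k1
    static final double B = 0.75;  // assumed default b

    // Length normalization: k1 * (1 - b + b * docLen / avgDocLen).
    static double norm(double docLen, double avgDocLen) {
        return K1 * (1 - B + B * docLen / avgDocLen);
    }

    // IDF: ln(1 + (N - n + 0.5) / (n + 0.5)), n = docFreq, N = docCount.
    static double idf(double docFreq, double docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // boost * IDF * tf / (tf + norm), optionally multiplied by (k1 + 1).
    static double score(double boost, double idf, double tf, double norm,
                        boolean withK1Plus1) {
        double s = boost * idf * tf / (tf + norm);
        return withK1Plus1 ? s * (K1 + 1) : s;
    }

    // Scores toy documents given as {tf, docLen} pairs for a single term.
    static double[] scores(double[][] docs, double avgDocLen, double idf,
                           boolean withK1Plus1) {
        double[] out = new double[docs.length];
        for (int i = 0; i < docs.length; i++) {
            double n = norm(docs[i][1], avgDocLen);
            out[i] = score(1.0, idf, docs[i][0], n, withK1Plus1);
        }
        return out;
    }

    // Document indices ordered by descending score.
    static Integer[] ranking(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -scores[i]));
        return order;
    }

    // Ignoring the 0.5 smoothing terms, IDF = ln(N / n), so n / N = e^(-IDF).
    static double docFreqFractionForIdf(double idf) {
        return Math.exp(-idf);
    }

    public static void main(String[] args) {
        double[][] docs = {{1, 100}, {3, 100}, {3, 400}, {10, 1000}};
        double[] with = scores(docs, 400, 3.0, true);
        double[] without = scores(docs, 400, 3.0, false);
        System.out.println("ranking with (k1+1):    " + Arrays.toString(ranking(with)));
        System.out.println("ranking without (k1+1): " + Arrays.toString(ranking(without)));
        System.out.println("docFreq fraction for IDF=3: " + docFreqFractionForIdf(3.0));
    }
}
```

Both rankings come out identical because every score is scaled by the same constant (k1 + 1); the same holds for multi-term queries, where the summed score is scaled by that constant too. Without the factor, a term's contribution is bounded above by boost * IDF, which is why a saturation weight of 3 would line up with a term whose IDF is 3, i.e. one that appears in roughly e^-3, or about 5%, of documents.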



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

