[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

Adrien Grand (JIRA) Wed, 14 Nov 2018 15:08:57 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687260#comment-16687260
 ]


Adrien Grand commented on LUCENE-8563:
--------------------------------------

bq. "assuming a single similarity" – is this something that we want to assume?

We can't indeed, even though this is the most common case. That said, if you are 
searching multiple fields at once today, then I'm afraid that relevance isn't 
very good anyway, as we don't support something like BM25F (LUCENE-8216) to 
merge index and document statistics (BlendedTermQuery merges index statistics, 
but not norms and term frequencies). By the way, BM25F doesn't allow 
configuring the value of k1 on a per-field basis; only b may have different 
per-field values.

bq. I'm sure this change would be appropriate for some scenarios, but it's a 
fundamental change that could in some cases have significant downstream 
consequences, with no easy way (as far as I can tell) to maintain existing 
behavior.

Users could multiply their per-field boosts by (k1+1)?
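
To illustrate the boost workaround, here is a minimal standalone sketch. It does not use Lucene's API: the class and method names are hypothetical, the numeric values are arbitrary, and norm is passed in as a plain number instead of being derived from document length. It only checks that multiplying the boost by (k1+1) under the proposed formula reproduces the score of the current formula.

```java
class Bm25BoostSketch {
    // Current formula as quoted in the issue: boost * IDF * (k1+1) * tf / (tf + norm)
    static double scoreWithK1(double boost, double idf, double k1, double tf, double norm) {
        return boost * idf * (k1 + 1) * tf / (tf + norm);
    }

    // Proposed formula: the (k1+1) factor is dropped from the numerator.
    static double scoreWithoutK1(double boost, double idf, double tf, double norm) {
        return boost * idf * tf / (tf + norm);
    }

    public static void main(String[] args) {
        // Arbitrary illustrative values, not taken from any real index.
        double k1 = 1.2, idf = 3.0, tf = 2.0, norm = 1.5, boost = 2.0;

        double before = scoreWithK1(boost, idf, k1, tf, norm);
        // Folding (k1+1) into the per-field boost compensates for the removal.
        double after = scoreWithoutK1(boost * (k1 + 1), idf, tf, norm);

        System.out.println("current formula:        " + before);
        System.out.println("proposed, scaled boost: " + after);
    }
}
```

The two values agree up to floating-point rounding, since the only difference is where the constant factor is applied.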

> Remove k1+1 from the numerator of  BM25Similarity
> -------------------------------------------------
>
>                 Key: LUCENE-8563
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8563
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify 
> ordering. It is often omitted, and I found out that the paper "The 
> Probabilistic Relevance Framework: BM25 and Beyond" by Robertson (BM25's 
> author) and Zaragoza even describes adding (k1+1) to the numerator as a 
> variant whose benefit is to be more comparable with Robertson/Sparck-Jones 
> weighting, which we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
>  numerator of the saturation function. This is the same for all
>  terms, and therefore does not affect the ranking produced.
>  The reason for including it was to make the final formula
>  more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side effect that I'm interested in is that integrating other score 
> contributions (e.g. via oal.document.FeatureField) would be a bit easier to 
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery 
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%) 
> rather than a term whose IDF is 3/(k1 + 1).
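
The FeatureField comparison above can be sketched with plain arithmetic. This is a standalone illustration, not Lucene code: the class and method names are hypothetical, the saturation function is assumed to have the form weight * S / (S + pivot), and the IDF is assumed to be ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). It compares the score ceilings reached as the feature value or term frequency grows large.

```java
class SaturationComparisonSketch {
    // Assumed shape of a saturation feature score: weight * S / (S + pivot).
    static double saturation(double weight, double value, double pivot) {
        return weight * value / (value + pivot);
    }

    // Term score with boost = 1; numeratorFactor is (k1+1) today and 1 under the proposal.
    static double termScore(double idf, double tf, double norm, double numeratorFactor) {
        return idf * numeratorFactor * tf / (tf + norm);
    }

    // Assumed IDF formula: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        double k1 = 1.2;
        double large = 1e9; // stands in for "very large" tf or feature value

        // A feature with weight 3 tops out near 3.
        System.out.println("feature ceiling:          " + saturation(3.0, large, 1.0));
        // Without (k1+1), a term with IDF 3 tops out near 3 as well.
        System.out.println("term ceiling, proposed:   " + termScore(3.0, large, 1.5, 1.0));
        // With (k1+1), the term needs IDF 3/(k1+1) to reach the same ceiling.
        System.out.println("term ceiling, current:    " + termScore(3.0 / (k1 + 1), large, 1.5, k1 + 1));
        // A term present in 5% of documents has an IDF close to 3.
        System.out.println("idf at docFreq = 5%:      " + idf(50_000, 1_000_000));
    }
}
```

Under these assumptions all three ceilings approach 3, which is the sense in which a feature weight and an IDF become directly comparable once (k1+1) is removed.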


