[jira] Commented: (LUCENE-2091) Add BM25 Scoring to Lucene

Joaquin Perez-Iglesias (JIRA) Thu, 03 Dec 2009 13:34:50 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785538#action_12785538
 ]


Joaquin Perez-Iglesias commented on LUCENE-2091:
------------------------------------------------

Hi everybody,

I'm going to try to answer some of your questions. When I started to develop 
this library I didn't want to modify the Lucene code; moreover, I tried to 
create a jar that could be added directly to the official Lucene 
distribution. That is the main reason why there are some duplicated classes.
So yes, a tighter integration would be better, and I believe it would give us 
better support for different query types.

Regarding BM25 versus BM25F: they are equivalent. BM25F is the version for 
more than one field, so yes, go for BM25F.
What is really important is the way boost factors are applied. As you can 
see in the equation, they must be applied to raw frequencies and not to 
normalized or saturated frequencies.
(Currently Lucene applies them after normalization and saturation of 
frequencies, which in my opinion is not the best approach.)
A more detailed explanation of BM25F and this issue can be found in this paper:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.5255
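
To make the point about boost placement concrete, here is a minimal sketch 
(not code from the library or from Lucene). It assumes the saturation 
function weight / (k1 + weight); the function names and the values of k1 and 
the boost are hypothetical and chosen only for illustration.

```python
def saturate(weight, k1):
    """Saturating function weight / (k1 + weight); always below 1 for k1 > 0."""
    return weight / (k1 + weight)


def boost_before_saturation(tf, boost, k1):
    """Boost applied to the raw frequency, then saturated."""
    return saturate(boost * tf, k1)


def boost_after_saturation(tf, boost, k1):
    """Frequency saturated first, then multiplied by the boost."""
    return boost * saturate(tf, k1)


if __name__ == "__main__":
    # Illustrative values only: k1 = 2.0 and boost = 5.0 are assumptions.
    for tf in (1, 10, 100):
        before = boost_before_saturation(tf, 5.0, 2.0)
        after = boost_after_saturation(tf, 5.0, 2.0)
        print(tf, round(before, 4), round(after, 4))
```

With the boost applied before saturation the term weight stays below 1, so a 
boosted field cannot dominate without limit; applied afterwards, the weight 
simply scales linearly with the boost.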

The problem, as I said, comes from IDF. In the BM25 family of equations, IDF 
is always computed at the document level (that is why I recommend, as a 
heuristic, using the field with the most terms, or a special field that 
contains all the terms). As far as I know this is a problem because Lucene 
stores the document frequency per field rather than for the document as a 
whole.
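
The difference between the two kinds of document frequency can be sketched 
as follows. This is only an illustration over a toy in-memory collection, 
not Lucene's index API; the helper names and the sample documents are 
hypothetical, and the IDF shown is the usual Robertson/Sparck Jones form, 
which I am assuming is the one intended here.

```python
import math


def document_level_df(docs, term):
    """Number of documents containing `term` in at least one field."""
    return sum(
        1 for fields in docs
        if any(term in tokens for tokens in fields.values())
    )


def field_level_df(docs, term, field):
    """Number of documents containing `term` in one specific field."""
    return sum(1 for fields in docs if term in fields.get(field, ()))


def bm25_idf(df, num_docs):
    """log((N - df + 0.5) / (df + 0.5)); negative when df > N / 2."""
    return math.log((num_docs - df + 0.5) / (df + 0.5))


# Toy collection: each document maps a field name to its tokens.
DOCS = [
    {"title": ["lucene", "scoring"], "body": ["bm25", "ranking"]},
    {"title": ["bm25"], "body": ["okapi", "weights"]},
    {"title": ["search"], "body": ["lucene", "index"]},
    {"title": ["notes"], "body": ["misc"]},
]

if __name__ == "__main__":
    print(document_level_df(DOCS, "bm25"))        # over whole documents
    print(field_level_df(DOCS, "bm25", "title"))  # title field only
    print(field_level_df(DOCS, "bm25", "body"))   # body field only
```

A catch-all field containing every term of the document would make the 
per-field count coincide with the document-level count, which is the idea 
behind the heuristic above.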

Otis is right: as far as I know, just changing Similarity is not enough. Some 
data is not available to either TermScorer or Similarity, and TermScorer 
applies the values obtained from Similarity in a way that makes it 
incompatible with BM25.
It is really important to follow the steps in the order they appear in my 
explanation:

1. Normalize frequencies with document/field length and the b factor.
2. Saturate the effect of frequency with k1.
3. Compute the sum of the term weights.
4. Apply IDF.
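
The steps above can be sketched roughly like this. It is a minimal 
illustration under stated assumptions, not the code in the patch: the 
function names, the parameter layout (boost, b, avg_length per field) and 
the default k1 are hypothetical. Because IDF is a per-term value, the sketch 
multiplies each term's saturated weight by its IDF while accumulating the 
sum, which is one reading of steps 3 and 4.

```python
def normalized_tf(tf, length, avg_length, b):
    """Step 1: normalize a raw frequency by field length and the b factor."""
    return tf / ((1.0 - b) + b * length / avg_length)


def saturate(weight, k1):
    """Step 2: saturate the effect of frequency with k1."""
    return weight / (k1 + weight)


def bm25f_score(query_terms, doc, field_params, idf, k1=2.0):
    """Score one document.

    doc:          field name -> list of tokens
    field_params: field name -> {"boost": ..., "b": ..., "avg_length": ...}
    idf:          term -> document-level IDF value
    """
    score = 0.0
    for term in query_terms:
        weight = 0.0
        for field, tokens in doc.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            params = field_params[field]
            # The boost multiplies the frequency before any saturation.
            weight += params["boost"] * normalized_tf(
                tf, len(tokens), params["avg_length"], params["b"]
            )
        if weight > 0.0:
            # Steps 3 and 4: apply IDF per term and sum over query terms.
            score += idf.get(term, 0.0) * saturate(weight, k1)
    return score


if __name__ == "__main__":
    doc = {
        "title": ["bm25", "scoring"],
        "body": ["bm25", "bm25", "okapi", "ranking"],
    }
    params = {
        "title": {"boost": 2.0, "b": 0.75, "avg_length": 2.0},
        "body": {"boost": 1.0, "b": 0.75, "avg_length": 4.0},
    }
    print(bm25f_score(["bm25"], doc, params, {"bm25": 1.5}))
```

Note that nothing here multiplies an already saturated value by a boost, 
which is the part that differs from how TermScorer combines the values it 
gets from Similarity.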

I really believe that this can be done (not sure how), so maybe we will need 
the suggestions of some 'scorer guru'.

> Add BM25 Scoring to Lucene
> --------------------------
>
>                 Key: LUCENE-2091
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2091
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Yuval Feinstein
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of 
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed 
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime 
> somewhat.
> I would like to contribute the code to Lucene under contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
