[ 
https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834455#action_12834455
 ] 

Joaquin Perez-Iglesias commented on LUCENE-2091:
------------------------------------------------

It is a consequence of the logarithm: you can get negative numbers, and a 
negative score doesn't make much sense. As far as I know this version of IDF 
is pretty theoretical and based on the binary independence model (BIR), which 
transforms the product of probabilities into a summation of logarithms. Anyway, 
it is quite usual to add 1 to the value before applying the logarithm 
to avoid situations like the previous one.
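
To make this concrete, here is a minimal sketch comparing the two forms of IDF. It is not taken from the patch; the class and method names are hypothetical, and it assumes the usual BM25 formula log((N - n + 0.5) / (n + 0.5)), where N is the number of documents and n is the document frequency of the term.

```java
final class Bm25IdfSketch {

    // Classic BM25 IDF: log((N - n + 0.5) / (n + 0.5)).
    // The ratio drops below 1 once the term occurs in more than half
    // of the documents, so the logarithm becomes negative.
    public static double classicIdf(long docFreq, long numDocs) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Variant with 1 added before taking the logarithm.
    // For 0 <= docFreq <= numDocs the ratio is positive, so the
    // argument is greater than 1 and the result is always positive.
    public static double plusOneIdf(long docFreq, long numDocs) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }
}
```

For a term that appears in 80 of 100 documents, the ratio is 20.5 / 80.5, so classicIdf(80, 100) is negative while plusOneIdf(80, 100) stays positive.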

In my opinion it should be added to the patch. It doesn't hurt but it helps :-)

This is clearly explained on Wikipedia: 
http://en.wikipedia.org/wiki/Okapi_BM25

A quote from Wikipedia:
{quote}
Please note that the above formula for IDF shows potentially major drawbacks 
when using it for terms appearing in more than half of the corpus documents. 
These terms' IDF is negative, so for any two almost-identical documents, one 
which contains the term and one which does not contain it, the latter will 
possibly get a larger score. This means that terms appearing in more than half 
of the corpus will provide negative contributions to the final document score. 
This is often an undesirable behavior, so many real-world applications would 
deal with this IDF formula in a different way:

    * Each summand can be given a floor of 0, to trim out common terms;
    * The IDF function *can be given a floor of a constant ε,* to avoid common 
terms being ignored at all;
    * The IDF function can be replaced with a similarly shaped one which is 
non-negative, or strictly positive to avoid terms being ignored at all.

{quote}
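
The first two options in the quote could look roughly like the sketch below. This is only an illustration under my own assumptions, not code from the patch: the class name, method names and the epsilon parameter are hypothetical. The floor is applied to the IDF itself, which has the same effect as flooring each summand at 0 as long as the term-frequency part of the summand is non-negative.

```java
final class IdfFloorSketch {

    // Classic BM25 IDF, which can be negative for very common terms.
    public static double classicIdf(long docFreq, long numDocs) {
        return Math.log((numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Option 1: floor of 0, so terms occurring in more than half of
    // the documents contribute nothing instead of a negative amount.
    public static double flooredAtZero(long docFreq, long numDocs) {
        return Math.max(0.0, classicIdf(docFreq, numDocs));
    }

    // Option 2: floor of a small positive constant epsilon, so common
    // terms still contribute a little and are not ignored entirely.
    public static double flooredAtEpsilon(long docFreq, long numDocs, double epsilon) {
        return Math.max(epsilon, classicIdf(docFreq, numDocs));
    }
}
```

The third option in the quote, replacing the function with a similarly shaped non-negative one, corresponds to adding 1 before applying the logarithm as mentioned above.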

> Add BM25 Scoring to Lucene
> --------------------------
>
>                 Key: LUCENE-2091
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2091
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Yuval Feinstein
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of 
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed 
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime 
> somewhat.
> I would like to contribute the code to Lucene under contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


