[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834455#action_12834455 ]
Joaquin Perez-Iglesias commented on LUCENE-2091:
------------------------------------------------

It is a consequence of the logarithm: you can get negative numbers, and a negative score doesn't make much sense. As far as I know, this version of IDF is fairly theoretical and is based on the binary independence model (BIR), which transforms a product of probabilities into a summation of logarithms.

Anyway, it is quite common to add 1 to the final result before applying the logarithm, to avoid situations like the previous one. In my opinion it should be added to the patch. It doesn't hurt, but it helps :-)

This is clearly explained on Wikipedia: http://en.wikipedia.org/wiki/Okapi_BM25

Just a quote from Wikipedia:

{quote}
Please note that the above formula for IDF shows potentially major drawbacks when using it for terms appearing in more than half of the corpus documents. These terms' IDF is negative, so for any two almost-identical documents, one which contains the term and one which does not contain it, the latter will possibly get a larger score. This means that terms appearing in more than half of the corpus will provide negative contributions to the final document score. This is often an undesirable behavior, so many real-world applications would deal with this IDF formula in a different way:
* Each summand can be given a floor of 0, to trim out common terms;
* The IDF function *can be given a floor of a constant ε,* to avoid common terms being ignored at all;
* The IDF function can be replaced with a similarly shaped one which is non-negative, or strictly positive to avoid terms being ignored at all.
{quote}

> Add BM25 Scoring to Lucene
> --------------------------
>
> Key: LUCENE-2091
> URL: https://issues.apache.org/jira/browse/LUCENE-2091
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/*
> Reporter: Yuval Feinstein
> Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2091.patch, persianlucene.jpg
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime
> somewhat.
> I would like to contribute the code to Lucene under contrib.
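
For readers who want to see the effect discussed in the comment, here is a minimal sketch in Python. It is not taken from the patch: the function names (`idf_bm25`, `idf_plus_one`, `idf_floored`) and the example corpus numbers are made up for illustration. It also assumes that "adding 1 before applying the logarithm" means adding 1 to the ratio inside the log, which is one common reading.

```python
import math


def idf_bm25(n_docs: int, doc_freq: int) -> float:
    """Classic BM25 IDF as given on the Wikipedia page.

    Negative whenever the term occurs in more than half of the documents.
    """
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_plus_one(n_docs: int, doc_freq: int) -> float:
    """Assumed variant: add 1 to the ratio before taking the logarithm.

    The argument of the log is then always greater than 1,
    so the result is strictly positive.
    """
    return math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_floored(n_docs: int, doc_freq: int, epsilon: float = 0.0) -> float:
    """Alternative from the quoted text: floor the IDF at a constant epsilon."""
    return max(epsilon, idf_bm25(n_docs, doc_freq))


if __name__ == "__main__":
    n_docs = 1000  # hypothetical corpus size
    for doc_freq in (10, 500, 900):
        print(
            doc_freq,
            round(idf_bm25(n_docs, doc_freq), 4),
            round(idf_plus_one(n_docs, doc_freq), 4),
            round(idf_floored(n_docs, doc_freq), 4),
        )
```

With a term that appears in 900 of 1000 documents, the classic formula goes negative, while both alternatives stay non-negative. The plus-one variant still lets very common terms contribute a small positive amount, whereas a floor of 0 ignores them entirely.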