[
https://issues.apache.org/jira/browse/LUCENE-3934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241475#comment-13241475
]
Andrzej Bialecki commented on LUCENE-3934:
-------------------------------------------
Eh, it's even worse - the
[http://www.dc.fi.udc.es/~barreiro/publications/blanco_barreiro_ecir2007.pdf|paper]
that we used as a reference is buggy itself :) or at least misleading.
Formula 1 that supposedly gives the Robertson-Sparck-Jones normalization of idf
should really read (according to
[http://terrierteam.dcs.gla.ac.uk/publications/rtlo_DIRpaper.pdf|its authors]):
{code}
IDF = log ( ((D - df) + 0.5) / (df + 0.5) )
or: IDF = - log ( (df + 0.5) / ((D - df) + 0.5) )
{code}
As it's presented in the Blanco-Barreiro paper it would be invalid (for some
values the argument to log() would be negative).
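
For illustration only, here is a minimal Java sketch of the corrected Formula 1 as written above. The class and method names are made up for this comment (they are not from the pruning package), and the natural logarithm is an assumption, since the base is not stated here. With the 0.5 smoothing the argument to log() stays positive for any 0 <= df <= D, although the resulting IDF itself goes negative once df exceeds D/2.

```java
/** Hypothetical helper, not part of Lucene: smoothed Robertson-Sparck-Jones IDF. */
final class RsjIdfDemo {

    /**
     * IDF = log(((D - df) + 0.5) / (df + 0.5)).
     * Natural log is an assumption; another base only rescales the result.
     */
    static double rsjIdf(long df, long numDocs) {
        if (df < 0 || df > numDocs) {
            throw new IllegalArgumentException("df must be in [0, numDocs]");
        }
        // Numerator and denominator are both >= 0.5, so log() is always defined.
        return Math.log(((numDocs - df) + 0.5) / (df + 0.5));
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        // Rare terms get a positive IDF, terms in more than half the docs a negative one.
        for (long df : new long[] {1, 100, 500, 900, 1000}) {
            System.out.printf("df=%d idf=%.4f%n", df, rsjIdf(df, numDocs));
        }
    }
}
```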
At this point I wasn't sure about Formula 2 in Blanco-Barreiro. Going by the
definition, RIDF should be the difference between the observed IDF (that is,
the one calculated in Formula 1) and an expected estimate based on a Poisson
model, denoted as expIDF, whereas Formula 2 seemed different. After searching
the literature for a while I found
[http://www.cstr.ed.ac.uk/downloads/publications/2007/48920155.pdf|another
paper] by Murray-Renals where a formula for RIDF is presented clearly enough
for math-challenged people like me:
{code}
expIDF = - log ( 1 - e^(-totalFreq/D) )
RIDF = IDF - expIDF
{code}
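
A rough Java sketch of the expIDF part, assuming natural logs and using made-up names (nothing here is taken from the pruning package). Under the Poisson model with rate lambda = totalFreq/D, a document contains the term at least once with probability 1 - e^(-lambda), and expIDF is the negative log of that probability.

```java
/** Hypothetical helper, not part of Lucene: expected IDF under a Poisson model. */
final class ExpectedIdfDemo {

    /**
     * expIDF = -log(1 - e^(-totalFreq/D)).
     * totalFreq is the number of occurrences of the term in the whole corpus,
     * not the frequency within a single document.
     */
    static double expectedIdf(long totalFreq, long numDocs) {
        if (totalFreq <= 0 || numDocs <= 0) {
            throw new IllegalArgumentException("totalFreq and numDocs must be positive");
        }
        double lambda = (double) totalFreq / numDocs;
        // -expm1(-lambda) equals 1 - e^(-lambda) but is more accurate for small lambda.
        return -Math.log(-Math.expm1(-lambda));
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        // The more occurrences overall, the lower the expected IDF.
        for (long totalFreq : new long[] {1, 10, 100, 1000, 10000}) {
            System.out.printf("totalFreq=%d expIDF=%.4f%n",
                    totalFreq, expectedIdf(totalFreq, numDocs));
        }
    }
}
```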
So, if it used the Robertson-Sparck-Jones IDF from Formula 1, Formula 2 in the
Blanco-Barreiro paper would look something like this:
{code}
RIDF = log(((D - df) + 0.5) / (df + 0.5)) + log( 1 - e^(-totalFreq/D) )
or: RIDF = -log((df + 0.5) / ((D - df) + 0.5)) + log( 1 - e^(-totalFreq/D) )
{code}
Now, comparing to the original formula from the Blanco-Barreiro paper we can
clearly see that it is similar, but it differs in the way it calculates IDF:
{code}
RIDF = - log(df/D) + log(1 - e^(-totalFreq/D)) (Formula 2)
{code}
Which means that even though they mention the Robertson-Sparck-Jones
normalization they don't use it (and neither do Murray and Renals in their
paper).
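
To make the comparison concrete, here is a small Java sketch that computes RIDF both ways: with the plain IDF of Formula 2 and with the Robertson-Sparck-Jones IDF. The names are hypothetical, natural logs are assumed, and this is not the pruning package code. Note that totalFreq is the corpus-wide occurrence count, which is the quantity the issue description says the calculation should use.

```java
/** Hypothetical helper, not part of Lucene: two RIDF variants side by side. */
final class ResidualIdfDemo {

    /** expIDF = -log(1 - e^(-totalFreq/D)), with totalFreq counted over the whole corpus. */
    static double expectedIdf(long totalFreq, long numDocs) {
        double lambda = (double) totalFreq / numDocs;
        return -Math.log(-Math.expm1(-lambda));
    }

    /** Formula 2: RIDF = -log(df/D) + log(1 - e^(-totalFreq/D)). */
    static double ridfPlain(long df, long totalFreq, long numDocs) {
        double idf = -Math.log((double) df / numDocs);
        return idf - expectedIdf(totalFreq, numDocs);
    }

    /** Variant using the smoothed Robertson-Sparck-Jones IDF instead of -log(df/D). */
    static double ridfRsj(long df, long totalFreq, long numDocs) {
        double idf = Math.log(((numDocs - df) + 0.5) / (df + 0.5));
        return idf - expectedIdf(totalFreq, numDocs);
    }

    public static void main(String[] args) {
        long numDocs = 1000;
        long df = 10;
        // Same document frequency, different total counts: once per document
        // versus ten occurrences per document on average.
        for (long totalFreq : new long[] {10, 100}) {
            System.out.printf("totalFreq=%d plain=%.4f rsj=%.4f%n",
                    totalFreq,
                    ridfPlain(df, totalFreq, numDocs),
                    ridfRsj(df, totalFreq, numDocs));
        }
    }
}
```

With df held fixed, a term whose occurrences are concentrated in few documents (totalFreq much larger than df) gets a higher RIDF than one that occurs about once per document, for which RIDF is close to zero.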
In short, I think Formula 2 as printed in the paper is correct, and it is our
code that has to be fixed. A patch is coming shortly; I still need to write a
unit test.
> Residual IDF calculation in the pruning package is wrong
> --------------------------------------------------------
>
> Key: LUCENE-3934
> URL: https://issues.apache.org/jira/browse/LUCENE-3934
> Project: Lucene - Java
> Issue Type: Bug
> Affects Versions: 3.5, 3.6
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
>
> As discussed on the mailing list
> (http://markmail.org/message/cwnyfqmet3wophec) there seems to be a bug in
> both the formula and in the way RIDF is calculated. The formula is missing a
> minus, but also the calculation uses local (in-document) term frequency
> instead of the total term frequency (sum of all term occurrences in a corpus).