Hi,
 
Can anyone help me understand the scoring function in the LMDirichletSimilarity 
class? 
 
The scoring function in LMDirichletSimilarity is shown below:
-------------------------------------------------------------------------------------------
float score = stats.getTotalBoost() * (float)(
    Math.log(1 + freq / (mu * ((LMStats)stats).getCollectionProbability())) +
    Math.log(mu / (docLen + mu))
);
-------------------------------------------------------------------------------------------
 
Mathematically, the expression inside the (float) cast above is 
log[ (tf + mu * P(w|C)) / ((docLen + mu) * P(w|C)) ], which, in terms of 
scoring, should be equivalent to 
-------------------------------------------------------------------------------------------
return score = (float) (
    (freq + mu * ((LMStats)stats).getCollectionProbability()) / (docLen + mu)
);
-------------------------------------------------------------------------------------------
which follows the textbook/paper formula exactly, because the division by 
P(w|C) is the same for all documents. However, I'm getting much worse results 
with the second piece of code.
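
For reference, here is a small standalone sketch (no Lucene dependency) that checks the algebra above. The class name, method names, and the values of freq, docLen, mu, and P(w|C) are made up for illustration; they are not part of Lucene. It only confirms that the two-log form from LMDirichletSimilarity equals the single-log form, and that the second snippet is that same quantity without the log and without the division by P(w|C).

```java
public class DirichletCheck {

    // Form used in the LMDirichletSimilarity snippet (boost omitted):
    // log(1 + tf / (mu * P(w|C))) + log(mu / (docLen + mu))
    public static double twoLogForm(double freq, double docLen, double mu, double collProb) {
        return Math.log(1 + freq / (mu * collProb)) + Math.log(mu / (docLen + mu));
    }

    // Single-log form: log[ (tf + mu * P(w|C)) / ((docLen + mu) * P(w|C)) ]
    public static double singleLogForm(double freq, double docLen, double mu, double collProb) {
        return Math.log((freq + mu * collProb) / ((docLen + mu) * collProb));
    }

    // Second snippet: smoothed probability, no log, no division by P(w|C)
    public static double smoothedProb(double freq, double docLen, double mu, double collProb) {
        return (freq + mu * collProb) / (docLen + mu);
    }

    public static void main(String[] args) {
        // Hypothetical values, chosen only to exercise the formulas.
        double freq = 3, docLen = 100, mu = 2000, collProb = 0.001;

        System.out.println("two-log form:    " + twoLogForm(freq, docLen, mu, collProb));
        System.out.println("single-log form: " + singleLogForm(freq, docLen, mu, collProb));
        // Taking the log of the second snippet after dividing by P(w|C)
        // recovers the same value as the two forms above.
        System.out.println("log(p / P(w|C)): "
                + Math.log(smoothedProb(freq, docLen, mu, collProb) / collProb));
    }
}
```

All three printed values should agree up to floating-point rounding for a single term and a single document.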

Can anyone help explain why this is happening? Am I missing something about the 
scoring?
 
 
Thanks,
Dong
