Hi,

Can anyone help me understand the scoring function in the LMDirichletSimilarity class?

The scoring function in LMDirichletSimilarity is shown below:

-------------------------------------------------------------------------------------------
float score = stats.getTotalBoost() * (float)(
    Math.log(1 + freq / (mu * ((LMStats)stats).getCollectionProbability())) +
    Math.log(mu / (docLen + mu))
);
-------------------------------------------------------------------------------------------

The math formula for the part inside the (float) cast above is

log[ (tf + mu * P(w|C)) / ((docLen + mu) * P(w|C)) ]

which, in terms of scoring, should be equivalent to

-------------------------------------------------------------------------------------------
return score = (float) (
    (freq + mu * ((LMStats)stats).getCollectionProbability()) / (docLen + mu)
);
-------------------------------------------------------------------------------------------

The second version is written exactly according to the textbook/paper, and I expected it to rank documents the same way because the division by P(w|C) is the same for all documents. However, I'm getting much worse results with the second piece of code.
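To rule out an algebra mistake on my side, here is a minimal sketch that compares the expression from the Lucene snippet (without the boost) against the closed form log[(tf + mu * P(w|C)) / ((docLen + mu) * P(w|C))]. The class name DirichletCheck, the method names luceneForm and closedForm, and all the numbers are made up for illustration; none of this is Lucene code.

```java
// Hypothetical stand-alone check, not part of Lucene.
class DirichletCheck {

    // Expression as written in the Lucene snippet, boost factor omitted.
    static double luceneForm(double freq, double docLen, double mu, double pwc) {
        return Math.log(1 + freq / (mu * pwc)) + Math.log(mu / (docLen + mu));
    }

    // Closed form: log[(tf + mu * P(w|C)) / ((docLen + mu) * P(w|C))].
    static double closedForm(double freq, double docLen, double mu, double pwc) {
        return Math.log((freq + mu * pwc) / ((docLen + mu) * pwc));
    }

    public static void main(String[] args) {
        double mu = 2000.0;   // assumed smoothing parameter
        double pwc = 0.001;   // assumed collection probability P(w|C)
        // Each row is {term frequency, document length}; values are invented.
        double[][] docs = { {3, 120}, {1, 40}, {0, 500} };
        for (double[] d : docs) {
            double a = luceneForm(d[0], d[1], mu, pwc);
            double b = closedForm(d[0], d[1], mu, pwc);
            System.out.printf("tf=%.0f docLen=%.0f lucene=%.6f closed=%.6f%n",
                    d[0], d[1], a, b);
        }
    }
}
```

The two columns should agree up to floating-point rounding, which only confirms the log formula itself; it does not cover my second piece of code, which drops both the log and the division by P(w|C).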
Can anyone help explain why this is happening? Am I missing something about the scoring?

Thanks,
Dong