I am having trouble getting collection probabilities for a term to show up in a CustomScoreQuery/CustomScoreProvider. Basically, I am trying to add a per-document weight that amounts to the sum (for each term in the query) of Math.log(collectionProbability). Can anyone help with this?
Or feel free to suggest a better way to do this. Here's a description...

-----

LMDirichletSimilarity is not consistent with the original equations, as many have noted. Here's how it's different, under two approaches:

1. *Swap in LMDirichletSimilarity* in place of some other similarity, but modify the scoring function. Ignoring the boost, it is currently implemented as:

    term_score_current = Math.log(1 + freq / (mu * collectionProbability)) + Math.log(mu / (docLen + mu))

If you do this, there are two problems. The first problem is that the score is off by an additive term of Math.log(collectionProbability). Do the math <http://en.wikipedia.org/wiki/List_of_logarithmic_identities>: if you add that in, you get something equivalent to the original formulation (e.g., in Zhai and Lafferty 2001). For reference, that looks like:

    term_score_official = Math.log( (freq + mu * collectionProbability) / (docLen + mu) )

If you add that term, though, the second problem arises. The Math.log(collectionProbability) term does not get added for terms that don't MATCH a document, because .score() doesn't get called if there's no MATCH. This is basically the problem that Ronan Cummins wrote about a few weeks ago.

2. *Leave LMDirichletSimilarity as it is* but *add a term* to every final score that is returned. (Note: you'd also need to remove the non-negative score restriction in LMDirichletSimilarity.) This would be the sum of the log collection probabilities for each term:

    query_score = sum(term_score_current) + sum(Math.log(collectionProbability))

As some have mentioned, this is basically an additive version of a queryNorm. It seems like the right way to do this is to wrap each query in a modified CustomScoreQuery accessing a CustomScoreProvider, which would then add that "constant" term across all documents. However, this "constant" term needs to be computed from statistics; how can this be done?
Those statistics are available in LMDirichletSimilarity, but it is less clear how to find them directly from a Query object.

stephen
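
P.S. In case it helps anyone check the algebra, here is a minimal plain-Java sketch (no Lucene dependency) of the two per-term formulas above and the additive query-level term. The class and method names are made up for illustration, and the values for mu, docLen, freq and collectionProbability are arbitrary. It also assumes every query term gets scored, including freq = 0, and that scores are not clamped at zero, which is not what happens inside Lucene.

```java
public class DirichletScoreCheck {

    // Per-term score as described above for the current implementation (boost ignored).
    static double currentScore(double freq, double docLen, double mu, double collectionProbability) {
        return Math.log(1 + freq / (mu * collectionProbability))
             + Math.log(mu / (docLen + mu));
    }

    // Per-term score in the original Dirichlet-smoothed form.
    static double officialScore(double freq, double docLen, double mu, double collectionProbability) {
        return Math.log((freq + mu * collectionProbability) / (docLen + mu));
    }

    // The additive, document-independent term: sum of log collection probabilities.
    static double logCollectionProbabilitySum(double[] collectionProbabilities) {
        double sum = 0.0;
        for (double p : collectionProbabilities) {
            sum += Math.log(p);
        }
        return sum;
    }

    public static void main(String[] args) {
        double mu = 2000.0;                 // arbitrary smoothing parameter
        double docLen = 150.0;              // arbitrary document length
        double[] freqs = {3.0, 0.0};        // second query term does not occur in the document
        double[] probs = {1e-4, 5e-6};      // arbitrary collection probabilities

        double current = 0.0;
        double official = 0.0;
        for (int i = 0; i < freqs.length; i++) {
            current += currentScore(freqs[i], docLen, mu, probs[i]);
            official += officialScore(freqs[i], docLen, mu, probs[i]);
        }
        double corrected = current + logCollectionProbabilitySum(probs);

        System.out.println("sum(term_score_current)  = " + current);
        System.out.println("sum(term_score_official) = " + official);
        System.out.println("current + sum(log p)     = " + corrected);
    }
}
```

The last two printed values should agree up to floating-point error, which is the identity in approach 2. The sketch does not address the actual question of how to get the collection probabilities from a Query object.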