So looking at the tests makes things look less horrifying. From org.apache.mahout.math.stats.LogLikelihoodTest#testLogLikelihood:
    assertEquals(2.772589, LogLikelihood.logLikelihoodRatio(1, 0, 0, 1), 0.000001);
    assertEquals(27.72589, LogLikelihood.logLikelihoodRatio(10, 0, 0, 10), 0.00001);
    assertEquals(39.33052, LogLikelihood.logLikelihoodRatio(5, 1995, 0, 100000), 0.00001);
    assertEquals(4730.737, LogLikelihood.logLikelihoodRatio(1000, 1995, 1000, 100000), 0.001);
    assertEquals(5734.343, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 100000), 0.001);
    assertEquals(5714.932, LogLikelihood.logLikelihoodRatio(1000, 1000, 1000, 99000), 0.001);

The next step is to determine whether these values are correct. I recognize
the first two. I put these values into my R script and got a successful
load. I think this means that the code is somehow correct, regardless of
your reading of it. I don't have time right now to read the code in detail,
but I think that things are working.

You can find my R code at https://dl.dropboxusercontent.com/u/36863361/llr.R

On Mon, Jun 3, 2013 at 3:41 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> This is a horrifying possibility. I thought we had several test cases in
> place to verify this code.
>
> Let me look. I wonder if the code you have found is not referenced
> somehow.
>
>
> On Sun, Jun 2, 2013 at 11:23 PM, 陈文龙 <qzche...@gmail.com> wrote:
>
>> The definition of
>> org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(long k11,
>> long k12, long k21, long k22):
>>
>>   public static double logLikelihoodRatio(long k11, long k12, long k21,
>>                                           long k22) {
>>     Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
>>     // note that we have counts here, not probabilities, and that the
>>     // entropy is not normalized.
>>     double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
>>     double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
>>     double matrixEntropy = entropy(k11, k12, k21, k22);
>>     if (rowEntropy + columnEntropy > matrixEntropy) {
>>       // round off error
>>       return 0.0;
>>     }
>>     return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
>>   }
>>
>> The rowEntropy and columnEntropy computed here might be wrong; I think
>> they should be:
>>
>>   double rowEntropy = entropy(k11 + k12, k21 + k22);
>>   double columnEntropy = entropy(k11 + k21, k12 + k22);
>>
>> which is the same as LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
>> from http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html .
>>
>> LLR = G2 = 2 * N * I, where N is the sample size (k11 + k12 + k21 + k22
>> in this example) and I is the mutual information.
>>
>> [inline image 1: the mutual information formula], where x is the value
>> of event A (1 or 2) and y is the value of event B (1 or 2);
>> p(x,y) = kxy/N and p(x) = p(x,1) + p(x,2), e.g. p(1,1) = k11/N.
>>
>> [inline image 2: rewriting I in terms of entropies], from which we get
>> mutual_information = H(k) - H(rowSums(k)) - H(colSums(k)).
>>
>> Since the Mahout version of the unnormalized entropy gives
>> entropy(k11, k12, k21, k22) = N * H(k), we get:
>>
>>   entropy(k11, k12, k21, k22) - entropy(k11 + k12, k21 + k22)
>>       - entropy(k11 + k21, k12 + k22)
>>           = N * (H(k) - H(rowSums(k)) - H(colSums(k)))
>>
>> which, multiplied by 2.0, is just the LLR.
>>
>> Is org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio wrong,
>> or have I misunderstood something?
>>
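For what it's worth, the two readings appear to be algebraically equivalent, which would explain why the tests pass either way. With the unnormalized entropy H_u(k1..kn) = -sum(k_i * ln(k_i / N)), one has H_u(k11, k12) + H_u(k21, k22) = H_u(k) - H_u(rowSums), so Mahout's 2 * (matrixEntropy - rowEntropy - columnEntropy) collapses to 2 * (H_u(rowSums) + H_u(colSums) - H_u(k)) = 2 * N * I. Here is a minimal, self-contained Java sketch checking both forms numerically against the test values above (the class name LlrSketch and this entropy helper are mine, not Mahout source):

```java
// Sketch: compare the two readings of the LLR formula discussed in
// this thread. Not Mahout code; just the same math, written out.
public class LlrSketch {

    // Unnormalized entropy: -sum(k_i * ln(k_i / N)), with 0 ln 0 = 0.
    static double entropy(long... counts) {
        long n = 0;
        for (long k : counts) {
            n += k;
        }
        double sum = 0.0;
        for (long k : counts) {
            if (k > 0) {
                sum += k * Math.log((double) k / n);
            }
        }
        return -sum;
    }

    // Mahout's formulation: row/column entropies taken cell by cell.
    static double llrMahout(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11, k12) + entropy(k21, k22);
        double columnEntropy = entropy(k11, k21) + entropy(k12, k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (matrixEntropy - rowEntropy - columnEntropy);
    }

    // The 2 * N * I formulation: entropies of the marginal sums.
    static double llrMutualInfo(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    public static void main(String[] args) {
        long[][] cases = {
            {1, 0, 0, 1}, {10, 0, 0, 10},
            {5, 1995, 0, 100000}, {1000, 1995, 1000, 100000}
        };
        for (long[] c : cases) {
            // Both forms should print the same number for every case.
            System.out.printf("%.4f %.4f%n",
                    llrMahout(c[0], c[1], c[2], c[3]),
                    llrMutualInfo(c[0], c[1], c[2], c[3]));
        }
    }
}
```

If the identity holds, each line prints the same value twice, and those values match the assertEquals constants from the test (2.7726, 27.7259, 39.3305, 4730.737). Note the sketch omits Mahout's round-off guard, which clamps tiny negative results to 0.0.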