Do search the mahout dev mailing list archives. Ted has explained LLR many times, in many contexts: in terms of clustering, in terms of n-gram generation, and in terms of recommendations. You can also find his paper "Accurate Methods for the Statistics of Surprise and Coincidence" at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.96.3920 -- it has over 1000 citations. No wonder Sean is not able to explain it properly; it fits a whole lot of scenarios :)
Robin

On Thu, Mar 18, 2010 at 5:36 PM, Sean Owen <[email protected]> wrote:
> Maybe we can conjure Ted to explain it, since he's the person I nicked
> this from, and I have trouble explaining it intuitively.
>
> This basically explains what's going on:
> http://en.wikipedia.org/wiki/Likelihood-ratio_test
>
> But translating that to the code below is not obvious. I can try to help:
>
>     return 2.0 * (logL(k1 / n1, k1, n1)
>                 + logL(k2 / n2, k2, n2)
>                 - logL(p, k1, n1)
>                 - logL(p, k2, n2));
>
> The four "logL" terms are really two pairs of terms. The first two
> represent the likelihood of an 'alternative' model that says the two
> users' preferences overlap not due to chance, but because they're
> similar. The last two represent the likelihood of a 'null' model which
> says the two users aren't similar and any overlap is due to chance.
>
> logL is like this:
>
>     return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
>
> It's really trying to express:
>
>     log(p^k * (1-p)^(n-k))
>
> So, for example, you can understand logL(p, k1, n1) like this: p is the
> ratio of the number of items user 1 rated to all items (what % of all
> items he/she rated). It's the probability of a random item being one of
> the preferred items. Out of n1 trials (the number of user 2's prefs),
> there were k1 "successes" (an item in user 2's prefs happened to fall in
> the intersection), and n1 - k1 failures. "Success" has probability p,
> "failure" has probability 1 - p. The likelihood follows a binomial
> distribution, and the probability of this outcome is p^k * (1-p)^(n-k)
> -- times a constant of n choose k, but we can drop it.
>
> So this term is, in a sense, an expression of the (log of the)
> likelihood that this overlap occurred just by chance.
>
> The other terms are understood in a similar way, but it takes more
> thought to get your head around them.
>
> I can't explain it well. Ted, help!
>
> On Thu, Mar 18, 2010 at 6:18 AM, Cui tony <[email protected]> wrote:
> > Hi,
> > I read this function in the source code of Taste, and I have some
> > questions about the algorithm used for the similarity calculation:
> >
> >     public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
> >       int preferring1and2 = dataModel.getNumUsersWithPreferenceFor(itemID1, itemID2);
> >       if (preferring1and2 == 0) {
> >         return Double.NaN;
> >       }
> >       int preferring1 = dataModel.getNumUsersWithPreferenceFor(itemID1);
> >       int preferring2 = dataModel.getNumUsersWithPreferenceFor(itemID2);
> >       int numUsers = dataModel.getNumUsers();
> >       double logLikelihood =
> >           twoLogLambda(preferring1and2, preferring1 - preferring1and2,
> >                        preferring2, numUsers - preferring2);
> >       return 1.0 - 1.0 / (1.0 + logLikelihood);
> >     }
> >
> >     static double twoLogLambda(double k1, double k2, double n1, double n2) {
> >       double p = (k1 + k2) / (n1 + n2);
> >       return 2.0 * (logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2)
> >                   - logL(p, k1, n1) - logL(p, k2, n2));
> >     }
> >
> > Is there an academic paper on this function? Why should we calculate the
> > similarity using the formula above?
> >
> > Thanks!
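For reference, below is a minimal, self-contained sketch of the computation discussed above. twoLogLambda and logL follow the quoted Taste code; safeLog is assumed to return 0 for non-positive arguments (so k * log(p) terms vanish when k is 0), and the class name and the counts in main() are made up purely for illustration.

    // Self-contained sketch of the log-likelihood similarity discussed above.
    public class LogLikelihoodSketch {

      // log(x), but 0 when x <= 0, so that 0 * log(0) terms drop out
      // (assumed behavior; the real safeLog in Taste may differ in detail)
      static double safeLog(double x) {
        return x <= 0.0 ? 0.0 : Math.log(x);
      }

      // log likelihood of k successes in n Bernoulli(p) trials,
      // i.e. log(p^k * (1-p)^(n-k)), dropping the constant n-choose-k term
      static double logL(double p, double k, double n) {
        return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
      }

      // 2 * log(lambda): alternative model with separate rates k1/n1 and k2/n2,
      // minus null model with a single pooled rate p
      static double twoLogLambda(double k1, double k2, double n1, double n2) {
        double p = (k1 + k2) / (n1 + n2);
        return 2.0 * (logL(k1 / n1, k1, n1)
                    + logL(k2 / n2, k2, n2)
                    - logL(p, k1, n1)
                    - logL(p, k2, n2));
      }

      public static void main(String[] args) {
        // Toy counts: 1000 users; item1 preferred by 100, item2 by 80, both by 40
        int numUsers = 1000;
        int preferring1 = 100;
        int preferring2 = 80;
        int preferring1and2 = 40;

        double logLikelihood = twoLogLambda(
            preferring1and2,                // k1: users who prefer both items
            preferring1 - preferring1and2,  // k2: users who prefer item1 but not item2
            preferring2,                    // n1: users who prefer item2
            numUsers - preferring2);        // n2: users who do not prefer item2

        // Same mapping to [0, 1) as in itemSimilarity() above
        double similarity = 1.0 - 1.0 / (1.0 + logLikelihood);
        System.out.println("2*logLambda = " + logLikelihood
            + ", similarity = " + similarity);
      }
    }

A larger 2*logLambda means the overlap is harder to explain as chance, and the 1.0 - 1.0 / (1.0 + logLikelihood) mapping simply squashes that unbounded score into [0, 1).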
