Maybe we can conjure Ted to explain it, since he's the person I nicked this from, and I have trouble explaining it intuitively.
This basically explains what's going on: http://en.wikipedia.org/wiki/Likelihood-ratio_test

But translating that to the code below is not obvious. I can try to help:

  return 2.0 * (logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2)
      - logL(p, k1, n1) - logL(p, k2, n2));

The four "logL" terms are really two pairs. The first two represent the likelihood under an 'alternative' model, which says that the two users' preferences overlap not due to chance but because they're similar. The last two represent the likelihood under a 'null' model, which says the two users aren't similar and any overlap is due to chance.

logL is like this:

  return k * safeLog(p) + (n - k) * safeLog(1.0 - p);

It's really trying to express:

  log(p^k * (1-p)^(n-k))

So, for example, you can understand logL(p, k1, n1) like this: p is the ratio of the number of items user 1 rated to all items (what percentage of all items he/she rated). It's the probability that a random item is one of user 1's preferred items. Out of n1 trials (the number of user 2's prefs), there were k1 "successes" (prefs of user 2 that happened to fall in the intersection) and n1 - k1 failures. "Success" has probability p, "failure" has probability 1 - p. The number of successes follows a binomial distribution, and the probability of this particular outcome is p^k1 * (1-p)^(n1-k1) -- times a constant, n1 choose k1, which we can drop. So this term is, in a sense, an expression of the (log of the) likelihood that the overlap occurred just randomly.

The other terms are understood in a similar way, though it takes more thought to get your head around them. I can't explain it well -- Ted, help! (I've also put a small self-contained sketch of the whole computation after the quoted message below.)

On Thu, Mar 18, 2010 at 6:18 AM, Cui tony <[email protected]> wrote:
> Hi,
>   I read this function in the source code of Taste, and I have some
> questions about the similarity calculation algorithm:
>
>  public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
>    int preferring1and2 = dataModel.getNumUsersWithPreferenceFor(itemID1, itemID2);
>    if (preferring1and2 == 0) {
>      return Double.NaN;
>    }
>    int preferring1 = dataModel.getNumUsersWithPreferenceFor(itemID1);
>    int preferring2 = dataModel.getNumUsersWithPreferenceFor(itemID2);
>    int numUsers = dataModel.getNumUsers();
>    double logLikelihood =
>        twoLogLambda(preferring1and2, preferring1 - preferring1and2,
>            preferring2, numUsers - preferring2);
>    return 1.0 - 1.0 / (1.0 + logLikelihood);
>  }
>
>  static double twoLogLambda(double k1, double k2, double n1, double n2) {
>    double p = (k1 + k2) / (n1 + n2);
>    return 2.0 * (logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2)
>        - logL(p, k1, n1) - logL(p, k2, n2));
>  }
>
> Is there any academic paper on this function? Why should we calculate
> the similarity with the formula above?
>
> Thanks!
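For the curious, here is a minimal, self-contained sketch of the computation the thread describes, assembled from the snippets above. The safeLog guard, the class name, and the example counts are illustrative assumptions on my part, not Mahout's actual implementation.

  public class LogLikelihoodSketch {

    // Guard against log(0). Whenever p is 0 or 1 here, the coefficient
    // multiplying the offending log term (k or n - k) is also 0, so
    // treating log(0) as 0 gives the right answer.
    static double safeLog(double x) {
      return x <= 0.0 ? 0.0 : Math.log(x);
    }

    // log(p^k * (1-p)^(n-k)): the log-likelihood of k successes in n
    // Bernoulli(p) trials, with the constant n-choose-k term dropped.
    static double logL(double p, double k, double n) {
      return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
    }

    // Twice the log of the likelihood ratio: the alternative model lets
    // each sample keep its own rate (k1/n1, k2/n2); the null model forces
    // both to share the pooled rate p.
    static double twoLogLambda(double k1, double k2, double n1, double n2) {
      double p = (k1 + k2) / (n1 + n2);
      return 2.0 * (logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2)
          - logL(p, k1, n1) - logL(p, k2, n2));
    }

    public static void main(String[] args) {
      // Hypothetical counts: of 1000 users, 100 prefer item 1, 80 prefer
      // item 2, and 30 prefer both -- far more overlap than the ~8 users
      // you'd expect by chance, so the similarity should come out high.
      int numUsers = 1000;
      int preferring1 = 100;
      int preferring2 = 80;
      int preferring1and2 = 30;

      double llr = twoLogLambda(preferring1and2, preferring1 - preferring1and2,
          preferring2, numUsers - preferring2);
      double similarity = 1.0 - 1.0 / (1.0 + llr);
      System.out.println("2 log lambda = " + llr);        // roughly 49
      System.out.println("similarity   = " + similarity); // roughly 0.98
    }
  }

The final mapping is worth a note: 2 log lambda is unbounded above, and 1.0 - 1.0 / (1.0 + logLikelihood) squashes it into [0, 1), so stronger evidence that the overlap isn't chance yields a similarity closer to 1.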
