Maybe we can conjure Ted to explain it, since he's the person I nicked
this from, and I have trouble explaining it intuitively.

This basically explains what's going on:
http://en.wikipedia.org/wiki/Likelihood-ratio_test

But translating to the code below is not obvious. I can try to help:

return 2.0 * (logL(k1 / n1, k1, n1)
                  + logL(k2 / n2, k2, n2)
                  - logL(p, k1, n1)
                  - logL(p, k2, n2));

The four "logL" terms are really two pairs of terms. The first two
represent the likelihood under an 'alternative' model that says the two
users' preferences overlap not due to chance, but because they're
similar. The last two represent the likelihood under a 'null' model
that says the two users aren't similar and any overlap is due to chance.

logL is like this:

    return k * safeLog(p) + (n - k) * safeLog(1.0 - p);

It's really trying to express:

  log (p^k * (1-p)^(n-k))
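
In Java, a self-contained sketch of logL (safeLog isn't shown in this
thread, so its behavior here is an assumption: it returns 0 for
non-positive inputs so that 0 * log(0) terms drop out rather than
producing NaN):

```java
public class LogLSketch {

  // Assumed guard: plain Math.log(0) is -Infinity, which would poison the
  // sum when k == 0 or k == n; returning 0 makes 0 * log(0) terms vanish.
  static double safeLog(double x) {
    return x <= 0.0 ? 0.0 : Math.log(x);
  }

  // log of p^k * (1-p)^(n-k): the binomial log-likelihood, without the
  // constant n-choose-k factor.
  static double logL(double p, double k, double n) {
    return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
  }

  public static void main(String[] args) {
    System.out.println(logL(0.5, 10, 20));  // 20 * log(0.5)
    System.out.println(logL(0.0, 0, 20));   // degenerate case: 0, not NaN
  }
}
```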

So for example you can understand logL(p, k1, n1) like this: p is the
ratio of the number of items user 1 rated to all items (what % of all
items he/she rated). It's the probability of a random item being one
of the preferred items. Out of n1 trials (the number of user 2 prefs),
there were k1 "successes" (items in user 2's prefs that happened to
fall in the intersection) and n1 - k1 failures. "Success" has
probability p, "failure" has probability 1 - p. The number of successes
follows a binomial distribution, and the probability of this outcome is
p^k * (1-p)^(n-k) -- times a constant of n choose k, but we can drop
it, since the same constant appears in both the alternative and null
terms and cancels in the ratio.

So this term is in a sense an expression of the (log of the)
likelihood that this occurred just randomly.

The other terms are understood in a similar way, though they take a
bit more thought to get your head around.
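
To see the two models pull apart, here is a self-contained sketch of
the whole statistic with two worked cases (again, safeLog isn't shown
in this thread; this version assumes it returns 0 for non-positive
inputs):

```java
public class TwoLogLambdaSketch {

  // Assumed: returns 0 for non-positive inputs so 0 * log(0) terms vanish.
  static double safeLog(double x) {
    return x <= 0.0 ? 0.0 : Math.log(x);
  }

  // log of p^k * (1-p)^(n-k), dropping the constant n-choose-k factor.
  static double logL(double p, double k, double n) {
    return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
  }

  // Twice the log of the likelihood ratio between the 'alternative'
  // model (each sample keeps its own rate k/n) and the 'null' model
  // (both samples share the pooled rate p).
  static double twoLogLambda(double k1, double k2, double n1, double n2) {
    double p = (k1 + k2) / (n1 + n2);
    return 2.0 * (logL(k1 / n1, k1, n1)
                + logL(k2 / n2, k2, n2)
                - logL(p, k1, n1)
                - logL(p, k2, n2));
  }

  public static void main(String[] args) {
    // Very different rates (10/20 vs 5/80): the per-sample rates fit
    // much better than the pooled rate, so the statistic is large.
    System.out.println(twoLogLambda(10, 5, 20, 80));

    // Identical rates (10/20 vs 40/80): the pooled rate fits exactly as
    // well as the per-sample rates, so the statistic is 0.
    System.out.println(twoLogLambda(10, 40, 20, 80));  // prints 0.0
  }
}
```

The second case makes the intuition concrete: when the overlap is
exactly what the pooled rate predicts, the alternative model buys
nothing over the null and the score is zero.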


I can't explain it well, Ted, help!


On Thu, Mar 18, 2010 at 6:18 AM, Cui tony <[email protected]> wrote:
> Hi,
>  I read this function in the source code of taste, and I have some
> questions on the algorithm of similarity calculation:
>
> public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
>    int preferring1and2 = dataModel.getNumUsersWithPreferenceFor(itemID1, itemID2);
>    if (preferring1and2 == 0) {
>      return Double.NaN;
>    }
>    int preferring1 = dataModel.getNumUsersWithPreferenceFor(itemID1);
>    int preferring2 = dataModel.getNumUsersWithPreferenceFor(itemID2);
>    int numUsers = dataModel.getNumUsers();
>    double logLikelihood =
>        twoLogLambda(preferring1and2, preferring1 - preferring1and2,
>                     preferring2, numUsers - preferring2);
>    return 1.0 - 1.0 / (1.0 + logLikelihood);
>  }
>
> static double twoLogLambda(double k1, double k2, double n1, double n2) {
>    double p = (k1 + k2) / (n1 + n2);
>    return 2.0 * (logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2)
>                - logL(p, k1, n1) - logL(p, k2, n2));
>  }
>
> Is there any academic paper on this function? Why should we calculate the
> similarity with the above formula?
>
> Thanks!
>
