Do search the Mahout dev mailing list archives. Ted has explained LLR many
times in many contexts: in terms of clustering, in terms of ngram generation,
and in terms of recommendations. You can also find his paper "Accurate
Methods for the Statistics of Surprise and Coincidence"
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.96.3920 -- it has over
1,000 citations. No wonder Sean is not able to explain it properly; it fits
a whole lot of scenarios :)

Robin


On Thu, Mar 18, 2010 at 5:36 PM, Sean Owen <[email protected]> wrote:

> Maybe we can conjure Ted to explain it, since he's the person I nicked
> this from, and I have trouble explaining it intuitively.
>
> This basically explains what's going on:
> http://en.wikipedia.org/wiki/Likelihood-ratio_test
>
> But translating to the code below is not obvious. I can try to help:
>
> return 2.0 * (logL(k1 / n1, k1, n1)
>                  + logL(k2 / n2, k2, n2)
>                  - logL(p, k1, n1)
>                  - logL(p, k2, n2));
>
> The four "logL" terms are really two pairs of terms. The first two
> represent the likelihood of an 'alternative' model that says that two
> users' preferences overlap not due to chance, but because they're
> similar. The last two represent the likelihood of a 'null' model that
> says the two users aren't similar and any overlap is just due to chance.
>
> logL is like this:
>
>    return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
>
> It's really trying to express:
>
>  log (p^k * (1-p)^(n-k))
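>
> (safeLog isn't shown above; I'd guess it just guards against log(0) when
> k == 0 or k == n -- something like:
>
>    static double safeLog(double x) {
>      return x <= 0.0 ? 0.0 : Math.log(x);   // treat log(0) as 0 so the 0 * log(0) terms vanish
>    }
>
> so the degenerate terms contribute 0 instead of NaN.)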
>
> So for example you can understand logL(p, k1, n1) like this: p is the
> ratio of the number of items user 1 rated to all items (what % of all
> items he/she rated). It's the probability that a random item is one of
> user 1's preferred items. Out of n1 trials (the number of user 2's
> prefs), there were k1 "successes" (items in user 2's prefs that happened
> to fall in the intersection) and n1 - k1 failures. "Success" has
> probability p, "failure" has probability 1-p. The count of successes
> follows a binomial distribution, and the probability of this outcome is
> p^k * (1-p)^(n-k) -- times a constant factor of n choose k, but we can
> drop it.
>
> So this term is in a sense an expression of the (log of the)
> likelihood that this occurred just randomly.
>
> The other terms are understood in a similar way but it takes more
> thought to get your head around it.
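>
> To make that concrete, here's a tiny standalone sketch (made-up counts
> and a throwaway class, not the real Taste code) that pushes some numbers
> through the same formula:
>
>    public class LLRExample {
>      static double safeLog(double x) {
>        return x <= 0.0 ? 0.0 : Math.log(x);
>      }
>      static double logL(double p, double k, double n) {
>        return k * safeLog(p) + (n - k) * safeLog(1.0 - p);
>      }
>      static double twoLogLambda(double k1, double k2, double n1, double n2) {
>        double p = (k1 + k2) / (n1 + n2);
>        return 2.0 * (logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2)
>                      - logL(p, k1, n1) - logL(p, k2, n2));
>      }
>      public static void main(String[] args) {
>        // Made-up counts: 1000 users, 100 prefer item A, 200 prefer item B,
>        // 50 prefer both. Mapped the way itemSimilarity (below) does it:
>        // k1 = both = 50, k2 = A only = 50, n1 = prefer B = 200, n2 = rest = 800.
>        double llr = twoLogLambda(50, 50, 200, 800);
>        System.out.println(llr);                      // large value -> strong association
>        System.out.println(1.0 - 1.0 / (1.0 + llr));  // squashed into [0,1) as itemSimilarity does
>      }
>    }
>
> The further the overlap is from what chance alone would predict, the
> bigger this number gets, and the closer the final similarity gets to 1.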
>
>
> I can't explain it well, Ted, help!
>
>
> On Thu, Mar 18, 2010 at 6:18 AM, Cui tony <[email protected]> wrote:
> > Hi,
> >  I read this function in the source code of Taste, and I have some
> > questions about the similarity calculation algorithm:
> >
> > public double itemSimilarity(long itemID1, long itemID2) throws TasteException {
> >    int preferring1and2 = dataModel.getNumUsersWithPreferenceFor(itemID1, itemID2);
> >    if (preferring1and2 == 0) {
> >      return Double.NaN;
> >    }
> >    int preferring1 = dataModel.getNumUsersWithPreferenceFor(itemID1);
> >    int preferring2 = dataModel.getNumUsersWithPreferenceFor(itemID2);
> >    int numUsers = dataModel.getNumUsers();
> >    double logLikelihood =
> >        twoLogLambda(preferring1and2, preferring1 - preferring1and2,
> >                     preferring2, numUsers - preferring2);
> >    return 1.0 - 1.0 / (1.0 + logLikelihood);
> >  }
> >
> > static double twoLogLambda(double k1, double k2, double n1, double n2) {
> >    double p = (k1 + k2) / (n1 + n2);
> >    return 2.0 * (logL(k1 / n1, k1, n1) + logL(k2 / n2, k2, n2)
> >                  - logL(p, k1, n1) - logL(p, k2, n2));
> >  }
> >
> > Is there any academic paper on this function? Why should we calculate the
> > similarity with the above formula?
> >
> > Thanks!
> >
>
