I can try to explain one understanding of the meaning, though it is
not really the intuitive explanation of the formulation in Mahout, but
rather a somewhat different one I originally used. And even that I
only 80% understand.

Two users are similar when they rate, or are associated with, many of
the same items. However, a given overlap may or may not be meaningful
-- it could be due to chance, or it could be because our tastes really
are similar. For example, if you and I have rated 100 items each, and
50 overlap, we're probably similar. But if we've each rated 1000 and
overlap in only 50, maybe we're not.
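
For concreteness (my own illustration, not anything specific to the
Mahout code), you can lay the counts out as a 2x2 table -- this is the
"k" matrix Ted refers to further down:

                          item in your list    item not in your list
   item in my list              k_11                   k_12
   item not in my list          k_21                   k_22

So k_11 is the number of items we both have, k_12 the number only I
have, k_21 the number only you have, and k_22 the number neither of us
has, out of all items.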

The log-likelihood metric is just trying to formally quantify how
unlikely it is that our overlap is due to chance. The less likely, the
more similar we are.

So it is comparing two likelihoods and just looking at their ratio.
The numerator is the likelihood under the null hypothesis: we're not
similar, and the overlap is due to chance. The denominator is the
likelihood that it's not due to chance at all -- that the overlap is
perfectly explained by our tastes being similar, and is exactly what
you'd expect given that.

When the numerator is relatively small, the null hypothesis is
relatively unlikely, so we are similar.
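
To make the two likelihoods concrete, here is a minimal sketch (mine,
not the Mahout implementation) that models each item as "in your list
or not" with a binomial: the null model fits one common rate to my
items and to everything else, while the alternative model fits a
separate rate to each. The totals (1000 and 10000 items overall) are
made up just to have numbers to plug in.

  public class LlrSketch {

    // k * log(p) + (n - k) * log(1 - p), treating 0 * log(0) as 0
    static double logL(long k, long n, double p) {
      double a = k == 0 ? 0.0 : k * Math.log(p);
      double b = (n - k) == 0 ? 0.0 : (n - k) * Math.log(1.0 - p);
      return a + b;
    }

    // k11 = items we both have, k12 = items only I have,
    // k21 = items only you have, k22 = items neither of us has
    static double llr(long k11, long k12, long k21, long k22) {
      long n1 = k11 + k12;                           // my items
      long n2 = k21 + k22;                           // everything else
      double p  = (double) (k11 + k21) / (n1 + n2);  // null: one shared rate
      double p1 = (double) k11 / n1;                 // alternative: rate among my items
      double p2 = (double) k21 / n2;                 // alternative: rate among the rest
      double llNull = logL(k11, n1, p)  + logL(k21, n2, p);
      double llAlt  = logL(k11, n1, p1) + logL(k21, n2, p2);
      return 2.0 * (llAlt - llNull);                 // = -2 * log(likelihood ratio)
    }

    public static void main(String[] args) {
      // 100 items each, 50 in common, out of (say) 1000 items total
      System.out.println(llr(50, 50, 50, 850));
      // 1000 items each, 50 in common, out of (say) 10000 items total
      System.out.println(llr(50, 950, 950, 8050));
    }
  }

The first case comes out noticeably larger than the second, which
matches the intuition above.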

The reason the formulation typically then takes -2.0 * log(likelihood
ratio) is by convention, and it makes the result a bit more useful in
two ways. One, more similarity gives a higher value, which is perhaps
more intuitive than the likelihood ratio itself, which is lower when
similarity is higher. But the real reason is that the value then
follows a chi-squared distribution, and the result can be used to
actually figure a probability that the users are similar or not. (We
don't use it that way in Mahout though.)
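
If you did want to turn the statistic into a probability, you could
compare it against a chi-squared distribution with one degree of
freedom (one, for a 2x2 table). A sketch, assuming something like
Commons Math is on hand -- again, this is not something Mahout itself
does:

  import org.apache.commons.math3.distribution.ChiSquaredDistribution;

  class ChanceCheck {
    // probability of seeing an overlap at least this extreme if the
    // null hypothesis "it's just chance" were actually true
    static double pValue(double llr) {
      return 1.0 - new ChiSquaredDistribution(1).cumulativeProbability(llr);
    }
  }

A tiny p-value means chance is a poor explanation for the overlap.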

And Ted's formulation, which is also right and quite tidy and the one
used in the project, is based on Shannon entropy. I understand it, I
believe, but would have to think more about an intuitive explanation.

It is, similarly, trying to figure out whether the co-occurrences are
"unusually" frequent, by asking whether there is any additional
information to be gained by looking at user 1's and user 2's
preferences separately versus everything at once. If there is, then
there is something special relating user 1 and user 2, and they're
similar.
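
In code, my reading of that entropy formula (the
LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k))) quoted below)
looks roughly like this -- a sketch, not the project's actual
implementation. Note that h() here is sum((k_i/N) * log(k_i/N)), i.e.
Shannon entropy without the usual minus sign, which is the convention
that makes the quoted formula come out non-negative:

  class EntropyLlrSketch {

    // sum of (k_i / N) * log(k_i / N) over the counts
    static double h(long... counts) {
      long total = 0;
      for (long k : counts) {
        total += k;
      }
      double sum = 0.0;
      for (long k : counts) {
        if (k > 0) {
          double p = (double) k / total;
          sum += p * Math.log(p);
        }
      }
      return sum;
    }

    // k11 = items both users have, k12 = only user 1,
    // k21 = only user 2, k22 = neither
    static double llr(long k11, long k12, long k21, long k22) {
      long total = k11 + k12 + k21 + k22;
      double hMatrix = h(k11, k12, k21, k22);    // H(k)
      double hRows   = h(k11 + k12, k21 + k22);  // H(rowSums(k))
      double hCols   = h(k11 + k21, k12 + k22);  // H(colSums(k))
      return 2.0 * total * (hMatrix - hRows - hCols);
    }
  }

Plugging in the same counts as in the earlier sketch gives the same
numbers, and the quantity in parentheses is exactly the mutual
information (in nats) between "item is in user 1's list" and "item is
in user 2's list" -- which is the "extra information from looking at
the two users together" idea above.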


On Mon, May 9, 2011 at 3:09 AM, Thomas Söhngen <[email protected]> wrote:
> Thank you for the explanation. I can understand the calculations now, but I
> still don't get the meaning. I think I'll try to sleep a night over it and
> try again tomorrow.
>
> Best regards,
> Thomas
>
> On 09.05.2011 03:42, Ted Dunning wrote:
>>
>> In this notation, k is assumed to be a matrix.  k_11 is the element in the
>> first row and first column.
>>
>> I used k to sound like count.
>>
>> The notation that you quote is R syntax.  rowSums is a function that
>> computes the row-wise sums of the argument k.  H is a function defined
>> elsewhere.
>>
>> On Sun, May 8, 2011 at 6:33 PM, Thomas Söhngen <[email protected]> wrote:
>>
>>> Thank you for the blog post and showing me the G-test formula.
>>>
>>> After going through your blog post, I still have some open questions: You
>>> introduce k_11 to k_22, but I don't understand what "k" itself actually
>>> stands for in your formula and how the sums are defined: LLR = 2 sum(k)
>>> (H(k) - H(rowSums(k)) - H(colSums(k)))
>>>
>>> On 09.05.2011 02:46, Ted Dunning wrote:
>>>>
>>>> My guess is that the OP was asking about the generalized log-likelihood
>>>> ratio test used in the Mahout recommendation framework.
>>>>
>>>> That is a bit different from what you describe in that it is the log of
>>>> the ratio of two maximum likelihoods.
>>>>
>>>> See http://en.wikipedia.org/wiki/G-test for a definition of the test
>>>> used in Mahout.
>>>>
>>>> On Sun, May 8, 2011 at 5:43 PM, Jeremy Lewi <[email protected]> wrote:
>>>>
>>>>> Thomas,
>>>>>
>>>>> Are you asking a general question about log-likelihood or a specific
>>>>> implementation usage in Mahout?
>>>>>
>>>>> In general the likelihood is just a number between 0 and 1, which
>>>>> measures the probability of observing some data under some
>>>>> distribution.
>>>>>
>>>>>
>>>>>
>
