No, not in the distributed version. I think this would be a bad idea
performance-wise. Suddenly a sparse user-item matrix becomes dense --
a million data points can become a trillion.

However, I think you can get something close to what you want by
implementing a variant of the Euclidean implementation that "pretends"
these values were filled in.

Say two items overlap in A users (both items are "1" for those A
users), and B other users have a "1" for one of the two items but not
both. The Euclidean distance between the two item vectors is then
sqrt(B). If I recall correctly from what Sebastian did, the "weight"
arguments you get in doComputeResult() are the occurrence counts of
each item, and "A" is just the number of cooccurrences you get in this
method. Since each weight counts the A shared users plus that item's
unique users, B = weightA + weightB - 2 * (# of cooccurrences), and
the Euclidean distance you want is sqrt(B).
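
In code, the arithmetic would look something like this (just a sketch
with my own variable names, not the actual Mahout API):

// Sketch only. a = number of users who liked both items
// (cooccurrences); weightA/weightB = number of users who liked
// each item overall.
static double euclideanDistance(long a, long weightA, long weightB) {
  long b = weightA + weightB - 2 * a; // users who liked exactly one item
  return Math.sqrt(b);                // distance over the implied 0/1 vectors
}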

To get a similarity measure, you could use 1 / (1 + sqrt(B)). We use a
different formulation, A / (1 + sqrt(B)), which avoids penalizing two
vectors that have a lot of items in common; such vectors touch more
dimensions and would otherwise have more chance of ending up farther
apart.
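
In the same terms (again, just a sketch building on the method above):

static double similarity(long a, long weightA, long weightB) {
  double d = euclideanDistance(a, weightA, weightB);
  return a / (1.0 + d); // or 1.0 / (1.0 + d) for the plain formulation
}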

I'll point out that the TanimotoCoefficientSimilarity implementation,
which is also available to you, computes A / (A + B). Not the same, but
similar, so it may be something you could use.
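
In the same notation, that works out to intersection over union:

static double tanimoto(long a, long weightA, long weightB) {
  long b = weightA + weightB - 2 * a; // users who liked exactly one item
  return (double) a / (a + b);        // |intersection| / |union|
}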

But really, log-likelihood is a good default choice.


On Sun, May 8, 2011 at 7:23 PM, Thomas Söhngen <[email protected]> wrote:
> Is there a way to tell Mahout to "fill up" the user-item matrix with zeros
> when no rating is given for a user-item combination? I assume distance would
> become meaningful again then.
>
> Do you have any suggestions for scientific sources that would help in
> choosing an appropriate similarity function?
>
> Regards
> Thomas
>
>
> On 08.05.2011 19:57, Sean Owen wrote:
>>
>> All preferences are "1" in your world. Therefore user vectors are
>> always like (1,1,...,1). The distance between any two is 0, and the
>> similarity is 1. This metric is not appropriate for binary data. The
>> closest thing to what I think you want is the
>> TanimotoCoefficientSimilarity, but also try LogLikelihoodSimilarity.
>>
>> Yes, if you have a range of ratings, not just 1, it becomes meaningful
>> again to look at distance as a similarity metric.
>>
>> Sean
>>
>> On Sun, May 8, 2011 at 5:37 PM, Thomas Söhngen <[email protected]> wrote:
>>>
>>> Hello everyone,
>>>
>>> I am calculating similar items with the SIMILARITY_EUCLIDEAN_DISTANCE
>>> class. My input is binary data: users clicking a like button. The
>>> output only generates similarities with a score of "1". It doesn't
>>> mark every pair of items as similar, but for the pairs where it does
>>> find a similarity, the score is always "1". Why is this?
>>>
>>> I don't have this problem when I also add "dislike" information,
>>> with input lines "item_id,user_id,1" for a like interaction and
>>> "item_id,user_id,-1" for dislikes. The similarity then lies between
>>> 0 and 1.
>>>
>>> Regards and thanks for suggestions,
>>> Thomas
>>>
>
