Hi,

I've noticed a problem in the non-Hadoop (Taste) version of the
recommender package. The problem is in the AbstractSimilarity class
(package org.apache.mahout.cf.taste.impl.similarity).

This class is the base class for computing the similarity values
between vectors of users or items. It assumes that the similarity
between the vectors is computed using only the commonly rated
items/users.

Consider the following two vectors:
V1: <_, 3, 4, _, 2>
V2: <3, 5, _, 2, 4>

where "_" means no rating. For these two vectors, the cosine or
Pearson similarity is computed on the following reduced vectors:

<3, 2>
<5, 4>
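
To make the behaviour concrete, here is a minimal sketch of a similarity
computed only over the commonly rated positions. It is not the actual
AbstractSimilarity code; the class name CoRatedSimilarity and the use of
Double.NaN to mark a missing rating are assumptions made for illustration.

```java
public class CoRatedSimilarity {

    // Cosine similarity restricted to positions rated in BOTH vectors.
    // Double.NaN marks a missing rating (an assumption of this sketch).
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i]) || Double.isNaN(b[i])) {
                continue; // skip positions that are not co-rated
            }
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Yields NaN (0/0) when there are no co-rated positions.
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Pearson correlation restricted to the co-rated positions.
    static double pearson(double[] a, double[] b) {
        int n = 0;
        double sumA = 0.0, sumB = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i]) || Double.isNaN(b[i])) {
                continue;
            }
            n++;
            sumA += a[i];
            sumB += b[i];
        }
        double meanA = sumA / n, meanB = sumB / n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i]) || Double.isNaN(b[i])) {
                continue;
            }
            double da = a[i] - meanA, db = b[i] - meanB;
            sxy += da * db;
            sxx += da * da;
            syy += db * db;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] v1 = {Double.NaN, 3, 4, Double.NaN, 2};
        double[] v2 = {3, 5, Double.NaN, 2, 4};
        System.out.println("cosine  = " + cosine(v1, v2));
        System.out.println("pearson = " + pearson(v1, v2));
    }
}
```

With only two co-rated positions, the Pearson value can only be +1 or -1
(or undefined), and the cosine here is about 0.996, even though the two
vectors disagree on most of what was rated.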

However, if the number of common ratings is small, the similarity
value is very unreliable. This is indeed the case in practice: if you
run the code on the MovieLens dataset and measure recall, the results
are very bad.

There are two possible solutions:
1. Add a parameter n that sets the minimum number of common ratings
needed to compute a similarity; with fewer than n common ratings, the
system should return NaN.
2. Compute the similarity using all the ratings. For the two vectors
above, the cosine similarity would be

(3*5+2*4)/(sqrt(3^2+4^2+2^2)*sqrt(3^2+5^2+2^2+4^2))
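
Both proposals could look roughly like the sketch below. This is only an
illustration, not a patch against Mahout: the class name
ProposedSimilarity, the method names, the parameter minCommon (the n from
option 1), and the use of Double.NaN for a missing rating are all
assumptions. Option 2 takes the dot product over the common ratings and
divides by the product of the norms of each vector's full set of ratings,
as cosine similarity is usually defined.

```java
public class ProposedSimilarity {

    // Number of positions rated in both vectors (NaN = missing rating).
    static int countCommon(double[] a, double[] b) {
        int common = 0;
        for (int i = 0; i < a.length; i++) {
            if (!Double.isNaN(a[i]) && !Double.isNaN(b[i])) {
                common++;
            }
        }
        return common;
    }

    // Option 1: cosine over co-rated positions, but only when at least
    // minCommon positions are co-rated; otherwise return NaN.
    static double cosineWithMinCommon(double[] a, double[] b, int minCommon) {
        if (countCommon(a, b) < minCommon) {
            return Double.NaN;
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i]) || Double.isNaN(b[i])) {
                continue;
            }
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Option 2: dot product over co-rated positions, norms over ALL
    // ratings of each vector (a missing rating contributes nothing).
    static double cosineAllRatings(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            boolean hasA = !Double.isNaN(a[i]);
            boolean hasB = !Double.isNaN(b[i]);
            if (hasA) {
                normA += a[i] * a[i];
            }
            if (hasB) {
                normB += b[i] * b[i];
            }
            if (hasA && hasB) {
                dot += a[i] * b[i];
            }
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] v1 = {Double.NaN, 3, 4, Double.NaN, 2};
        double[] v2 = {3, 5, Double.NaN, 2, 4};
        System.out.println("option 1, n=3: " + cosineWithMinCommon(v1, v2, 3));
        System.out.println("option 2:      " + cosineAllRatings(v1, v2));
    }
}
```

For V1 and V2, option 1 with n=3 returns NaN because only two ratings are
shared, while option 2 gives about 0.58 instead of about 0.996.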

Tevfik
