I'll give a short example of how RowSimilarityJob would work on toy data.
Say we have two items and four users:
the ratings for itemA are [user1: 3, user2: 3, user3: 5]
the ratings for itemB are [user2: 4, user3: 2, user4: 1]
Let's now assume we want to use the Tanimoto coefficient as the similarity
measure for those items. It is computed by dividing the number of
shared ratings by the overall number of ratings for both items.
In the first step of RowSimilarityJob, each of the vectors would be
passed to DistributedTanimotoCoefficientVectorSimilarity.weight(...),
which would return the number of ratings for each:
DistributedTanimotoCoefficientVectorSimilarity.weight(itemA) = 3
DistributedTanimotoCoefficientVectorSimilarity.weight(itemB) = 3
In the second step, all cooccurring ratings between the item vectors are
collected: 3 and 4 from user2, and 5 and 2 from user3. These, together
with the previously computed weights, are the input to
DistributedTanimotoCoefficientVectorSimilarity.similarity(...). We now
only need to count the cooccurring ratings, which gives 2, and can
compute the Tanimoto coefficient with that:

tanimoto(itemA, itemB)
  = numberOfCooccurredRatings / (weight(itemA) + weight(itemB) - numberOfCooccurredRatings)
  = 2 / (3 + 3 - 2) = 0.5
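The arithmetic above can be sketched in a few lines of Python (just an
illustration of the computation, not Mahout's actual Java code; the
variable and function names here are made up):

```python
# Toy ratings from the example above (user -> rating).
item_a = {"user1": 3, "user2": 3, "user3": 5}
item_b = {"user2": 4, "user3": 2, "user4": 1}

def weight(item):
    # First step: the Tanimoto "weight" is simply the number of ratings.
    return len(item)

def tanimoto(a, b):
    # Second step: count the cooccurring ratings and apply the formula.
    cooccurrences = len(set(a) & set(b))
    return cooccurrences / (weight(a) + weight(b) - cooccurrences)

print(tanimoto(item_a, item_b))  # 2 / (3 + 3 - 2) = 0.5
```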
And here's how RowSimilarityJob would compute the Pearson correlation
between both items:
In the first step, our weight function simply returns NaN, as we don't
need a weight for this computation.
In the second step we can compute the Pearson correlation by looking
only at the cooccurring ratings for each item. The ratings for itemA
were 3 and 5, and the ratings for itemB were 4 and 2. If we subtract
each item's average, we get the vectors (-1, 1) and (1, -1), for which
the Pearson correlation gives -1. This is a reasonable result: itemA was
rated one below average by user2 and one above average by user3, and
for itemB it was exactly the other way round, so these items seem to be
negatively correlated.
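That computation can also be sketched in Python (again just an
illustration of the math on the toy data, not Mahout code):

```python
import math

# Cooccurring ratings only: user2 and user3 rated both items.
ratings_a = [3, 5]
ratings_b = [4, 2]

def pearson(xs, ys):
    # Center each series on its mean, then divide the dot product
    # of the centered vectors by the product of their norms.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cx = [x - mean_x for x in xs]
    cy = [y - mean_y for y in ys]
    dot = sum(a * b for a, b in zip(cx, cy))
    norm = math.sqrt(sum(a * a for a in cx) * sum(b * b for b in cy))
    return dot / norm

print(pearson(ratings_a, ratings_b))  # -1.0
```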
I hope I could help you a little with that.
--sebastian
On 18.01.2011 11:59, Stefano Bellasio wrote:
Thank you Sebastian, just some questions to be sure I've understood everything
(I'm looking for RowSimilarityJob in my Mahout installation (0.4) but without
success; where can I find it?):
1) RowSimilarityJob uses a chosen similarity in the first step (for example
Euclidean distance)?
2) Then for each pair of items that has a co-occurrence in the co-occurrence
matrix it computes a similarity value, with Euclidean distance for example?
I'm not sure about that, thank you again
On 18 Jan 2011, at 11:32, Sebastian Schelter wrote:
Hi Stefano,
AFAIK the chapter about distributed recommenders in Mahout in Action has not
yet been updated to the latest version of RecommenderJob; maybe that's the
source of your confusion.
I'll try to give a brief explanation of the similarity computation; feel free
to ask more questions if things remain unclear.
RecommenderJob starts ItemSimilarityJob which creates an item x user matrix
from the preference data and uses RowSimilarityJob to compute the pairwise
similarities of the rows of this matrix (the items). So the best place to start
is looking at RowSimilarityJob.
RowSimilarityJob uses an implementation of DistributedVectorSimilarity to
compute the similarities in two phases. In the first phase, each item vector is
shown to the similarity implementation, which can compute a "weight" for it. In
the second phase, for all pairs of rows that have at least one cooccurrence,
the method similarity(...) is called with the previously computed weights and a
list of all cooccurring values. This generic approach allows us to plug in
different implementations of DistributedVectorSimilarity, so we can support a
wide range of similarity functions.
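To make the two-phase contract concrete, here is a small Python sketch of the
pattern (the class and function names are invented for illustration and do not
mirror Mahout's actual Java API):

```python
class TanimotoSimilarity:
    # Phase one: precompute a per-row weight (here, the number of entries).
    def weight(self, row):
        return len(row)

    # Phase two: combine the weights with the cooccurring value pairs.
    def similarity(self, weight_a, weight_b, cooccurring_pairs):
        n = len(cooccurring_pairs)
        return n / (weight_a + weight_b - n)

class CooccurrenceCountSimilarity:
    # A different plug-in: the "similarity" is just the cooccurrence count,
    # so no weight is needed.
    def weight(self, row):
        return float("nan")

    def similarity(self, weight_a, weight_b, cooccurring_pairs):
        return len(cooccurring_pairs)

def row_similarity(similarity, row_a, row_b):
    # Simplified driver: collect the cooccurring values, then delegate
    # to the pluggable similarity implementation.
    pairs = [(row_a[k], row_b[k]) for k in row_a if k in row_b]
    return similarity.similarity(similarity.weight(row_a),
                                 similarity.weight(row_b), pairs)

item_a = {"user1": 3, "user2": 3, "user3": 5}
item_b = {"user2": 4, "user3": 2, "user4": 1}
print(row_similarity(TanimotoSimilarity(), item_a, item_b))           # 0.5
print(row_similarity(CooccurrenceCountSimilarity(), item_a, item_b))  # 2
```

Swapping one similarity object for another changes the measure without touching
the driver, which is the point of the generic two-phase design.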
A simplified version of this algorithm is also explained in the slides of a
talk I gave at the Hadoop Get Together, maybe that's helpful too:
http://www.slideshare.net/sscdotopen/mahoutcf
--sebastian
On 18.01.2011 11:12, Stefano Bellasio wrote:
Hi guys, I'm trying to understand how RecommenderJob works. Until now I thought
it was necessary to choose a particular similarity class like Euclidean distance
and so on, so that my algorithm could compute all similarities for each pair of
items and produce recommendations. Reading Mahout in Action, "Distributing a
Recommender", I now have some doubts about the relation between similarities
like Euclidean, LogLike, Cosine and the co-occurrence matrix. As I see in
RecommenderJob, I can also specify "Co-occurrence" as a similarity class, so
what's the correct way to compute similarities, and how does this work with the
other similarity classes and the co-occurrence matrix/similarity? Thank you very
much for your further explanations :)