I'll give a short example of how RowSimilarityJob would work on toy data:

Say we have two items and four users:

the ratings for itemA are [user1: 3, user2: 3, user3: 5]
the ratings for itemB are [user2: 4, user3: 2, user4: 1]

Let's now assume we want to use the tanimoto coefficient as similarity measure for those items. It is computed by dividing the number of users who rated both items by the number of users who rated either item (the size of the union of both rating sets).

In the first step of RowSimilarityJob each of the vectors would be passed to DistributedTanimotoCoefficientVectorSimilarity.weight(...) which would return the number of ratings for each:

DistributedTanimotoCoefficientVectorSimilarity.weight(itemA) = 3
DistributedTanimotoCoefficientVectorSimilarity.weight(itemB) = 3

In the second step, all cooccurring ratings between the item vectors are collected: 3 and 4 from user2, and 5 and 2 from user3. These, together with the previously computed weights, are the input to DistributedTanimotoCoefficientVectorSimilarity.similarity(...). We now only need to count the number of cooccurring ratings, which is 2, and can compute the tanimoto coefficient with that:

tanimoto(itemA, itemB) = numberOfCooccurredRatings / (weight(itemA) + weight(itemB) - numberOfCooccurredRatings) = 2 / (3 + 3 - 2) = 0.5
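The arithmetic above can be sketched in a few lines of Python. The function names loosely mirror the weight/similarity split of DistributedTanimotoCoefficientVectorSimilarity, but this is only an illustration of the math, not Mahout's actual API:

```python
# Toy data from the example: item -> {user: rating}
item_a = {"user1": 3, "user2": 3, "user3": 5}
item_b = {"user2": 4, "user3": 2, "user4": 1}

# Step 1: the "weight" of an item vector is simply its number of ratings.
def weight(item):
    return len(item)

# Step 2: count the cooccurring ratings (users who rated both items)
# and plug them into the tanimoto formula.
def tanimoto(a, b):
    cooccurred = len(set(a) & set(b))  # user2 and user3 -> 2
    return cooccurred / (weight(a) + weight(b) - cooccurred)

print(tanimoto(item_a, item_b))  # 2 / (3 + 3 - 2) = 0.5
```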


And here's how RowSimilarityJob would compute the pearson correlation between both items:

In the first step our weight function simply returns NaN, as we don't need a weight for this computation.

In the second step we can compute the pearson correlation by looking only at the cooccurring ratings for each item. The ratings for itemA were 3 and 5, the ratings for itemB were 4 and 2. If we subtract each item's average we get the vectors (-1, 1) and (1, -1), for which the pearson correlation is -1. This is a reasonable result: itemA was rated one below average by user2 and one above average by user3, and for itemB it was exactly the other way round, so these items seem to be negatively correlated.
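For completeness, here is the same Pearson computation spelled out in Python. Again this is just the textbook formula applied to the cooccurring ratings, not Mahout's implementation:

```python
# Cooccurring ratings only: user2 and user3 rated both items.
ratings_a = [3, 5]  # itemA's ratings from user2, user3
ratings_b = [4, 2]  # itemB's ratings from user2, user3

def pearson(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]  # (-1, 1) for itemA
    dy = [y - mean_y for y in ys]  # (1, -1) for itemB
    num = sum(a * b for a, b in zip(dx, dy))
    den = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return num / den

print(pearson(ratings_a, ratings_b))  # -1.0
```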

I hope I could help you a little with that.

--sebastian

On 18.01.2011 11:59, Stefano Bellasio wrote:
Thank you Sebastian, just some questions to be sure of everything (I'm looking 
for RowSimilarityJob in my Mahout installation (0.4) but without success, where 
can I find it?):
1) RowSimilarityJob uses a chosen similarity in the first step (for example 
Euclidean Distance)?
2) Then for each pair of items that has a co-occurrence in the co-occurrence 
matrix it computes a similarity value, with Euclidean Distance for example?

I'm not sure about that, thank you again

On 18.01.2011 11:32, Sebastian Schelter wrote:

Hi Stefano,

AFAIK the chapter about distributed recommenders in Mahout in Action has not 
yet been updated to the latest version of RecommenderJob; maybe that's the 
source of your confusion.

I'll try to give a brief explanation of the similarity computation; feel free 
to ask more questions if things aren't clear.

RecommenderJob starts ItemSimilarityJob, which creates an item x user matrix 
from the preference data and uses RowSimilarityJob to compute the pairwise 
similarities of the rows of this matrix (the items). So the best place to start 
is looking at RowSimilarityJob.

RowSimilarityJob uses an implementation of DistributedVectorSimilarity to compute the 
similarities in two phases. In the first phase each item vector is shown to the 
similarity implementation, which can compute a "weight" for it. In the second 
phase, for all pairs of rows that have at least one cooccurrence, the method 
similarity(...) is called with the formerly computed weights and a list of all 
cooccurring values. This generic approach allows us to plug in different implementations of 
DistributedVectorSimilarity, so we can support a wide range of similarity functions.
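To make the two-phase scheme concrete, here is a small non-distributed Python sketch. The class and method names echo the DistributedVectorSimilarity interface described above but are placeholders, not Mahout's real signatures:

```python
# Hypothetical sketch of the two-phase row-similarity scheme.
class TanimotoSimilarity:
    def weight(self, row):
        # Phase 1: one pass per row vector; for tanimoto the weight
        # is just the number of ratings in the row.
        return len(row)

    def similarity(self, cooccurred, weight_a, weight_b):
        # Phase 2: called once per pair with at least one cooccurrence,
        # receiving the cooccurring values and the precomputed weights.
        n = len(cooccurred)
        return n / (weight_a + weight_b - n)

def row_similarity(rows, sim):
    """rows: {item: {user: rating}}; returns {(itemA, itemB): similarity}."""
    weights = {k: sim.weight(v) for k, v in rows.items()}  # phase 1
    results = {}
    for a in rows:
        for b in rows:
            if a < b:  # each unordered pair once
                shared = [(rows[a][u], rows[b][u])
                          for u in rows[a] if u in rows[b]]
                if shared:  # at least one cooccurrence
                    results[(a, b)] = sim.similarity(
                        shared, weights[a], weights[b])
    return results

rows = {"itemA": {"user1": 3, "user2": 3, "user3": 5},
        "itemB": {"user2": 4, "user3": 2, "user4": 1}}
print(row_similarity(rows, TanimotoSimilarity()))
# {('itemA', 'itemB'): 0.5}
```

Swapping in another similarity is just a matter of providing a different weight/similarity pair, which is the point of the generic design.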

A simplified version of this algorithm is also explained in the slides of a 
talk I gave at the Hadoop Get Together, maybe that's helpful too: 
http://www.slideshare.net/sscdotopen/mahoutcf

--sebastian



On 18.01.2011 11:12, Stefano Bellasio wrote:
Hi guys, I'm trying to understand how RecommenderJob works. Right now I was thinking it was 
necessary to choose a particular similarity class like Euclidean Distance and so on, so my algorithm 
could compute all similarities for each pair of items and produce recommendations. Reading Mahout 
in Action, "Distributing a Recommender", I now have some doubts about the relation 
between similarities like Euclidean, LogLike, Cosine and the co-occurrence matrix. As I see in 
RecommenderJob I can also specify "Co-occurrence" as a similarity class, so what's the 
correct way to compute similarities, and how does this happen with the other similarity classes and 
the co-occurrence matrix/similarity? Thank you very much for your further explanations :)
