Very helpful, Sebastian, thanks a lot. So now I understand it as: 1) compute a weight for each item vector with the specified similarity class, 2) collect the co-occurring values between each pair of vectors and compute the final similarity from the weights and those values.

Am I right? Thank you!
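To check my understanding, here is a quick non-distributed sketch in Java of the two phases for the Tanimoto case, using the toy data from your example below (all names here are just illustrative, not the actual Mahout API):

import java.util.HashMap;
import java.util.Map;

public class TanimotoSketch {

  // phase 1: the "weight" of an item vector is simply its number of ratings
  static int weight(Map<Integer, Double> ratings) {
    return ratings.size();
  }

  // phase 2: tanimoto = cooccurrences / (weightA + weightB - cooccurrences)
  static double similarity(int weightA, int weightB, int cooccurrences) {
    return (double) cooccurrences / (weightA + weightB - cooccurrences);
  }

  public static void main(String[] args) {
    // toy data: itemA = [user1: 3, user2: 3, user3: 5]
    Map<Integer, Double> itemA = new HashMap<Integer, Double>();
    itemA.put(1, 3.0);
    itemA.put(2, 3.0);
    itemA.put(3, 5.0);
    //           itemB = [user2: 4, user3: 2, user4: 1]
    Map<Integer, Double> itemB = new HashMap<Integer, Double>();
    itemB.put(2, 4.0);
    itemB.put(3, 2.0);
    itemB.put(4, 1.0);

    // users that rated both items: user2 and user3 -> 2 co-occurrences
    int cooccurrences = 0;
    for (Integer user : itemA.keySet()) {
      if (itemB.containsKey(user)) {
        cooccurrences++;
      }
    }

    // prints 0.5 = 2 / (3 + 3 - 2), matching the example below
    System.out.println(similarity(weight(itemA), weight(itemB), cooccurrences));
  }
}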
On 18 Jan 2011, at 16:33, Sebastian Schelter wrote:

> I'll give a short example of how RowSimilarityJob would work on toy data:
>
> Say we have two items and four users:
>
> the ratings for itemA are [user1: 3, user2: 3, user3: 5]
> the ratings for itemB are [user2: 4, user3: 2, user4: 1]
>
> Let's now assume we want to use the Tanimoto coefficient as the
> similarity measure for those items, which is computed by dividing the
> number of shared ratings by the overall number of ratings for both items.
>
> In the first step of RowSimilarityJob, each of the vectors would be
> passed to DistributedTanimotoCoefficientVectorSimilarity.weight(...),
> which would return the number of ratings for each:
>
> DistributedTanimotoCoefficientVectorSimilarity.weight(itemA) = 3
> DistributedTanimotoCoefficientVectorSimilarity.weight(itemB) = 3
>
> In the second step, all co-occurring ratings between the item vectors
> are collected: 3 and 4 from user2, and 5 and 2 from user3. These,
> together with the previously computed weights, are the input to
> DistributedTanimotoCoefficientVectorSimilarity.similarity(...). We now
> only need to count the number of co-occurring ratings, which is 2, and
> can compute the Tanimoto coefficient with that:
>
> tanimoto(itemA, itemB) = numberOfCooccurredRatings / (weight(itemA) +
> weight(itemB) - numberOfCooccurredRatings) = 2 / (3 + 3 - 2) = 0.5
>
> And here's how RowSimilarityJob would compute the Pearson correlation
> between both items:
>
> In the first step, our weight function simply returns NaN, as we don't
> need a weight for this computation.
>
> In the second step, we can compute the Pearson correlation by looking
> only at the co-occurring ratings for each item. The ratings for itemA
> were 3 and 5, the ratings for itemB were 4 and 2. If we subtract each
> one's average, we get the vectors (-1, 1) and (1, -1), for which the
> Pearson correlation gives -1. This is a reasonable result: itemA was
> rated one below average by user2 and one above average by user3, and
> for itemB it was exactly the other way round, so these items seem to be
> negatively correlated.
>
> I hope I could help you a little with that.
>
> --sebastian
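Just to be sure I follow the Pearson part as well: the second step on the co-occurring ratings would boil down to something like this (plain Java, just for my own understanding, not Mahout code):

public class PearsonSketch {

  // Pearson correlation over the co-occurring ratings only
  static double pearson(double[] x, double[] y) {
    double meanX = 0, meanY = 0;
    for (int i = 0; i < x.length; i++) {
      meanX += x[i];
      meanY += y[i];
    }
    meanX /= x.length;
    meanY /= y.length;
    double num = 0, denX = 0, denY = 0;
    for (int i = 0; i < x.length; i++) {
      double dx = x[i] - meanX;  // subtract each one's average
      double dy = y[i] - meanY;
      num += dx * dy;
      denX += dx * dx;
      denY += dy * dy;
    }
    return num / Math.sqrt(denX * denY);
  }

  public static void main(String[] args) {
    // co-occurring ratings from user2 and user3
    double[] itemA = { 3, 5 };  // centered: (-1, 1)
    double[] itemB = { 4, 2 };  // centered: (1, -1)
    System.out.println(pearson(itemA, itemB));  // prints -1.0
  }
}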
> On 18.01.2011 11:59, Stefano Bellasio wrote:
>> Thank you Sebastian, just some questions to be sure of everything (I'm
>> looking for RowSimilarityJob in my Mahout installation (0.4) but
>> without success; where can I find it?):
>> 1) So RowSimilarityJob uses a chosen similarity in the first step (for
>> example Euclidean distance)?
>> 2) Then, for each pair of items that has a co-occurrence in the
>> co-occurrence matrix, it computes a similarity value, with Euclidean
>> distance for example?
>>
>> I'm not sure about that, thank you again
>>
>> On 18 Jan 2011, at 11:32, Sebastian Schelter wrote:
>>
>>> Hi Stefano,
>>>
>>> AFAIK the chapter about distributed recommenders in Mahout in Action
>>> has not yet been updated to the latest version of RecommenderJob;
>>> maybe that's the source of your confusion.
>>>
>>> I'll try to give a brief explanation of the similarity computation;
>>> feel free to ask more questions if things don't get clear.
>>>
>>> RecommenderJob starts ItemSimilarityJob, which creates an item x user
>>> matrix from the preference data and uses RowSimilarityJob to compute
>>> the pairwise similarities of the rows of this matrix (the items). So
>>> the best place to start is looking at RowSimilarityJob.
>>>
>>> RowSimilarityJob uses an implementation of DistributedVectorSimilarity
>>> to compute the similarities in two phases. In the first phase, each
>>> item vector is shown to the similarity implementation, which can
>>> compute a "weight" for it. In the second phase, for all pairs of rows
>>> that have at least one co-occurrence, the method similarity(...) is
>>> called with the formerly computed weights and a list of all
>>> co-occurring values. This generic approach allows us to use different
>>> implementations of DistributedVectorSimilarity, so we can support a
>>> wide range of similarity functions.
>>>
>>> A simplified version of this algorithm is also explained in the slides
>>> of a talk I gave at the Hadoop Get Together; maybe that's helpful too:
>>> http://www.slideshare.net/sscdotopen/mahoutcf
>>>
>>> --sebastian
>>>
>>> On 18.01.2011 11:12, Stefano Bellasio wrote:
>>>> Hi guys, I'm trying to understand how RecommenderJob works. Until now
>>>> I thought it was necessary to choose a particular similarity class,
>>>> like Euclidean distance and so on, so that the algorithm could
>>>> compute all similarities for each pair of items and produce
>>>> recommendations. Reading Mahout in Action, "Distributing a
>>>> Recommender", I now have some doubts about the relation between
>>>> similarities like Euclidean, loglikelihood and cosine and the
>>>> co-occurrence matrix: as I see it, in RecommenderJob I can also
>>>> specify "Co-occurrence" as a similarity class. So what's the correct
>>>> way to compute similarities, and how does this work with the other
>>>> similarity classes and the co-occurrence matrix/similarity? Thank you
>>>> very much for your further explanations :)
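Putting it together, the generic two-phase contract you describe would look roughly like this, if I got it right (illustrative names only, not the real DistributedVectorSimilarity API):

import java.util.List;

// phase 1 computes a weight per row vector; phase 2 then sees only the
// two weights plus the co-occurring value pairs of a row pair
interface VectorSimilaritySketch {
  double weight(double[] rowVector);
  double similarity(double weightA, double weightB,
                    List<double[]> cooccurringValues);
}

// an illustrative Tanimoto-style implementation of that contract
class TanimotoLike implements VectorSimilaritySketch {
  public double weight(double[] rowVector) {
    // count the non-zero entries, i.e. the number of ratings
    int nonZeroEntries = 0;
    for (double v : rowVector) {
      if (v != 0.0) {
        nonZeroEntries++;
      }
    }
    return nonZeroEntries;
  }

  public double similarity(double weightA, double weightB,
                           List<double[]> cooccurringValues) {
    int cooccurrences = cooccurringValues.size();
    return cooccurrences / (weightA + weightB - cooccurrences);
  }
}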