Very helpful, Sebastian, thanks a lot. So now I understand it as: 1) compute a weight for each item vector with the specified similarity class, 2) collect the co-occurring values between each pair of vectors and compute the final similarity from the weights and those values.

Am I right? Thank you!
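To check my understanding, here is a quick non-distributed sketch in Java of the two phases for the Tanimoto case, using the toy data from your example below (all names here are just illustrative, not the actual Mahout API):

import java.util.HashMap;
import java.util.Map;

public class TanimotoSketch {

  // phase 1: the "weight" of an item vector is simply its number of ratings
  static int weight(Map<Integer, Double> ratings) {
    return ratings.size();
  }

  // phase 2: tanimoto = cooccurrences / (weightA + weightB - cooccurrences)
  static double similarity(int weightA, int weightB, int cooccurrences) {
    return (double) cooccurrences / (weightA + weightB - cooccurrences);
  }

  public static void main(String[] args) {
    // toy data: itemA = [user1: 3, user2: 3, user3: 5]
    Map<Integer, Double> itemA = new HashMap<Integer, Double>();
    itemA.put(1, 3.0);
    itemA.put(2, 3.0);
    itemA.put(3, 5.0);
    //           itemB = [user2: 4, user3: 2, user4: 1]
    Map<Integer, Double> itemB = new HashMap<Integer, Double>();
    itemB.put(2, 4.0);
    itemB.put(3, 2.0);
    itemB.put(4, 1.0);

    // users that rated both items: user2 and user3 -> 2 co-occurrences
    int cooccurrences = 0;
    for (Integer user : itemA.keySet()) {
      if (itemB.containsKey(user)) {
        cooccurrences++;
      }
    }

    // prints 0.5 = 2 / (3 + 3 - 2), matching the example below
    System.out.println(similarity(weight(itemA), weight(itemB), cooccurrences));
  }
}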
On 18 Jan 2011, at 16:33, Sebastian Schelter wrote:

> I'll give a short example of how RowSimilarityJob would work on toy data:
>
> Say we have two items and four users:
>
> the ratings for itemA are [user1: 3, user2: 3, user3: 5]
> the ratings for itemB are [user2: 4, user3: 2, user4: 1]
>
> Let's now assume we want to use the Tanimoto coefficient as the
> similarity measure for those items, which is computed by dividing the
> number of shared ratings by the overall number of ratings for both items.
>
> In the first step of RowSimilarityJob, each of the vectors would be
> passed to DistributedTanimotoCoefficientVectorSimilarity.weight(...),
> which would return the number of ratings for each:
>
> DistributedTanimotoCoefficientVectorSimilarity.weight(itemA) = 3
> DistributedTanimotoCoefficientVectorSimilarity.weight(itemB) = 3
>
> In the second step, all co-occurring ratings between the item vectors
> are collected: 3 and 4 from user2, and 5 and 2 from user3. These,
> together with the previously computed weights, are the input to
> DistributedTanimotoCoefficientVectorSimilarity.similarity(...). We now
> only need to count the number of co-occurring ratings, which is 2, and
> can compute the Tanimoto coefficient with that:
>
> tanimoto(itemA, itemB) = numberOfCooccurredRatings / (weight(itemA) +
> weight(itemB) - numberOfCooccurredRatings) = 2 / (3 + 3 - 2) = 0.5
>
> And here's how RowSimilarityJob would compute the Pearson correlation
> between both items:
>
> In the first step, our weight function simply returns NaN, as we don't
> need a weight for this computation.
>
> In the second step, we can compute the Pearson correlation by looking
> only at the co-occurring ratings for each item. The ratings for itemA
> were 3 and 5, the ratings for itemB were 4 and 2. If we subtract each
> one's average, we get the vectors (-1, 1) and (1, -1), for which the
> Pearson correlation gives -1. This is a reasonable result: itemA was
> rated one below average by user2 and one above average by user3, and
> for itemB it was exactly the other way round, so these items seem to be
> negatively correlated.
>
> I hope I could help you a little with that.
>
> --sebastian
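Just to be sure I follow the Pearson part as well: the second step on the co-occurring ratings would boil down to something like this (plain Java, just for my own understanding, not Mahout code):

public class PearsonSketch {

  // Pearson correlation over the co-occurring ratings only
  static double pearson(double[] x, double[] y) {
    double meanX = 0, meanY = 0;
    for (int i = 0; i < x.length; i++) {
      meanX += x[i];
      meanY += y[i];
    }
    meanX /= x.length;
    meanY /= y.length;
    double num = 0, denX = 0, denY = 0;
    for (int i = 0; i < x.length; i++) {
      double dx = x[i] - meanX;  // subtract each one's average
      double dy = y[i] - meanY;
      num += dx * dy;
      denX += dx * dx;
      denY += dy * dy;
    }
    return num / Math.sqrt(denX * denY);
  }

  public static void main(String[] args) {
    // co-occurring ratings from user2 and user3
    double[] itemA = { 3, 5 };  // centered: (-1, 1)
    double[] itemB = { 4, 2 };  // centered: (1, -1)
    System.out.println(pearson(itemA, itemB));  // prints -1.0
  }
}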
> On 18.01.2011 11:59, Stefano Bellasio wrote:
>> Thank you Sebastian, just some questions to be sure of everything (I'm
>> looking for RowSimilarityJob in my Mahout installation (0.4) but
>> without success; where can I find it?):
>> 1) So RowSimilarityJob uses a chosen similarity in the first step (for
>> example Euclidean distance)?
>> 2) Then, for each pair of items that has a co-occurrence in the
>> co-occurrence matrix, it computes a similarity value, with Euclidean
>> distance for example?
>>
>> I'm not sure about that, thank you again
>>
>> On 18 Jan 2011, at 11:32, Sebastian Schelter wrote:
>>
>>> Hi Stefano,
>>>
>>> AFAIK the chapter about distributed recommenders in Mahout in Action
>>> has not yet been updated to the latest version of RecommenderJob;
>>> maybe that's the source of your confusion.
>>>
>>> I'll try to give a brief explanation of the similarity computation;
>>> feel free to ask more questions if things don't get clear.
>>>
>>> RecommenderJob starts ItemSimilarityJob, which creates an item x user
>>> matrix from the preference data and uses RowSimilarityJob to compute
>>> the pairwise similarities of the rows of this matrix (the items). So
>>> the best place to start is looking at RowSimilarityJob.
>>>
>>> RowSimilarityJob uses an implementation of DistributedVectorSimilarity
>>> to compute the similarities in two phases. In the first phase, each
>>> item vector is shown to the similarity implementation, which can
>>> compute a "weight" for it. In the second phase, for all pairs of rows
>>> that have at least one co-occurrence, the method similarity(...) is
>>> called with the formerly computed weights and a list of all
>>> co-occurring values. This generic approach allows us to use different
>>> implementations of DistributedVectorSimilarity, so we can support a
>>> wide range of similarity functions.
>>>
>>> A simplified version of this algorithm is also explained in the slides
>>> of a talk I gave at the Hadoop Get Together; maybe that's helpful too:
>>> http://www.slideshare.net/sscdotopen/mahoutcf
>>>
>>> --sebastian
>>>
>>> On 18.01.2011 11:12, Stefano Bellasio wrote:
>>>> Hi guys, I'm trying to understand how RecommenderJob works. Until now
>>>> I thought it was necessary to choose a particular similarity class,
>>>> like Euclidean distance and so on, so that the algorithm could
>>>> compute all similarities for each pair of items and produce
>>>> recommendations. Reading Mahout in Action, "Distributing a
>>>> Recommender", I now have some doubts about the relation between
>>>> similarities like Euclidean, loglikelihood and cosine and the
>>>> co-occurrence matrix: as I see it, in RecommenderJob I can also
>>>> specify "Co-occurrence" as a similarity class. So what's the correct
>>>> way to compute similarities, and how does this work with the other
>>>> similarity classes and the co-occurrence matrix/similarity? Thank you
>>>> very much for your further explanations :)
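Putting it together, the generic two-phase contract you describe would look roughly like this, if I got it right (illustrative names only, not the real DistributedVectorSimilarity API):

import java.util.List;

// phase 1 computes a weight per row vector; phase 2 then sees only the
// two weights plus the co-occurring value pairs of a row pair
interface VectorSimilaritySketch {
  double weight(double[] rowVector);
  double similarity(double weightA, double weightB,
                    List<double[]> cooccurringValues);
}

// an illustrative Tanimoto-style implementation of that contract
class TanimotoLike implements VectorSimilaritySketch {
  public double weight(double[] rowVector) {
    // count the non-zero entries, i.e. the number of ratings
    int nonZeroEntries = 0;
    for (double v : rowVector) {
      if (v != 0.0) {
        nonZeroEntries++;
      }
    }
    return nonZeroEntries;
  }

  public double similarity(double weightA, double weightB,
                           List<double[]> cooccurringValues) {
    int cooccurrences = cooccurringValues.size();
    return cooccurrences / (weightA + weightB - cooccurrences);
  }
}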