Hi Amit,

Yes, it gives different results. In practice, however, most people don't do rating prediction with the Pearson coefficient; they use count-based measures such as the log-likelihood ratio test.
The distributed code doesn't compare vectors of different lengths; it simply treats non-existent ratings as zero.

--sebastian

On 27.11.2013 16:09, Amit Nithian wrote:
> Comparing this against the non distributed (taste) gives different answers
> for item item similarity as of course the non distributed looks only at
> corated items. I was more wondering if this difference in practice mattered
> or not.
>
> Also I'm confused on how you can compute the Pearson similarity between two
> vectors of different length which essentially is going on here I think?
>
> Thanks again
> Amit
> On Nov 27, 2013 9:06 AM, "Sebastian Schelter" <ssc.o...@googlemail.com>
> wrote:
>
>> Yes, it is due to the parallel algorithm which only looks at co-ratings
>> from a given user.
>>
>>
>> On 27.11.2013 15:02, Amit Nithian wrote:
>>> Thanks Sebastian! Is there a particular reason for that?
>>> On Nov 27, 2013 7:47 AM, "Sebastian Schelter" <ssc.o...@googlemail.com>
>>> wrote:
>>>
>>>> Hi Amit,
>>>>
>>>> You are right, the non-corated items are not filtered out in the
>>>> distributed implementation.
>>>>
>>>> --sebastian
>>>>
>>>>
>>>> On 26.11.2013 20:51, Amit Nithian wrote:
>>>>> Hi all,
>>>>>
>>>>> Apologies if this is a repeat question as I just joined the list but I have
>>>>> a question about the way that metrics like Cosine and Pearson are
>>>>> calculated in Hadoop "mode" (i.e. non Taste).
>>>>>
>>>>> As far as I understand, the vectors used for computing pairwise item
>>>>> similarity in Taste are based on the co-rated items; however, in the Hadoop
>>>>> implementation, I don't see this done.
>>>>>
>>>>> The implementation of the distributed item-item similarity comes from this
>>>>> paper http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf. I didn't
>>>>> see anything in this paper about filtering out those elements from the
>>>>> vectors not co-rated and this can make a difference especially when you
>>>>> normalize the ratings by dividing by the average item rating. In some
>>>>> cases, the # users to divide by can be fewer depending on the sparseness of
>>>>> the vector.
>>>>>
>>>>> Any clarity on this would be helpful.
>>>>>
>>>>> Thanks!
>>>>> Amit
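
To make the difference discussed in this thread concrete, here is a minimal Python sketch. It is not Mahout code and does not reproduce either implementation; it only contrasts the two conventions described above: Pearson over co-rated users only, versus Pearson where a missing rating is treated as zero. The function names, the sample ratings, and the choice to zero-fill over the union of users who rated either item are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when the input is empty or either side has zero variance.
    """
    n = len(xs)
    if n == 0:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def pearson_corated(item_a, item_b):
    """Correlate only over users who rated both items."""
    users = sorted(item_a.keys() & item_b.keys())
    return pearson([item_a[u] for u in users], [item_b[u] for u in users])


def pearson_zero_filled(item_a, item_b):
    """Correlate over users who rated either item; a missing rating counts as 0.

    Restricting to the union of users is an assumption of this sketch.
    """
    users = sorted(item_a.keys() | item_b.keys())
    return pearson(
        [item_a.get(u, 0.0) for u in users],
        [item_b.get(u, 0.0) for u in users],
    )


# Hypothetical ratings, keyed by user id (illustrative data only).
item_a = {"u1": 5.0, "u2": 3.0, "u3": 1.0, "u4": 4.0}
item_b = {"u1": 4.0, "u2": 2.0, "u3": 1.0, "u5": 5.0}

print("co-rated only:", pearson_corated(item_a, item_b))
print("zero-filled:  ", pearson_zero_filled(item_a, item_b))
```

On this toy data the co-rated variant comes out strongly positive (about 0.98), while the zero-filled variant is negative (about -0.24), because the users who rated only one of the two items pull the correlation down. That sensitivity to sparsity is the effect asked about in the original question.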