Hi Amit,

Yes, it gives different results. In practice, however, most people don't
do rating prediction with the Pearson coefficient, but use count-based
measures like the log-likelihood ratio test.
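
For readers unfamiliar with the count-based approach, here is a minimal sketch of the standard log-likelihood ratio (G-test) statistic on a 2x2 co-occurrence table. It is an illustration only, not Mahout's code. The function name `llr` and the count names `k11`, `k12`, `k21`, `k22` are assumptions made for this example: `k11` is the number of users who interacted with both items, `k12` and `k21` are the users who interacted with only one of them, and `k22` is the users who interacted with neither.

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G-test) for a 2x2 contingency table.

    Computes 2 * sum(observed * ln(observed / expected)), where the
    expected count of a cell is row_total * col_total / grand_total.
    Empty cells contribute zero to the sum.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))

    score = 0.0
    for observed, r, c in cells:
        if observed > 0:
            expected = rows[r] * cols[c] / total
            score += observed * math.log(observed / expected)
    return 2.0 * score

# Independent items score near zero; items that always co-occur score high.
independent = llr(10, 10, 10, 10)
always_together = llr(10, 0, 0, 10)
```

Because the statistic only uses counts of who interacted with what, missing ratings never have to be filled in with a value.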

The distributed code doesn't look at vectors of different lengths; it
simply treats non-existent ratings as zero.
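
To make the difference concrete, here is a small sketch comparing Pearson correlation over co-rated users only with Pearson correlation over zero-filled vectors. It is not Mahout's implementation: the function names, the toy ratings, and the choice to zero-fill over the union of users are all assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for constant vectors
    return cov / math.sqrt(var_x * var_y)

def pearson_corated(a, b):
    """Use only users who rated both items."""
    users = sorted(set(a) & set(b))
    return pearson([a[u] for u in users], [b[u] for u in users])

def pearson_zero_filled(a, b):
    """Use every user who rated either item; a missing rating counts as 0."""
    users = sorted(set(a) | set(b))
    return pearson([a.get(u, 0.0) for u in users],
                   [b.get(u, 0.0) for u in users])

# Hypothetical item vectors: user id -> rating.
item_a = {"u1": 5.0, "u2": 3.0, "u3": 4.0}
item_b = {"u1": 4.0, "u2": 2.0, "u4": 5.0}

corated = pearson_corated(item_a, item_b)
zero_filled = pearson_zero_filled(item_a, item_b)
print(corated, zero_filled)
```

On this toy data the co-rated variant sees two users who agree perfectly, while the zero-filled variant also counts u3 and u4 as strong disagreements, so the two similarities even differ in sign.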

--sebastian

On 27.11.2013 16:09, Amit Nithian wrote:
> Comparing this against the non-distributed (Taste) implementation gives
> different answers for item-item similarity, since of course the
> non-distributed version looks only at co-rated items. I was more wondering
> whether this difference matters in practice.
> 
> Also, I'm confused about how you can compute the Pearson similarity between
> two vectors of different lengths, which I think is essentially what is
> going on here?
> 
> Thanks again
> Amit
> On Nov 27, 2013 9:06 AM, "Sebastian Schelter" <ssc.o...@googlemail.com>
> wrote:
> 
>> Yes, it is due to the parallel algorithm which only looks at co-ratings
>> from a given user.
>>
>>
>> On 27.11.2013 15:02, Amit Nithian wrote:
>>> Thanks Sebastian! Is there a particular reason for that?
>>> On Nov 27, 2013 7:47 AM, "Sebastian Schelter" <ssc.o...@googlemail.com>
>>> wrote:
>>>
>>>> Hi Amit,
>>>>
>>>> You are right, the non-corated items are not filtered out in the
>>>> distributed implementation.
>>>>
>>>> --sebastian
>>>>
>>>>
>>>> On 26.11.2013 20:51, Amit Nithian wrote:
>>>>> Hi all,
>>>>>
>>>>> Apologies if this is a repeat question as I just joined the list but I
>>>>> have a question about the way that metrics like Cosine and Pearson are
>>>>> calculated in Hadoop "mode" (i.e. non Taste).
>>>>>
>>>>> As far as I understand, the vectors used for computing pairwise item
>>>>> similarity in Taste are based on the co-rated items; however, in the
>>>>> Hadoop implementation, I don't see this done.
>>>>>
>>>>> The implementation of the distributed item-item similarity comes from
>>>>> this paper http://ssc.io/wp-content/uploads/2012/06/rec11-schelter.pdf.
>>>>> I didn't see anything in this paper about filtering out the elements of
>>>>> the vectors that are not co-rated, and this can make a difference,
>>>>> especially when you normalize the ratings by dividing by the average
>>>>> item rating. In some cases, the # users to divide by can be fewer
>>>>> depending on the sparseness of the vector.
>>>>>
>>>>> Any clarity on this would be helpful.
>>>>>
>>>>> Thanks!
>>>>> Amit
>>>>>
>>>>
>>>>
>>>
>>
>>
> 
