That's right. I used to have separate implementations. This might be a good question to put to the experts: based on my understanding of the issues, it seems a bit better to compute the cosine measure on centered (mean = 0) data, but I wonder if there are good arguments for not doing this?
Instead of computing the cosine of the angle between the centered user vectors A and B, the uncentered version computes the cosine of the angle between A + m_A and B + m_B, where m_A and m_B are constant vectors whose entries are user A's and user B's average rating, respectively. Adding the means pushes the endpoints of A and B out in the direction of (1,1,...,1) (or (-1,-1,...,-1) when the means are negative). This narrows the angle and makes similarities tend towards 1. In fact, when ratings are in, say, [0,10], all coordinates are nonnegative, so the angle never gets past 90 degrees and the similarity lies in [0,1], not even [-1,1]. It feels like that loses a bit of dynamic range, but that's got to be a minor issue.

Put another way: the greater the average preference, the less a difference in preference matters to the similarity measure. The one aspect of this I don't like is that differences in preference at the small end of the range matter much more, which doesn't seem intuitively right. The similarity between users who rated two movies (0,1) and (1,0) is as low as possible, 0, while the similarity between users who rated the movies (9,10) and (10,9) is nearly 1 (180/181). Yet in both cases the two users rated the movies quite similarly on a scale of 0 to 10. With centering, the two results would have been identical (both -1). (See the small sketch at the end of this message.)

On Wed, Mar 3, 2010 at 12:19 AM, Tamas Jambor <[email protected]> wrote:
> Sure, if you center the data then they are identical, but the uncentered
> cosine similarity is quite different, as far as I know.
>
> T
>
> On 02/03/2010 22:55, Sean Owen wrote:
>>
>> Yes, that's also the Pearson-correlation-based one, since it forces
>> the data to be 'centered' (mean of 0) during the computation. In that
>> case they are actually identical.
>>
>> On Tue, Mar 2, 2010 at 10:47 PM, Tamas Jambor <[email protected]> wrote:
>>
>>>
>>> Thanks, that makes sense. Which one would be cosine similarity? Do you
>>> have that implemented?
>>>
>
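P.S. For concreteness, here is a small sketch of the two computations on the example pairs above. This is just illustrative Python, not the Mahout implementation; it assumes non-zero, non-constant vectors so the norms never vanish.

import math

def cosine(a, b):
    # Cosine of the angle between vectors a and b.
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centered(v):
    # Subtract the mean so the entries of v average to 0.
    m = sum(v) / len(v)
    return [x - m for x in v]

# Uncentered cosine: disagreement at the low end of the rating
# scale is punished far more than the same disagreement high up.
print(cosine([0, 1], [1, 0]))    # 0.0
print(cosine([9, 10], [10, 9]))  # 0.9944... (= 180/181)

# Centered cosine (the Pearson-correlation-like version):
# both pairs come out identical.
print(cosine(centered([0, 1]), centered([1, 0])))    # -1.0
print(cosine(centered([9, 10]), centered([10, 9])))  # -1.0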
