It's not an issue of how to be careful with sparsity and subtracting means, although that's a valuable point in itself. The question is what the mean is supposed to be.
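To make the question concrete, here is a minimal sketch of the two candidate means: one taken over rated items only, the other counting unrated items as 0. The ratings are hypothetical, chosen only so that the two means come out as 4.5 and 3; they are not the data from the earlier example in this thread. The function names are mine and this is not the Mahout or Hadoop code. `pearson_corated` assumes the means are taken over the co-rated items.

```python
from math import sqrt

def mean_rated(ratings):
    """Mean over rated items only; None (unrated) entries are excluded."""
    rated = [r for r in ratings if r is not None]
    return sum(rated) / len(rated)

def mean_zero_filled(ratings):
    """Mean with unrated items counted as 0 (the behaviour in question)."""
    return sum(0.0 if r is None else r for r in ratings) / len(ratings)

def pearson_corated(x, y):
    """Pearson correlation restricted to items both users have rated."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")  # not enough co-rated items to correlate
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    vx = sum((a - mx) ** 2 for a, _ in pairs)
    vy = sum((b - my) ** 2 for _, b in pairs)
    if vx == 0 or vy == 0:
        return float("nan")  # a constant rater has no defined correlation
    return cov / sqrt(vx * vy)

# Hypothetical users: x has rated four of six items, y has rated all six.
x = [4.0, 5.0, 5.0, 4.0, None, None]
y = [3.0, 5.0, 4.0, 4.0, 2.0, 1.0]

print(mean_rated(x))        # 4.5
print(mean_zero_filled(x))  # 3.0
print(pearson_corated(x, y))
```

With zero-filling, the two unrated items pull the mean of x from 4.5 down to 3, as if this user disliked everything they had not seen.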
You can't think of missing ratings as 0 in general, and the example here shows why: you're acting as if most movies are hated. Instead, they are excluded from the computation entirely. m_x should be 4.5 in the example here. That's consistent with the literature and with the other implementations earlier in this project.

I don't know the Hadoop implementation well enough, and wasn't sure from the comments above, whether it ends up behaving as if the mean is "4.5" or "3". If it's not 4.5, I would call that a bug. Items that aren't co-rated can't meaningfully be included in this computation.

On Sun, Dec 1, 2013 at 8:29 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Good point Amit.
>
> Not sure how much this matters. It may be that
> PearsonCorrelationSimilarity is a bad name that should be
> PearsonInspiredCorrelationSimilarity. My guess is that this implementation
> is lifted directly from the very early recommendation literature and is
> reflective of the way that it was used back then.