Hey Sebastian,

Thanks again for the explanation. So now you have me intrigued about
something else: why is the log-likelihood ratio test a better measure for
essentially implicit ratings? Are there resources/research papers you can
point me to explaining this?
Take care
Amit

On Sun, Dec 1, 2013 at 9:25 AM, Sebastian Schelter <ssc.o...@googlemail.com> wrote:

> Hi Amit,
>
> No need to apologize for picking on me, I'm happy about anyone digging
> into the paper :)
>
> The reason I implemented Pearson in this (flawed) way has to do with the
> way the parallel algorithm works:
>
> It never compares two item vectors in memory; instead it preprocesses
> the vectors and computes sparse dot products in parallel. The centering
> which is usually done for Pearson correlation depends on which pair of
> vectors you're currently looking at (and doesn't fit the parallel
> algorithm). We had an earlier implementation that didn't have this flaw,
> but it was way slower than the current one.
>
> Rating prediction on explicit feedback data (like ratings), for which
> Pearson correlation is mostly used in CF, is a rather academic topic,
> and in science there are nearly no datasets that really require you to
> go to Hadoop.
>
> On the other hand, item prediction on implicit feedback data (like
> clicks) is the common scenario in the majority of industry use cases,
> and here count-based similarity measures like the log-likelihood ratio
> test give much better results. The current implementation of Mahout's
> distributed itembased recommender is clearly designed and tuned for the
> latter use case.
>
> I hope that answers your question.
>
> --sebastian
>
> On 01.12.2013 18:10, Amit Nithian wrote:
> > Thanks guys! So the real question is not so much what's the average of
> > the vector with the missing rating (although yes, that was a question)
> > but what's the average of the vector with all the ratings specified
> > but the second rating that is not shared with the first user:
> > [5 - 4] vs [4 5 2].
> >
> > If we agree that the first is 4.5, then is the second one 11/3 or 3
> > ((4+2)/2)? Taste has this as ((4+2)/2) while distributed mode has it
> > as 11/3.
> >
> > Since Taste (and Lenskit) is sequential, it can (and will) only look
> > at co-occurring ratings, whereas the Hadoop implementation doesn't.
> > The paper that Sebastian wrote has a preprocessing step where (for
> > Pearson) you subtract the average rating from each element of an
> > item-rating vector, which implies that each item-rating vector is
> > treated independently of the others, whereas in the
> > sequential/non-distributed mode it's all considered together.
> >
> > My main reason for posting is that the Taste implementation of
> > item-item similarity differs from the distributed implementation.
> > Since I am totally new to this space and these similarities, I wanted
> > to understand whether there is a reason for this difference and
> > whether or not it matters. Sounds like from the discussion it doesn't
> > matter, but understanding why helps me explain this to others.
> >
> > My guess (and I'm glad Sebastian is on this list so he can help
> > confirm/deny this.. sorry, I'm not picking on you, just happy to be
> > able to talk to you about your good paper) is that considering
> > co-occurring ratings in a distributed implementation would require
> > access to the full matrix, which defeats the parallel nature of
> > computing item-item similarity?
> >
> > Thanks again!
> > Amit
> >
> >
> > On Sun, Dec 1, 2013 at 2:55 AM, Sean Owen <sro...@gmail.com> wrote:
> >
> >> It's not an issue of how to be careful with sparsity and subtracting
> >> means, although that's a valuable point in itself. The question is
> >> what the mean is supposed to be.
> >>
> >> You can't think of missing ratings as 0 in general, and the example
> >> here shows why: you're acting as if most movies are hated. Instead,
> >> they are excluded from the computation entirely.
> >>
> >> m_x should be 4.5 in the example here. That's consistent with the
> >> literature and the other implementations earlier in this project.
> >>
> >> I don't know the Hadoop implementation well enough, and wasn't sure
> >> from the comments above, whether it ends up behaving as if the mean
> >> is "4.5" or "3". If it's not 4.5, I would call that a bug. Items that
> >> aren't co-rated can't meaningfully be included in this computation.
> >>
> >>
> >> On Sun, Dec 1, 2013 at 8:29 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >>> Good point, Amit.
> >>>
> >>> Not sure how much this matters. It may be that
> >>> PearsonCorrelationSimilarity is a bad name that should be
> >>> PearsonInspiredCorrelationSimilarity. My guess is that this
> >>> implementation is lifted directly from the very early recommendation
> >>> literature and is reflective of the way that it was used back then.
> >>
> >
> >
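To make the [5 - 4] vs [4 5 2] discrepancy above concrete, here is a small
standalone Java sketch that computes Pearson correlation both ways:
centering over co-rated users only (what Taste does), and pre-centering
each item vector by its own full mean before taking a sparse dot product
(one plausible reading of the distributed preprocessing described above).
The class and method names are illustrative, not Mahout's actual code, and
the norms in the pre-centered variant are assumed to be taken over all
observed entries.

import java.util.Map;

public class PearsonVariants {

    // Item vectors from the example, as user -> rating maps:
    // itemA = [5, -, 4], itemB = [4, 5, 2]
    static final Map<Integer, Double> ITEM_A = Map.of(1, 5.0, 3, 4.0);
    static final Map<Integer, Double> ITEM_B = Map.of(1, 4.0, 2, 5.0, 3, 2.0);

    // Taste-style: mean and centering restricted to co-rated users,
    // so the mean of itemB here is (4 + 2) / 2 = 3.
    static double pearsonCoRated(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sumA = 0, sumB = 0;
        int n = 0;
        for (Integer user : a.keySet()) {
            if (b.containsKey(user)) {
                sumA += a.get(user);
                sumB += b.get(user);
                n++;
            }
        }
        double meanA = sumA / n;
        double meanB = sumB / n;
        double dot = 0, normA = 0, normB = 0;
        for (Integer user : a.keySet()) {
            if (b.containsKey(user)) {
                double da = a.get(user) - meanA;
                double db = b.get(user) - meanB;
                dot += da * db;
                normA += da * da;
                normB += db * db;
            }
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Distributed-style: each vector is pre-centered by its own full mean
    // (so the mean of itemB is 11/3), then only a sparse dot product over
    // shared users is taken.
    static double pearsonPreCentered(Map<Integer, Double> a, Map<Integer, Double> b) {
        double meanA = a.values().stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double meanB = b.values().stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double dot = 0;
        for (Integer user : a.keySet()) {
            if (b.containsKey(user)) {
                dot += (a.get(user) - meanA) * (b.get(user) - meanB);
            }
        }
        double normA = a.values().stream().mapToDouble(v -> (v - meanA) * (v - meanA)).sum();
        double normB = b.values().stream().mapToDouble(v -> (v - meanB) * (v - meanB)).sum();
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        System.out.println("co-rated centering:   " + pearsonCoRated(ITEM_A, ITEM_B));     // 1.0
        System.out.println("pre-centered vectors: " + pearsonPreCentered(ITEM_A, ITEM_B)); // ~0.65
    }
}

Run as-is, the co-rated variant reports a correlation of 1.0 while the
pre-centered variant reports roughly 0.65 for the same pair of item
vectors, which is exactly the kind of divergence the thread is about.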
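As for the log-likelihood ratio test itself: it scores a pair of items
from a 2x2 table of user counts, rewarding co-occurrence that is
surprising given how popular each item is on its own, and it needs no
ratings or means at all, which is why it sidesteps the centering question
entirely on implicit data. The standard reference is Ted Dunning's 1993
paper "Accurate Methods for the Statistics of Surprise and Coincidence".
Below is a minimal standalone sketch of the G^2 statistic from that paper
(the same statistic behind Mahout's log-likelihood similarity; this is
illustrative code, not Mahout's own):

public class LlrSketch {

    // x * ln(x), with the convention that 0 * ln(0) = 0
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy over raw counts:
    // (sum) ln(sum) - sum of x ln(x)
    static double entropy(long... counts) {
        long sum = 0;
        double xlx = 0.0;
        for (long x : counts) {
            sum += x;
            xlx += xLogX(x);
        }
        return xLogX(sum) - xlx;
    }

    // k11: users who interacted with both items
    // k12: users who interacted with item A but not item B
    // k21: users who interacted with item B but not item A
    // k22: users who interacted with neither
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    public static void main(String[] args) {
        // Out of 1000 users, each item was clicked by 100, and 50 clicked
        // both. Under independence only ~10 co-clicks would be expected,
        // so the score comes out large (~125): a strong association.
        System.out.println(logLikelihoodRatio(50, 50, 50, 850));
    }
}

A large score means the co-occurrence count is far from what item
popularity alone would predict; pairs of merely popular items score low,
which is a big part of why count-based measures work well on clicks.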