Actually, the Mahout code does address some of these issues:

On Mon, Dec 27, 2010 at 3:37 PM, Lance Norskog <[email protected]> wrote:
> Different people watch different numbers of movies.

This is no problem, except that with fewer movies rated (or watched, if you
are using implicit feedback) the results are less certain.

> They also rate some but not all.

Again, not a problem.

> Their recommendations may be in one or a few clusters (other clusterings
> can be genre, which day of the week the rating was made, on and on) or
> may be scattered all over genres (Harry Potter & British comedy &
> European soft-core 70's porn).

This isn't a problem, except insofar as recommendations are a portfolio:
getting non-zero click-through on the set of recommendations is typically
what you want, but most recommendation systems optimize the expected number
of clicks. These are not the same thing, because clicks can correlate, and
it helps to hedge your bets by increasing the diversity of the recommended
set. This is usually handled in an ad hoc fashion.

> Evaluating the worth of user X's ratings is also important.

Not sure what you mean by this. There are effectively several options for
this in Mahout.

> If you want to interpret the ratings in an absolute number system, you
> want to map the incoming ratings because they may average at 7.

Not sure what you mean by this. If ratings are limited to a particular
range, then the average can't fall outside that range. You may indeed want
to subtract each user's mean rating before building the recommendation data
and add the mean back for the user being recommended. Item means may be
treated the same way. This is equivalent to subtracting a rank-1
approximation of the ratings matrix derived using an SVD.

> The code in Mahout doesn't address these issues.

Hmmm... I think it does. Perhaps Sean can comment.

Moving to Otis' comments:

> On Mon, Dec 27, 2010 at 6:54 AM, Otis Gospodnetic
> <[email protected]> wrote:
> > Hi,
> >
> > I was wondering how people evaluate the quality of recommendations
> > other than RMSE and such in eval package.
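To make the mean-centering idea concrete, here is a minimal Python sketch.
The dict-of-dicts `ratings` layout and the function names are mine for
illustration, not Mahout's API:

```python
# Minimal sketch of per-user mean-centering: subtract each user's mean
# rating before building the model, then add the mean back when predicting
# for that user. Data layout and names are hypothetical, not Mahout's API.

def center_by_user(ratings):
    """Return (centered_ratings, user_means) for {user: {item: rating}}."""
    user_means = {
        user: sum(items.values()) / len(items)
        for user, items in ratings.items()
    }
    centered = {
        user: {item: r - user_means[user] for item, r in items.items()}
        for user, items in ratings.items()
    }
    return centered, user_means

def predict(centered_estimate, user, user_means):
    # The model works in centered space; add the user's mean back on.
    return centered_estimate + user_means[user]
```

Item means can be handled the same way on the transposed data, which
together amounts to removing the rank-1 component mentioned above.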
Off-line evaluation is difficult. Your suggestion of MRR and related
measures is reasonable, but I prefer to count every presentation on the
first page as equivalent. The real problem is that historical data will
only include presentations of items from a single recommendation system,
which means that any new system bringing in new recommendations is at a
disadvantage, at least in terms of the error bars around the estimated
click-through rate.

Another option is to compute a grouped AUC for clicked items relative to
unclicked items. To do this, iterate over users with clicks: pick a random
clicked item and a random unclicked item, and score 1 if the clicked item
has the higher score, 0 otherwise. Ties can be broken at random, but I
prefer to score them as 0 or 0.5. An average score near 1 is awesome.

I don't find it all that helpful to use the exact rank. Rather, I like to
group together all impressions that are shown in the same screenful and
then ignore the second and later pages. I also prefer to measure changes in
behavior that has business value rather than just ratings.
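The grouped AUC procedure above can be sketched roughly as follows. The
data layout (a score per (user, item) pair, plus per-user clicked and
unclicked impression lists) is a hypothetical stand-in, not Mahout code:

```python
import random

# Sketch of the grouped AUC check: for each user with a click, compare the
# model score of one random clicked item against one random unclicked item,
# scoring ties as 0.5 (one of the two options mentioned above).

def grouped_auc(scores, clicked, unclicked, rng=random):
    """scores: {(user, item): model_score};
    clicked / unclicked: {user: [impressed items]}."""
    total = 0.0
    n = 0
    for user, pos_items in clicked.items():
        neg_items = unclicked.get(user)
        if not pos_items or not neg_items:
            continue  # need at least one item on each side
        pos = rng.choice(pos_items)
        neg = rng.choice(neg_items)
        if scores[(user, pos)] > scores[(user, neg)]:
            total += 1.0
        elif scores[(user, pos)] == scores[(user, neg)]:
            total += 0.5
        n += 1
    return total / n if n else float("nan")
```

An average near 1 means clicked items are consistently scored above
unclicked ones; 0.5 is no better than chance.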
