From my experience, the best insights are found by A/B testing different algorithms against live users and measuring the relevant actions you want your recommender system to trigger (the number of recommended items put into a shopping cart, for example).
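A minimal sketch of that kind of live measurement, assuming two recommender variants and made-up add-to-cart counts (the counts and function name are invented for illustration): compare conversion rates with a two-sided two-proportion z-test.

```python
# Hypothetical A/B readout: how often a recommended item was added to the
# cart under variant A vs. variant B. All numbers below are invented.
from math import sqrt, erf

def ztest_two_proportions(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference of two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal tail
    return p_a, p_b, z, p_value

# 4000 sessions per arm; 120 vs. 155 add-to-cart events (fabricated).
p_a, p_b, z, p_value = ztest_two_proportions(120, 4000, 155, 4000)
print(f"A: {p_a:.3f}  B: {p_b:.3f}  z={z:.2f}  p={p_value:.4f}")
```

The point is only that "number of items put into a cart" becomes a rate you can compare across arms with error bars, rather than an offline score.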
The paper "Google News Personalization: Scalable Online Collaborative Filtering" ( http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.4329&rep=rep1&type=pdf ) has a chapter about how the guys there evaluated their newly built recommender system; maybe that gives us some more ideas.

--sebastian

2010/12/28 Ted Dunning <[email protected]>
>
> Actually, the Mahout code does address some of these issues:
>
> On Mon, Dec 27, 2010 at 3:37 PM, Lance Norskog <[email protected]> wrote:
>
> > Different people watch different numbers of movies.
>
> This is no problem, except that with fewer movies rated (or watched, if
> you are using implicit feedback) the results are less certain.
>
> > They also rate some but not all.
>
> Again, not a problem.
>
> > Their recommendations may be in one or a few clusters (other
> > clusterings can be by genre, by which day of the week the rating was
> > made, and so on) or may be scattered across genres (Harry Potter &
> > British comedy & European soft-core 70's porn).
>
> This isn't a problem except insofar as recommendations are a portfolio:
> getting a non-zero click-through on a set of recommendations is typically
> what you want, but most recommendation systems optimize the expected
> number of clicks. These are not the same thing, because clicks can
> correlate, and it helps to hedge your bets by increasing the diversity of
> the recommended set. This is usually handled in an ad hoc fashion.
>
> > Evaluating the worth of user X's ratings is also important.
>
> Not sure what you mean by this. There are effectively several options for
> this in Mahout.
>
> > If you want to interpret the ratings in an absolute number system, you
> > want to map the incoming ratings because they may average at 7.
>
> Not sure what you mean by this. If you have ratings limited to a
> particular range, then the average can't be outside that range.
> You may indeed want to subtract the user mean rating for each user before
> building the rec data and add back the mean for the user being
> recommended. Item means may be treated the same way. This is equivalent
> to subtracting a rank-1 approximation of the ratings that is derived
> using SVD.
>
> > The code in Mahout doesn't address these issues.
>
> Hmmm... I think it does. Perhaps Sean can comment.
>
> Moving to Otis' comments:
>
> > On Mon, Dec 27, 2010 at 6:54 AM, Otis Gospodnetic
> > <[email protected]> wrote:
> > > Hi,
> > >
> > > I was wondering how people evaluate the quality of recommendations
> > > other than RMSE and such in the eval package.
>
> Off-line evaluation is difficult. Your suggestion of MRR and related
> measures is reasonable, but I prefer to count every presentation on the
> first page as equivalent.
>
> The real problem is that historical data will only include presentations
> of items from a single recommendation system. That means that any new
> system that brings in new recommendations is at a disadvantage, at least
> in terms of the error bars around the estimated click-through rate.
>
> Another option is to compute a grouped AUC for clicked items relative to
> unclicked items. To do this, iterate over users with clicks. Pick a
> random clicked item and a random unclicked item. Score 1 if the clicked
> item has the higher score, 0 otherwise. Ties can be broken at random, but
> I prefer to score 0 or 0.5 for them. An average score near 1 is awesome.
>
> I don't find it all that helpful to use the exact rank. Rather, I like to
> group all impressions shown in the same screenful together and then
> ignore the second and later pages. I also prefer to measure changes in
> behavior that have business value rather than just ratings.
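The user-mean centering Ted describes can be sketched in a few lines. The data and names below are invented; the idea is just subtract each user's mean before building the rec data, then add it back when producing a prediction for that user:

```python
# User-mean centering sketch: ratings are fabricated toy data.
ratings = {  # user -> {item: rating}
    "alice": {"a": 5, "b": 3},
    "bob":   {"a": 2, "b": 2, "c": 2},
}

user_mean = {u: sum(r.values()) / len(r) for u, r in ratings.items()}

# Centered data is what the recommender would be trained on.
centered = {u: {i: v - user_mean[u] for i, v in r.items()}
            for u, r in ratings.items()}

def predict(user, centered_estimate):
    """Add the user's mean back onto an estimate made in centered space."""
    return user_mean[user] + centered_estimate

print(centered["alice"])        # → {'a': 1.0, 'b': -1.0}
print(predict("alice", 0.5))    # → 4.5
```

Item means can be subtracted the same way; doing both is the rank-1 SVD-style correction mentioned above.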
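The grouped-AUC procedure Ted outlines is easy to code up. A minimal sketch, with invented data and a fixed random seed; ties are scored 0.5 here, as he prefers:

```python
# Grouped AUC sketch: for each user with clicks, sample one clicked and one
# unclicked impression and compare their model scores. Data is made up.
import random

def grouped_auc(impressions, rng=None):
    """impressions: list of (clicked_items, unclicked_items, score_fn)."""
    rng = rng or random.Random(42)
    total, users = 0.0, 0
    for clicked, unclicked, score in impressions:
        if not clicked or not unclicked:
            continue  # need both a clicked and an unclicked item to compare
        c = score(rng.choice(clicked))
        u = score(rng.choice(unclicked))
        total += 1.0 if c > u else 0.5 if c == u else 0.0
        users += 1
    return total / users if users else float("nan")

# Toy scorer: the model ranks the clicked item higher for one user only.
data = [
    (["x"], ["y"], {"x": 0.9, "y": 0.2}.get),
    (["p"], ["q"], {"p": 0.1, "q": 0.8}.get),
]
print(grouped_auc(data))  # → 0.5 (one win, one loss)
```

An average near 1 means clicked items are almost always scored above unclicked ones; 0.5 is chance.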