On Mon, Dec 27, 2010 at 4:24 PM, Sebastian Schelter <[email protected]> wrote:
> From my experience the best insights are found by A/B testing
> different algorithms against live users and measuring relevant actions
> you want to see triggered by your recommender system (the number of
> recommended items put into a shopping cart, for example).

Amen to this. I only addressed off-line evaluation, but on-line evaluation is far better if you have sufficient traffic. Generally, off-line testing is only usable to weed out totally useless options, and A/B testing is required for a more realistic assessment.

> > > On Mon, Dec 27, 2010 at 6:54 AM, Otis Gospodnetic
> > > <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > I was wondering how people evaluate the quality of recommendations other
> > > > than RMSE and such in the eval package.
> > >
> > Off-line evaluation is difficult. Your suggestion of MRR and related
> > measures is reasonable, but I prefer to count every presentation on the
> > first page as equivalent.
> >
> > The real problem is that historical data will only include presentations of
> > items from a single recommendation system. That means that any new system
> > that brings in new recommendations is at a disadvantage, at least in terms of
> > error bars around the estimated click-through rate.
> >
> > Another option is to compute a grouped AUC for clicked items relative to
> > unclicked items. To do this, iterate over users with clicks. Pick a random
> > clicked item and a random unclicked item. Score 1 if the clicked item has the
> > higher score, 0 otherwise. Ties can be broken at random, but I prefer to
> > score 0 or 0.5 for them. An average score near 1 is awesome.
> >
> > I don't find it all that helpful to use the exact rank. Rather, I like to
> > group all impressions that are shown in the same screenful together and then
> > ignore second and later pages. I also prefer to measure changes in behavior
> > that have business value rather than just ratings.
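For the archives, the grouped-AUC procedure quoted above can be sketched roughly as follows. This is a minimal illustration, not Mahout code: the per-user score dicts, the click sets, and the `grouped_auc` name are all assumptions made for the example, and ties are scored 0.5 (the stricter choice mentioned above would score them 0).

```python
import random

def grouped_auc(user_scores, user_clicks, trials=200, seed=42):
    """Estimate grouped AUC by per-user sampling.

    user_scores: {user: {item: recommender score}}
    user_clicks: {user: set of clicked items}

    For each user with at least one clicked and one unclicked item,
    repeatedly sample one of each; score 1 if the clicked item
    out-scores the unclicked one, 0.5 on a tie, 0 otherwise.
    Returns the average over all samples (near 1 is awesome).
    """
    rng = random.Random(seed)
    total, n = 0.0, 0
    for user, scores in user_scores.items():
        clicks = user_clicks.get(user, set())
        clicked = [i for i in scores if i in clicks]
        unclicked = [i for i in scores if i not in clicks]
        if not clicked or not unclicked:
            continue  # user contributes nothing to the grouped average
        for _ in range(trials):
            c = scores[rng.choice(clicked)]
            u = scores[rng.choice(unclicked)]
            total += 1.0 if c > u else (0.5 if c == u else 0.0)
            n += 1
    return total / n if n else float("nan")

# Toy usage with made-up data: u1's clicked item always wins (1.0),
# u2's clicked item always ties (0.5), so the average is 0.75.
scores = {"u1": {"a": 0.9, "b": 0.1}, "u2": {"x": 0.5, "y": 0.5}}
clicks = {"u1": {"a"}, "u2": {"x"}}
print(grouped_auc(scores, clicks))
```

Grouping per user before averaging keeps heavy clickers from dominating the estimate, which is the point of the "grouped" variant.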
