From my experience, the best insights come from A/B testing
different algorithms against live users and measuring the relevant
actions you want your recommender system to trigger (the number of
recommended items put into a shopping cart, for example).

The paper "Google News Personalization: Scalable Online Collaborative
Filtering" (
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.4329&rep=rep1&type=pdf
) has a section describing how the authors evaluated their newly built
recommender system; maybe that gives us some more ideas.

--sebastian

2010/12/28 Ted Dunning <[email protected]>
>
> Actually, the Mahout code does address some of these issues:
>
> On Mon, Dec 27, 2010 at 3:37 PM, Lance Norskog <[email protected]> wrote:
>
> > Different people watch different numbers of movies.
>
>
> This is no problem, except that with fewer movies rated (or watched, if you
> are using implicit feedback) the results are less certain.
>
> They also rate
> > some but not all.
>
>
> Again, not a problem.
>
>
> > Their recommendations may be in one or a few
> > clusters (other clustering can be genre, which day of the week is the
> > rating, on and on) or may be scattered all over genres (Harry Potter &
> > British comedy & European soft-core 70's porn).
>
>
> This isn't a problem, except insofar as recommendations are a portfolio in
> which getting non-zero click-through on a set of recommendations is
> typically what you want, while most recommendation systems optimize the
> expected number of clicks.  These aren't the same thing, because clicks can
> correlate, and it helps to hedge your bets by increasing the diversity of the
> recommended set.  This is usually handled in an ad hoc fashion.
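One common ad hoc way to do that hedging is a greedy re-rank that trades
model score against similarity to items already picked. A minimal sketch
(the `similarity` function, the `trade_off` weight, and all names here are
illustrative, not anything from Mahout):

```python
def diversify(candidates, similarity, k, trade_off=0.5):
    """candidates: dict item -> model score; similarity(a, b) -> [0, 1].

    Greedily picks k items, penalizing each remaining candidate by its
    maximum similarity to the items chosen so far.  This is one ad hoc
    hedge against correlated clicks, not *the* fix.
    """
    chosen = []
    pool = dict(candidates)
    while pool and len(chosen) < k:
        best = max(
            pool,
            key=lambda i: pool[i] - trade_off * max(
                (similarity(i, c) for c in chosen), default=0.0),
        )
        chosen.append(best)
        del pool[best]
    return chosen
```

With a near-duplicate pair in the candidate set, the penalty pushes the
duplicate below a lower-scored but dissimilar item.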
>
>
> > Evaluating the worth
> > of user X's ratings is also important.
>
>
> Not sure what you mean by this.  There are effectively several options for
> this in Mahout.
>
>
> > If you want to interpret the
> > ratings in an absolute number system, you want to map the incoming
> > ratings because they may average at 7.
> >
>
> Not sure what you mean by this.  If you have ratings limited to a particular
> range, then the average can't be outside that range.  You may indeed want to
> subtract the user mean rating for each user before building the rec data and
> add back the mean for the user being recommended.  Item means may be treated
> the same way.  This is equivalent to subtracting a rank-1 approximation of
> the ratings that is derived using SVD.
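That centering step is small enough to sketch in plain Python (the data
layout and names are my own illustration):

```python
def center_by_user(ratings):
    """ratings: dict user -> dict item -> rating.

    Subtracts each user's mean rating before building the rec data;
    returns (centered, user_means) so the mean can be added back to
    any prediction made for that user.
    """
    centered, means = {}, {}
    for user, items in ratings.items():
        mean = sum(items.values()) / len(items)
        means[user] = mean
        centered[user] = {item: r - mean for item, r in items.items()}
    return centered, means
```

A user who averages 7 on a 1-10 scale then contributes deviations around
zero, like every other user, and gets their 7 back at prediction time.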
>
>
> >
> > The code in Mahout doesn't address these issues.
> >
>
> Hmmm... I think it does.  Perhaps Sean can comment.
>
> Moving to Otis' comments:
>
>
> >
> > On Mon, Dec 27, 2010 at 6:54 AM, Otis Gospodnetic
> > <[email protected]> wrote:
> > > Hi,
> > >
> > > I was wondering how people evaluate the quality of recommendations
> > > other than RMSE and such in the eval package.
> >
>
> Off-line evaluation is difficult.  Your suggestion of MRR and related
> measures is reasonable, but I prefer to count every presentation on the
> first page as equivalent.
>
> The real problem is that historical data will only include presentations of
> items from a single recommendation system.  That means that any new system
> that brings in new recommendations is at a disadvantage at least in terms of
> error bars around the estimated click through rate.
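Those error bars can be made concrete with a simple binomial confidence
interval on the observed click-through rate (a standard normal-approximation
formula, not anything specific to this thread):

```python
import math

def ctr_interval(clicks, impressions, z=1.96):
    """Normal-approximation (Wald) confidence interval for CTR.

    With few impressions the interval is wide, which is exactly the
    disadvantage a new recommender faces on historical data: its items
    were rarely shown, so its estimated CTR is very uncertain.
    """
    p = clicks / impressions
    half = z * math.sqrt(p * (1 - p) / impressions)
    return max(0.0, p - half), min(1.0, p + half)
```

Ten clicks in a hundred impressions and a hundred clicks in a thousand give
the same point estimate, but very different interval widths.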
>
> Another option is to compute grouped AUC for clicked items relative to
> unclicked items.  To do this, iterate over users with clicks.  Pick a random
> clicked item and a random unclicked item.  Score 1 if clicked item has
> higher score, 0 otherwise.  Ties can be broken at random, but I prefer to
> score 0 or 0.5 for them.  Average score near 1 is awesome.
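A minimal sketch of that grouped-AUC procedure (the data layout and names
are illustrative; this scores ties as 0.5, per the preference above):

```python
import random

def grouped_auc(users, trials=1, rng=random):
    """users: iterable of (scores, clicked) pairs, where `scores` maps
    item -> model score and `clicked` is the set of clicked items.

    For each user with both clicks and non-clicks, samples a random
    clicked and a random unclicked item and scores 1 if the clicked one
    ranks higher, 0.5 on a tie, 0 otherwise.  An average near 1 is awesome.
    """
    total, n = 0.0, 0
    for scores, clicked in users:
        unclicked = [i for i in scores if i not in clicked]
        pos = [i for i in scores if i in clicked]
        if not pos or not unclicked:
            continue  # only users with both kinds of items contribute
        for _ in range(trials):
            c = rng.choice(pos)
            u = rng.choice(unclicked)
            if scores[c] > scores[u]:
                total += 1.0
            elif scores[c] == scores[u]:
                total += 0.5
            n += 1
    return total / n if n else float('nan')
```

Raising `trials` just reduces sampling noise per user; it does not change
what is being estimated.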
>
> I don't find it all that helpful to use the exact rank.  Rather, I like to
> group all impressions that are shown in the same screenful together and then
> ignore second and later pages.  I also prefer to measure changes in behavior
> that have business value rather than just ratings.
