Actually, the Mahout code does address some of these issues:

On Mon, Dec 27, 2010 at 3:37 PM, Lance Norskog <[email protected]> wrote:

> Different people watch different numbers of movies.


This is no problem except that with fewer movies rated (or watched, if you
are using implicit feedback), the results are less certain.

> They also rate some but not all.


Again, not a problem.


> Their recommendations may be in one or a few
> clusters (other clustering can be genre, which day of the week is the
> rating, on and on) or may be scattered all over genres (Harry Potter &
> British comedy & European soft-core 70's porn).


This isn't a problem either, except insofar as recommendations are a
portfolio: what you typically want is a non-zero click-through on the
recommended set as a whole, but most recommendation systems optimize the
expected number of clicks instead.  These aren't the same thing, because
clicks can be correlated, so it helps to hedge your bets by increasing the
diversity of the recommended set.  This is usually handled in an ad hoc
fashion.


> Evaluating the worth
> of user X's ratings is also important.


Not sure what you mean by this.  There are effectively several options for
this in Mahout.


> If you want to interpret the
> ratings in an absolute number system, you want to map the incoming
> ratings because they may average at 7.
>

Not sure what you mean by this.  If you have ratings limited to a particular
range, then the average can't be outside that range.  You may indeed want to
subtract the user mean rating for each user before building the rec data and
add back the mean for the user being recommended.  Item means may be treated
the same way.  This is equivalent to subtracting a rank-1 approximation of
the ratings that is derived using SVD.
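In case it helps, here is a tiny sketch of that mean-centering step.  The
matrix and the variable names are made up for illustration; this is not
Mahout code:

```python
import numpy as np

# Hypothetical ratings matrix: rows = users, columns = items, 0 = unrated.
ratings = np.array([
    [5.0, 3.0, 0.0, 4.0],
    [4.0, 0.0, 2.0, 3.0],
    [0.0, 2.0, 5.0, 0.0],
])

mask = ratings > 0  # observed entries only
user_means = np.where(
    mask.any(axis=1),
    ratings.sum(axis=1) / mask.sum(axis=1),
    0.0,
)

# Subtract each user's mean from that user's observed ratings before
# building the recommendation data...
centered = np.where(mask, ratings - user_means[:, None], 0.0)

# ...and add the mean back for the user being recommended.
user = 0
predicted_centered = 0.5  # stand-in for whatever the model predicts
prediction = predicted_centered + user_means[user]
```

Item means can be subtracted the same way along axis 0.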


>
> The code in Mahout doesn't address these issues.
>

Hmmm... I think it does.  Perhaps Sean can comment.

Moving to Otis' comments:


>
> On Mon, Dec 27, 2010 at 6:54 AM, Otis Gospodnetic
> <[email protected]> wrote:
> > Hi,
> >
> > I was wondering how people evaluate the quality of recommendations other
> than
> > RMSE and such in eval package.
>

Off-line evaluation is difficult.  Your suggestion of MRR and related
measures is reasonable, but I prefer to count every presentation on the
first page as equivalent.

The real problem is that historical data will only include presentations of
items from a single recommendation system.  That means that any new system
that brings in new recommendations is at a disadvantage at least in terms of
error bars around the estimated click through rate.

Another option is to compute grouped AUC for clicked items relative to
unclicked items.  To do this, iterate over users with clicks.  Pick a random
clicked item and a random unclicked item.  Score 1 if clicked item has
higher score, 0 otherwise.  Ties can be broken at random, but I prefer to
score 0 or 0.5 for them.  Average score near 1 is awesome.
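To make that concrete, here is a small sketch of the grouped AUC sampling
procedure.  The data layout and function name are invented for the example;
the items carry (id, model score) pairs:

```python
import random

def grouped_auc(users, trials=1000, tie_score=0.0, rng=None):
    """Monte Carlo grouped AUC: repeatedly pick a user who has both clicked
    and unclicked items, draw one random item from each side, and score 1
    if the clicked item's model score is higher, tie_score on a tie, and 0
    otherwise.  Returns the average score; near 1 is awesome."""
    rng = rng or random.Random(0)
    eligible = [u for u in users if u["clicked"] and u["unclicked"]]
    total = 0.0
    for _ in range(trials):
        u = rng.choice(eligible)
        clicked = rng.choice(u["clicked"])      # (item id, model score)
        unclicked = rng.choice(u["unclicked"])  # (item id, model score)
        if clicked[1] > unclicked[1]:
            total += 1.0
        elif clicked[1] == unclicked[1]:
            total += tie_score                  # score ties as 0 or 0.5
    return total / trials
```

For instance, a user whose only clicked item outscores the only unclicked
one yields a score of 1.0 on every draw.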

I don't find it all that helpful to use the exact rank.  Rather, I like to
group all impressions that are shown in the same screenful together and then
ignore second and later pages.  I also prefer to measure changes in behavior
that has business value rather than just ratings.
