Ahh...

In that case, AUC and log-likelihood (for probability outputs) are the
natural measures of quality.  Precision at 20 or comparable measures are
also very helpful.
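
For concreteness, here is a minimal off-line sketch of those measures, assuming
you have held-out binary labels (did the user actually consume the item) and
real-valued scores from the recommender.  The toy data, array names, and the use
of scikit-learn are illustrative assumptions, not anything from Mahout.

    # Off-line evaluation sketch: AUC, average log-likelihood, precision at k.
    # All names and data below are illustrative assumptions.
    import numpy as np
    from sklearn.metrics import roc_auc_score, log_loss

    y_true  = np.array([1, 0, 0, 1, 1, 0, 1, 0])                  # held-out outcomes
    y_score = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3])  # recommender probabilities

    auc = roc_auc_score(y_true, y_score)   # chance a positive outranks a negative
    avg_ll = -log_loss(y_true, y_score)    # average log-likelihood; needs probability outputs

    k = 3                                  # 20 in practice; 3 for this toy data
    top_k = np.argsort(-y_score)[:k]       # indices of the k highest-scored items
    precision_at_k = y_true[top_k].mean()  # fraction of the top k actually consumed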

If you can deploy the system on a subset of data, then recommendation
click-through rate is the most realistic measure.  Hopefully you can push
this back to an off-line measure.

On Mon, Apr 25, 2011 at 6:12 PM, Peter Harrington <
[email protected]> wrote:

> Ted,
> Thanks for the quick response.  Perhaps I used the wrong terminology, but a
> recommender that uses binary data is nothing new.  For example, a news web
> site would like to recommend news stories based on your past viewing
> behavior: you either viewed an article or you didn't.  Chapter 6 in Mahout in
> Action uses the Wikipedia snapshot, where a link either exists or it doesn't,
> and recommendations are done on these binary datasets.
> The recommender is not generating a 1 or 0.
>
> Thanks again, I will probably go with precision.  What do you think about
> coverage?
> Peter
>
> On Mon, Apr 25, 2011 at 5:50 PM, Ted Dunning <[email protected]>
> wrote:
>
> > If the recommendation will only produce binary output scores and you have
> > actual held out user data, then you can still compute AUC.  If you want to
> > compute log-likelihood, you need to compute probabilities p_1 and p_2 that
> > represent what the recommender *should* have said when it actually said 0
> > or 1.  You can adapt these to give optimum log-likelihood on one held out
> > set and then get a real value for log-likelihood on another held out set.
> >
> > Precision, recall, and false positive rate are also possibly useful.
> >
> > If the engine has an internal threshold knob, you can build ROC curves and
> > estimate AUC using averaging.
> >
> > But the question remains, why would you use such a recommendation engine?
> >
> > On Mon, Apr 25, 2011 at 5:28 PM, Peter Harrington <
> > [email protected]> wrote:
> >
> > > Does anyone have a suggestion for how to evaluate a recommendation engine
> > > that uses a binary rating system?
> > > Usually the R scores (similarity score * rating of other items) are
> > > normalized by dividing by the sum of all rated similarity scores.  If I do
> > > this for a binary scoring system I would get 1.0 for every item.
> > >
> > > Is there another normalization I can do to get a number between 0 and 1.0?
> > > Should I just use precision and recall?
> > >
> > > Thanks for the help,
> > > Peter Harrington
> > >
> >
>
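
To make the normalization in the original question concrete: the prediction
described there is pred(u, i) = sum_j( sim(i, j) * r(u, j) ) / sum_j( sim(i, j) );
with every rating r(u, j) equal to 1, the numerator collapses to
sum_j( sim(i, j) ) and the normalized score is identically 1.0 for every item.
The useful signal is only in the ranking, which is exactly what AUC and
precision at k look at.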
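
And here is a minimal sketch of the calibration step described above, assuming
the recommender only says 0 or 1: fit probabilities p_1 and p_2 (what it
*should* have said for each output) for optimum log-likelihood on one held-out
set, then report log-likelihood with those probabilities on a second held-out
set.  The toy arrays and helper names below are illustrative assumptions, not
an existing API.

    # Calibrate binary recommender outputs into probabilities, then score
    # log-likelihood on a separate held-out set.  Illustrative sketch only.
    import numpy as np

    def fit_probs(said, truth, eps=1e-3):
        """MLE of P(actually consumed | recommender said s), clipped away from 0 and 1."""
        p = {}
        for s in (0, 1):
            mask = said == s
            p[s] = np.clip(truth[mask].mean() if mask.any() else 0.5, eps, 1 - eps)
        return p

    def avg_log_likelihood(said, truth, p):
        """Average log-likelihood of held-out outcomes under the calibrated probabilities."""
        probs = np.where(said == 1, p[1], p[0])
        return np.mean(truth * np.log(probs) + (1 - truth) * np.log(1 - probs))

    # tuning held-out set: the MLE gives the optimum-log-likelihood probabilities
    said_tune  = np.array([1, 1, 0, 0, 1, 0, 1, 0])
    truth_tune = np.array([1, 0, 0, 0, 1, 1, 1, 0])
    p = fit_probs(said_tune, truth_tune)

    # second held-out set: the log-likelihood reported here is the real measure
    said_eval  = np.array([1, 0, 1, 0, 1, 0])
    truth_eval = np.array([1, 0, 0, 0, 1, 1])
    print(avg_log_likelihood(said_eval, truth_eval, p))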
