Ahh... In that case, AUC and log-likelihood (for probability outputs) are the natural measures of quality. Precision at 20 or comparable measures are also very helpful.
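For concreteness, here is a rough offline sketch of those measures in plain Python with scikit-learn (not Mahout code; the arrays and the precision_at_k helper are made up for illustration), assuming you have a held-out 0/1 label and a recommender score for each (user, item) pair:

import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

y_true = np.array([1, 0, 0, 1, 1, 0, 0, 1])                    # held-out 0/1 interactions
y_score = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.3, 0.8])   # recommender scores

print("AUC:", roc_auc_score(y_true, y_score))
# log loss (negative log-likelihood) only makes sense if the scores are probabilities
print("log loss:", log_loss(y_true, y_score))

def precision_at_k(y_true, y_score, k=20):
    # Fraction of the top-k scored items the user actually consumed; in practice
    # you compute this per user over the full candidate list and then average.
    top_k = np.argsort(y_score)[::-1][:k]
    return y_true[top_k].mean()

print("precision@20:", precision_at_k(y_true, y_score, k=20))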
If you can deploy the system on a subset of data, then recommendation click-through rate is the most realistic measure. Hopefully you can push this back to an off-line measure.

On Mon, Apr 25, 2011 at 6:12 PM, Peter Harrington <[email protected]> wrote:

> Ted,
> Thanks for the quick response. Perhaps I used the wrong terminology, but a
> recommender that uses binary data is nothing new. For example: a news web
> site would like to recommend news stories based on your past viewing
> behavior: you viewed an article or not. Chapter 6 in Mahout in Action has
> the Wikipedia snapshot with link exists or not; recommendations are done
> on these binary datasets. The recommender is not generating a 1 or 0.
>
> Thanks again, I will probably go with precision. What do you think about
> coverage?
> Peter
>
> On Mon, Apr 25, 2011 at 5:50 PM, Ted Dunning <[email protected]> wrote:
>
> > If the recommendation will only produce binary output scores and you have
> > actual held-out user data, then you can still compute AUC. If you want to
> > compute log-likelihood, you need to compute probabilities p_1 and p_2 that
> > represent what the recommender *should* have said when it actually said 0
> > or 1. You can adapt these to give optimum log-likelihood on one held-out
> > set and then get a real value for log-likelihood on another held-out set.
> >
> > Precision, recall, and false positive rate are also possibly useful.
> >
> > If the engine has an internal threshold knob, you can build ROC curves and
> > estimate AUC using averaging.
> >
> > But the question remains, why would you use such a recommendation engine?
> >
> > On Mon, Apr 25, 2011 at 5:28 PM, Peter Harrington <
> > [email protected]> wrote:
> >
> > > Does anyone have a suggestion for how to evaluate a recommendation
> > > engine that uses a binary rating system?
> > > Usually the R scores (similarity score * rating of other items) are
> > > normalized by dividing by the sum of all rated similarity scores. If I
> > > do this for a binary scoring system I would get 1.0 for every item.
> > >
> > > Is there another normalization I can do to get a number between 0 and
> > > 1.0? Should I just use precision and recall?
> > >
> > > Thanks for the help,
> > > Peter Harrington
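To make the p_1/p_2 calibration from the quoted message concrete, here is a minimal sketch (hypothetical arrays, not Mahout code): fit the probability that the label is really 1 given that the recommender said 0 or 1 on one held-out set, then report log-likelihood on a second held-out set.

import numpy as np

def fit_output_probs(outputs, labels, smoothing=1.0):
    # Smoothed estimate of P(label = 1 | recommender output), i.e. the p_1 and
    # p_2 above: what the recommender *should* have said when it said 0 or 1.
    probs = {}
    for o in (0, 1):
        mask = outputs == o
        probs[o] = (labels[mask].sum() + smoothing) / (mask.sum() + 2 * smoothing)
    return probs

def log_likelihood(outputs, labels, probs):
    # Log-likelihood of held-out 0/1 labels under the calibrated probabilities.
    p = np.where(outputs == 1, probs[1], probs[0])
    return np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

# Fit on one held-out set...
probs = fit_output_probs(np.array([1, 1, 0, 0, 1, 0]), np.array([1, 0, 0, 1, 1, 0]))
# ...and report log-likelihood on a second held-out set.
print(log_likelihood(np.array([1, 0, 1, 0]), np.array([1, 0, 0, 0]), probs))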
