I'd say recommenders answer at least one, and maybe all three, of the following questions:
A. What are the top n best recommendations?
B. What are the top n best recommendations, ordered by quality?
C. What is the likely preference value for a given item and user?

All recommenders answer A. Virtually all answer B in the course of answering A. Most answer C in the course of answering B.

Precision/recall tests measure effectiveness in answering A. MAE tests measure effectiveness in answering C (which isn't applicable for all recommenders). You are only interested in A, therefore precision/recall is the right test to use. (Really, recall isn't going to be useful, but precision is.) Two short sketches of what each measurement looks like follow below the quoted message.

There's no such thing as a perfect evaluator. To perfectly evaluate recommendations you'd need to know the user's true preference for everything, which even the user does not know. These two techniques are standard and fine approximations to perfect evaluation. I am not saying they're flawed, no. They do have biases, though, which are worth understanding.

On Wed, Mar 3, 2010 at 10:23 AM, Mirko <[email protected]> wrote:
>
>> In a sense, evaluating the quality of predictions is slightly the
>> wrong question to ask. After all, a recommender's primary job is to
>> make ordered recommendations, only. It does not necessarily need to
>> predict preferences to do this, though most do.
>
> I see. Maybe this is part of my problem, apart from the novelty issue. I want
> to evaluate the quality of predictions, rather than how well they are
> ordered. To illustrate: if I recommend an unordered list of the top 5 items, it
> does not matter if items 1 and 4 are interchanged. But it may matter if items 4
> and 7 are interchanged. Thus, it is not sufficient for my evaluation to ask
> whether predictions are in the correct order. I rather need to evaluate the
> quality of the entirety of the 5 recommendations. I hoped that PR could be
> more appropriate than MAE to measure this 'quality of predictions' (rather
> than 'quality of ordering'). But if I get you correctly, both PR and MAE are
> not appropriate to measure quality of predictions (directly).
>
>> I don't have a good reference for you but I think there's really one
>> way forward to evaluation: you need to collect data about how often
>> your recommended items were viewed / clicked, and how they were rated.
>> That is, you'd really have to deploy the recommender and evaluate it
>> going forward. I just can't imagine any other solution since it is
>> necessarily based on information you don't have yet.
>
> Yes, I will give this a go.
>
> Thanks for your comments,
> Mirko
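
To make the distinction concrete, here is a toy sketch of the two measurements over held-out data for a single user. Everything in it (item IDs, ratings, the cutoff of 5) is made up for illustration and isn't tied to any particular recommender.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy illustration of the two measurements discussed above.
// Precision@n scores question A (is the recommended set good?);
// MAE scores question C (are the predicted preference values accurate?).
public class EvalSketch {

  // Precision@n: fraction of the n recommended items that appear in the
  // held-out set of items the user actually liked. Order within the
  // top n does not matter to this measure.
  static double precisionAtN(List<Long> recommended, Set<Long> relevant) {
    int hits = 0;
    for (long itemID : recommended) {
      if (relevant.contains(itemID)) {
        hits++;
      }
    }
    return recommended.isEmpty() ? 0.0 : (double) hits / recommended.size();
  }

  // MAE: mean absolute difference between predicted and actual preference
  // values over held-out ratings. Only meaningful for recommenders that
  // estimate preference values at all.
  static double meanAbsoluteError(Map<Long, Double> predicted, Map<Long, Double> actual) {
    double totalError = 0.0;
    int count = 0;
    for (Map.Entry<Long, Double> e : actual.entrySet()) {
      Double p = predicted.get(e.getKey());
      if (p != null) {
        totalError += Math.abs(p - e.getValue());
        count++;
      }
    }
    return count == 0 ? Double.NaN : totalError / count;
  }

  public static void main(String[] args) {
    // Hypothetical top-5 recommendation for one user, and the items that
    // user actually rated highly in the held-out data.
    List<Long> top5 = Arrays.asList(101L, 102L, 103L, 104L, 105L);
    Set<Long> heldOutLiked = new HashSet<Long>(Arrays.asList(102L, 105L, 230L, 231L));

    // Hypothetical predicted vs. actual ratings for the same user.
    Map<Long, Double> predicted = new HashMap<Long, Double>();
    predicted.put(102L, 4.5);
    predicted.put(105L, 3.0);
    predicted.put(230L, 2.5);
    Map<Long, Double> actual = new HashMap<Long, Double>();
    actual.put(102L, 5.0);
    actual.put(105L, 3.5);
    actual.put(230L, 4.0);

    System.out.println("precision@5 = " + precisionAtN(top5, heldOutLiked)); // 2/5 = 0.4
    System.out.println("MAE = " + meanAbsoluteError(predicted, actual));     // ~0.83
  }
}

Note that precisionAtN treats the top 5 as an unordered set, which is exactly the "quality of the entirety of the 5 recommendations" question; MAE instead scores how close the estimated preference values are, independent of which items end up in the top 5.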

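And since the context here is Taste/Mahout, here is a rough sketch of running both evaluations offline with the built-in evaluators, assuming the 0.x Taste API. The data file name, similarity, neighborhood size, cutoff and training split are placeholders; treat it as a starting point and check the signatures against the version you're running.

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteEvalSketch {

  public static void main(String[] args) throws Exception {
    // "ratings.csv" is a placeholder: one userID,itemID,preference per line.
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Placeholder recommender: user-based, Pearson similarity, 10 neighbors.
    RecommenderBuilder builder = new RecommenderBuilder() {
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, dataModel);
        return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
      }
    };

    // Question A: precision (and recall) at 5 on recommended items.
    RecommenderIRStatsEvaluator irEvaluator = new GenericRecommenderIRStatsEvaluator();
    IRStatistics stats = irEvaluator.evaluate(
        builder, null, model, null, 5,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
    System.out.println("precision@5 = " + stats.getPrecision());
    System.out.println("recall@5 = " + stats.getRecall());

    // Question C: average absolute difference between estimated and actual
    // preferences, training on 70% of the data.
    RecommenderEvaluator maeEvaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    double mae = maeEvaluator.evaluate(builder, null, model, 0.7, 1.0);
    System.out.println("MAE = " + mae);
  }
}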