The issue of offline tests is often misunderstood, I suspect. While I agree with 
Ted, it might help to explain a bit.

For myself, I'd say offline testing is a requirement, but not for comparing two 
disparate recommenders. Companies like Amazon and Netflix, as well as others on 
record, have a workflow that includes offline testing and comparison against 
previous versions of their own code on their own gold data set. These 
comparisons can be quite useful, if only in pointing to otherwise obscure bugs: 
if they see a difference between two offline tests, they ask why. Then, when 
they think they have an optimal solution, they run A/B tests as 
challenger/champion competitions, and it's these that are the only reliable 
measure of goodness.
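
To make the challenger/champion idea concrete, here's a minimal sketch (mine, 
not anything Amazon or Netflix have published): assume each bucket logs 
impressions and conversions, and use a two-proportion z-test to decide whether 
the challenger's lift is more than noise. The numbers are made up.

import math

def z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: returns (lift, z-score) for challenger vs champion."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return p_b - p_a, (p_b - p_a) / se

# Hypothetical traffic split: champion and challenger each serve 50,000 sessions.
lift, z = z_test(conv_a=1200, n_a=50000, conv_b=1320, n_b=50000)
print(f"lift={lift:.4f}, z={z:.2f}")  # |z| > 1.96 is roughly significant at 95%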

I do agree that comparing two recommenders with offline tests is dubious at 
best, as the paper points out. But put yourself in the place of a company new 
to recommenders that has several to choose from, maybe even versions of the 
same recommender with different tuning parameters. Run the offline tests with a 
standard set of your own data and pick the best to start with. What other 
choice do you have? Maybe flexibility or architecture trumps the offline tests; 
if not, using them is better than a random choice. Take the result with a 
grain of salt, though, and be ready to A/B test later challengers when or if 
you have time.
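
For what it's worth, here's the rough shape of the offline bake-off I mean; 
a sketch only, assuming each candidate exposes a recommend(user, k) call and 
that part of each user's history is held out as the gold set (the names are 
illustrative, not any particular library's API):

def precision_at_k(recommender, heldout, k=10):
    """Mean precision@k against each user's held-out interactions."""
    scores = []
    for user, relevant in heldout.items():
        recs = recommender.recommend(user, k)
        hits = len(set(recs) & set(relevant))
        scores.append(hits / k)
    return sum(scores) / len(scores)

# candidates might be the same recommender with different tuning parameters,
# or entirely different engines wrapped behind the same interface:
# best = max(candidates, key=lambda r: precision_at_k(r, heldout))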

In the case of the Solr recommender, it is extremely flexible and online 
(realtime results). For me, those features trump any offline tests against 
alternatives. But the demo site will include offline Mahout recommendations for 
comparison and, in the unlikely event that it gets any traffic, will 
incorporate A/B tests.

On Oct 9, 2013, at 4:29 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:


On Wed, Oct 9, 2013 at 12:54 PM, Michael Sokolov 
<msoko...@safaribooksonline.com> wrote:

BTW, lest we forget, this does not imply the Solr-recommender is better than 
Myrrix or the Mahout-only recommenders. There needs to be some careful 
comparison of results. Michael, did you do offline or A/B tests during your 
implementation?

I ran some offline tests using our historical data, but I don't have a lot of 
faith in these beyond the fact that they indicate we didn't make any obvious 
implementation errors. We haven't attempted A/B testing yet since our site is 
so new, and we need to get a meaningful baseline going and sort out a lot of 
other more pressing issues on the site; recommendations are only one piece, 
albeit an important one.


Actually, there was an interesting article posted recently about the 
difficulty of comparing results across systems in this field: 
http://www.docear.org/2013/09/23/research-paper-recommender-system-evaluation-a-quantitative-literature-survey/
but that's no excuse not to do better. I'll certainly share when I know more :)

I tend to be a pessimist with regard to off-line evaluation.  It is fine to do, 
but if a system is anywhere near best, I think that it should be considered for 
A/B testing.


