The issue of offline tests is often misunderstood, I suspect. While I agree with Ted, it might do to explain a bit.
For myself, I'd say offline testing is a requirement, but not for comparing two disparate recommenders. Companies like Amazon and Netflix, as well as others on record, have a workflow that includes offline testing and comparison against previous versions of their own code on their own gold data set. These comparisons can be quite useful, if only in pointing to otherwise obscure bugs: if they see a difference between two offline tests, they ask why. Then, when they think they have an optimal solution, they run A/B tests as challenger/champion competitions, and these are the only reliable measure of goodness.

I do agree that comparing two recommenders with offline tests is dubious at best, as the paper points out. But put yourself in the place of a company new to recommenders that has several to choose from, maybe even versions of the same recommender with different tuning parameters. Do the offline tests with a standard set of your own data and pick the best to start with. What other choice do you have? Maybe flexibility or architecture trumps the offline tests; if not, using them is better than a random choice. Take the result with a grain of salt, though, and be ready to A/B test later challengers when or if you have time.

In the case of the Solr recommender, it is extremely flexible and online (realtime results). For me those features trump any offline tests against alternatives. But the demo site will include offline Mahout recommendations for comparison and, in the unlikely event that it gets any traffic, will incorporate A/B tests.

On Oct 9, 2013, at 4:29 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> On Wed, Oct 9, 2013 at 12:54 PM, Michael Sokolov <msoko...@safaribooksonline.com> wrote:
>
>>> BTW, lest we forget, this does not imply the Solr-recommender is better than Myrrix or the Mahout-only recommenders. There needs to be some careful comparison of results.
>>>
>>> Michael, did you do offline or A/B tests during your implementation?
>>
>> I ran some offline tests using our historical data, but I don't have a lot of faith in these beyond the fact that they indicate we didn't make any obvious implementation errors. We haven't attempted A/B testing yet since our site is so new, and we need to get a meaningful baseline going and sort out a lot of other more pressing issues on the site; recommendations are only one piece, albeit an important one.
>>
>> Actually, there was an interesting article posted recently about the difficulty of comparing results across systems in this field: http://www.docear.org/2013/09/23/research-paper-recommender-system-evaluation-a-quantitative-literature-survey/ but that's no excuse not to do better. I'll certainly share when I know more :)
>
> I tend to be a pessimist with regard to off-line evaluation. It is fine to do, but if a system is anywhere near best, I think it should be considered for A/B testing.
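To put a bit of code behind the "pick the best to start with" step: an offline comparison on your own data usually amounts to holding out part of each user's history and measuring something like precision@k for each candidate recommender. A rough Python sketch, where `recommend` and the interaction dicts are hypothetical stand-ins rather than any particular Mahout, Myrrix, or Solr-recommender API:

```python
import random
from collections import defaultdict


def split_interactions(interactions, test_fraction=0.2, seed=42):
    """Hold out a random fraction of each user's items for offline testing.

    interactions: dict mapping user -> set of item ids (hypothetical format).
    Returns (train, test) dicts with disjoint item sets per user.
    """
    rng = random.Random(seed)
    train, test = defaultdict(set), {}
    for user, items in interactions.items():
        items = list(items)
        rng.shuffle(items)
        n_test = max(1, int(len(items) * test_fraction))
        test[user] = set(items[:n_test])
        train[user] = set(items[n_test:])
    return train, test


def precision_at_k(recommend, test_interactions, k=10):
    """Mean precision@k over held-out interactions.

    recommend(user, k) -> list of item ids; a hypothetical interface,
    not any real recommender's API.
    """
    scores = []
    for user, held_out in test_interactions.items():
        recs = recommend(user, k)
        hits = len(set(recs) & held_out)
        scores.append(hits / k)
    return sum(scores) / len(scores) if scores else 0.0
```

Picking a starting point is then just `max(candidates, key=lambda rec: precision_at_k(rec, test))`, taken with the usual grain of salt.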
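And for the later champion/challenger stage, the A/B comparison typically reduces to a two-proportion z-test on conversion counts. A minimal sketch; the traffic numbers in the note below are made up for illustration and come from nowhere in this thread:

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: does challenger B's conversion rate differ
    from champion A's? Returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # pooled conversion rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF, via math.erf
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

For example, 120 conversions out of 2,400 sessions for the champion versus 156 out of 2,400 for the challenger gives z of roughly 2.2, significant at the usual 5% level.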