(You can make evals run as fast as you like by sampling -- there's a
parameter that controls how much data is used for the test. Of course
it trades off accuracy.)
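
For example, here's a minimal sketch using the Taste evaluator API -- the data
file name, the Euclidean similarity and the N=2 neighborhood are just
placeholders I picked, not anything from your setup. The last argument to
evaluate() is the sampling knob: shrinking it makes the run fast (and noisier).

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SampledEvaluation {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder file

    RecommenderBuilder euclideanBuilder = new RecommenderBuilder() {
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        UserSimilarity similarity = new EuclideanDistanceSimilarity(dataModel);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, dataModel);
        return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
      }
    };

    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    // 0.7 = train on 70% of each user's prefs; 0.05 = evaluate against only ~5% of users.
    double aad = evaluator.evaluate(euclideanBuilder, null, model, 0.7, 0.05);
    System.out.println("Euclidean AAD on a 5% sample: " + aad);
  }
}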

Why are you saying these can't be compared by average absolute
difference? If it's the same input data, and the data has ratings, you
certainly can. AAD is not valid for comparison when you're not using
ratings in your *recommender*. It has nothing to do with your similarity
metric. These two don't use ratings, but they're just figuring out
weights for you -- that doesn't affect the logic of estimating prefs.
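
To make that concrete, here's a sketch reusing the model and evaluator from
the snippet above, with only the similarity swapped out:

// Needs org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity.
// Tanimoto ignores rating values when weighting neighbors, but
// GenericUserBasedRecommender still estimates preference values from the
// ratings in the DataModel, so the same AAD evaluator applies and the two
// scores are directly comparable.
RecommenderBuilder tanimotoBuilder = new RecommenderBuilder() {
  public Recommender buildRecommender(DataModel dataModel) throws TasteException {
    UserSimilarity similarity = new TanimotoCoefficientSimilarity(dataModel);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, dataModel);
    return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
  }
};
double tanimotoAad = evaluator.evaluate(tanimotoBuilder, null, model, 0.7, 0.05);
System.out.println("Tanimoto AAD on the same sample: " + tanimotoAad);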

Yes, precision/recall, F-measure and fall-out depend on a notion of
"relevant" or "correct" results, and that is a bit problematic in this
context.
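
For reference, this is roughly where the notion of "relevant" enters the Taste
IR evaluator -- again a sketch continuing the snippet above, with the at=8 cut
and the 5% sample chosen arbitrarily:

// Needs org.apache.mahout.cf.taste.eval.IRStatistics,
// org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator and
// org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator.
// Items a user rated at or above the relevance threshold count as "relevant";
// passing CHOOSE_THRESHOLD asks Mahout to derive a per-user threshold from
// that user's own ratings instead of using one fixed cutoff.
RecommenderIRStatsEvaluator irEvaluator = new GenericRecommenderIRStatsEvaluator();
IRStatistics stats = irEvaluator.evaluate(
    euclideanBuilder, null, model,
    null,                                               // no IDRescorer
    8,                                                  // precision/recall at N=8
    GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
    0.05);                                              // evaluate ~5% of users
System.out.println("Precision@8: " + stats.getPrecision());
System.out.println("Recall@8:    " + stats.getRecall());
System.out.println("F1@8:        " + stats.getF1Measure());
System.out.println("Fall-out@8:  " + stats.getFallOut());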

A/B testing is the ultimate test, yes. But these evaluations you're
running here do have value.

On Wed, Oct 26, 2011 at 11:46 AM, lee carroll
<[email protected]> wrote:
> (Apologies if the ASCII art fails)
>
> below is a made-up table similar to what is presented in the Mahout in
> Action book but with an IR evaluation stat added
> (it's made up as my machine takes too long to actually do the evaluations :-)
>
> ------------------------------------------------------------------------
> Similarity |     N=1      |     N=2      |     N=4      |     N=8      |
>            | AAD   | F1   | AAD   | F1   | AAD   | F1   | AAD   | F1   |
> ------------------------------------------------------------------------
> Euclidean  | 1.17  | 0.75 | 1.12* | 0.8+ | 1.23  | 0.67 | 1.25  | 0.7  |
> Tanimoto   | 1.32* | 0.6  | 1.33  | 0.56 | 1.43  | 0.51 | 1.32* | 0.69 |
> ------------------------------------------------------------------------
>
> So the *'s mark the best-performing recommender within a set of recommender
> results, and the + marks the best-performing recommender across recommenders.
> We use the F1 measure as the recommenders in question
> can't be compared with AAD directly.
>
> The use of F1 comes with the caveats that the relevance threshold chosen
> also impacts evaluation effectiveness (sigh), and that "good" recommendations
> used to calculate precision and recall can only come from items the
> user has knowledge of.
>
> I think what I'm slowly crawling towards is: A/B testing on the live site is
> still needed to confirm recommender choices. This is a great shame,
> as A/B testing on a large site is such a pain and leaves the code /
> content of a site in version hell. (It also involves a wide selection
> of stakeholders and potential metrics, which in my experience
> guarantees the results will be gerrymandered.) Anyway, I digress.
>
> Thanks for everyone's help.
>
> Cheers Lee C
>
