(Apologies if the ascii art fails)
Below is a made-up table similar to the one presented in the Mahout in
Action book, but with an IR evaluation stat added.
(It's made up because my machine takes too long to actually run the evaluations :-)
--------------------------------------------------------------------------
Similarity |     N=1      |     N=2      |     N=4      |     N=8      |
           | AAD   | F1   | AAD   | F1   | AAD   | F1   | AAD   | F1   |
-----------+-------+------+-------+------+-------+------+-------+------+
Euclidean  | 1.17  | 0.75 | 1.12* | 0.8+ | 1.23  | 0.67 | 1.25  | 0.7  |
Tanimoto   | 1.32* | 0.6  | 1.33  | 0.56 | 1.43  | 0.51 | 1.32* | 0.69 |
--------------------------------------------------------------------------
So the *'s mark the best-performing recommender within a set of
recommender results, and the + marks the best-performing recommender
across recommenders. We use the F1 measure because the recommenders in
question can't be compared with AAD directly.
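For completeness, here is a sketch of how one cell of that table could be
computed with Taste's evaluators. It is untested, and the data file name,
90% training split, top-5 cutoff and 3.0 relevance threshold are all
placeholder choices; it shows the Euclidean / N=2 case only.

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class EvalSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder data file

    final int n = 2; // neighbourhood size -- the N in the table above

    // Recommender under test: Euclidean similarity + nearest-N user neighbourhood
    RecommenderBuilder builder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        UserSimilarity similarity = new EuclideanDistanceSimilarity(dataModel);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(n, similarity, dataModel);
        return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
      }
    };

    // AAD: average absolute difference between estimated and held-out preferences
    RecommenderEvaluator aadEvaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    double aad = aadEvaluator.evaluate(builder, null, model, 0.9, 1.0); // 90% training split

    // IR stats: precision / recall / F1 over the top 5 recommendations,
    // treating held-out preferences >= 3.0 as "good" (both values are arbitrary here)
    RecommenderIRStatsEvaluator irEvaluator = new GenericRecommenderIRStatsEvaluator();
    IRStatistics stats = irEvaluator.evaluate(builder, null, model, null, 5, 3.0, 1.0);

    System.out.println("N=" + n + " AAD=" + aad + " F1=" + stats.getF1Measure());
  }
}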
The use of F1 comes with the caveats that the relevance threshold chosen
also impacts evaluation effectiveness (sigh), and that the "good"
recommendations used to calculate precision and recall can only come
from items the user has knowledge of.
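To put that threshold caveat in code terms: you can either fix the
relevance threshold yourself or pass
GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD and let the evaluator
derive a per-user threshold from that user's own ratings, and the F1 you
get back shifts accordingly. Continuing the sketch above (same builder,
model and irEvaluator):

    // Only items the held-out user rated >= 4.0 count as "good"
    IRStatistics atFixed = irEvaluator.evaluate(builder, null, model, null, 5, 4.0, 1.0);

    // Or let the evaluator pick a threshold per user from that user's own preference values
    IRStatistics atChosen = irEvaluator.evaluate(builder, null, model, null, 5,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);

    System.out.println("F1 @ fixed threshold: " + atFixed.getF1Measure()
        + ", F1 @ chosen threshold: " + atChosen.getF1Measure());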
I think what I'm slowly crawling towards is: A/B testing on the live site
is still needed to confirm recommender choices. This is a great shame,
as A/B testing on a large site is such a pain and leaves the code /
content of a site in version hell. (It also involves a wide selection
of stakeholders and potential metrics, which in my experience
guarantees the results will be gerrymandered.) Anyway, I digress.
Thanks for everyone's help.
Cheers Lee C