> AAD is not valid for comparison when you're not using ratings in your
> *recommender*. It's nothing to do with your similarity metric.
The penny drops.

On 26 October 2011 12:26, Sean Owen <[email protected]> wrote:
> (You can make evals run as fast as you like by sampling -- there's a
> parameter that controls how much data is used for the test. Of course
> it trades off accuracy.)
>
> Why are you saying these can't be compared by average absolute
> difference? If it's the same input data, and the data has ratings, you
> certainly can. AAD is not valid for comparison when you're not using
> ratings in your *recommender*. It's nothing to do with your similarity
> metric. These two don't use ratings, but, they're just figuring out
> weights for you -- doesn't affect the logic of estimating prefs.
>
> Yes, precision/recall and f-measure and fall-out depend on a notion of
> "relevant" or "correct" results and this is a bit problematic in this
> context.
>
> A/B testing is the ultimate test, yes. But these evaluations you're
> running here do have value.
>
> On Wed, Oct 26, 2011 at 11:46 AM, lee carroll
> <[email protected]> wrote:
>> (Apologies if the ASCII art fails)
>>
>> Below is a made-up table similar to what is presented in the Mahout in
>> Action book, but with an IR evaluation stat added
>> (it's made up as my machine takes too long to actually run the evaluations :-)
>>
>> --------------------------------------------------------------------
>> Similarity |     N=1     |     N=2     |     N=4     |     N=8     |
>>            | AAD  | F1   | AAD  | F1   | AAD  | F1   | AAD  | F1   |
>> --------------------------------------------------------------------
>> Euclidean  | 1.17 | 0.75 | 1.12*| 0.8+ | 1.23 | 0.67 | 1.25 | 0.7  |
>> --------------------------------------------------------------------
>> Tanimoto   | 1.32*| 0.6  | 1.33 | 0.56 | 1.43 | 0.51 | 1.32*| 0.69 |
>> --------------------------------------------------------------------
>>
>> The *'s mark the best-performing recommender within a set of recommender
>> results; the + marks the best-performing recommender across recommenders.
>> We use the F1 measure as the recommenders in question can't be compared
>> with AAD directly.
>>
>> The use of F1 comes with the caveats that the relevance threshold chosen
>> also impacts evaluation effectiveness (sigh), and that "good"
>> recommendations used to calculate precision and recall can only come from
>> items the user has knowledge of.
>>
>> I think what I'm slowly crawling towards is: A/B testing on the live site
>> is still needed to confirm recommender choices. This is a great shame, as
>> A/B testing on a large site is such a pain and leaves the code / content
>> of a site in version hell. (It also involves a wide selection of
>> stakeholders and potential metrics, which in my experience guarantees the
>> results will be gerrymandered.) Anyway, I digress.
>>
>> Thanks for everyone's help.
>>
>> Cheers, Lee C
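
For concreteness, a minimal sketch of the AAD evaluation Sean describes,
assuming the Taste API as it stood around Mahout 0.5. The data file name, the
N=2 neighbourhood and the 0.9/0.1 split are placeholders, not values from the
thread. The last argument to evaluate() is the sampling parameter mentioned
above: it limits how many users take part in the test, trading accuracy for
speed. Swapping the similarity class and N would cover the other cells of
Lee's table.

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.EuclideanDistanceSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class AadEvalSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder data file: userID,itemID,rating per line.
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Euclidean similarity with a user neighbourhood of N=2 (one cell of the table).
    RecommenderBuilder builder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        UserSimilarity similarity = new EuclideanDistanceSimilarity(dataModel);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, dataModel);
        return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
      }
    };

    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    // 0.9 = share of each user's prefs used for training (the rest are held out);
    // 0.1 = share of users evaluated at all -- the sampling knob that speeds up the run.
    double aad = evaluator.evaluate(builder, null, model, 0.9, 0.1);
    System.out.println("AAD = " + aad);
  }
}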

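Similarly, a sketch of the precision/recall/F1 side using Mahout's IR stats
evaluator, again with placeholder file name and parameter values. The
relevance-threshold argument is exactly the caveat Lee raises: CHOOSE_THRESHOLD
asks Mahout to derive a cut-off per user from that user's own ratings, whereas
any hand-picked value changes which held-out items count as "good"
recommendations and therefore changes F1. Note that Tanimoto ignores the rating
values when weighting neighbours, but the recommender still estimates
preferences from them, which is Sean's point about AAD remaining comparable.

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class IrStatsSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder path

    // Tanimoto similarity with a user neighbourhood of N=8 (another cell of the table).
    RecommenderBuilder builder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        UserSimilarity similarity = new TanimotoCoefficientSimilarity(dataModel);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(8, similarity, dataModel);
        return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
      }
    };

    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    // at = 8: precision/recall over the top 8 recommendations.
    // CHOOSE_THRESHOLD derives a per-user relevance cut-off from that user's own
    // ratings; a hand-picked threshold changes which held-out items count as "good".
    // 0.1 = fraction of users evaluated (sampling, as with the AAD run).
    IRStatistics stats = evaluator.evaluate(builder, null, model, null, 8,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 0.1);
    System.out.println("precision = " + stats.getPrecision());
    System.out.println("recall    = " + stats.getRecall());
    System.out.println("F1        = " + stats.getF1Measure());
    System.out.println("fall-out  = " + stats.getFallOut());
  }
}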