> AAD is not valid for comparison when you're not using
> ratings in your *recommender*. It's nothing to do with your similarity
>metric.

The penny drops.



On 26 October 2011 12:26, Sean Owen <[email protected]> wrote:
> (You can make evals run as fast as you like by sampling -- there's a
> parameter that controls how much data is used for the test. Of course
> it trades off accuracy.)
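>
> A rough sketch of that call, with a made-up file name and percentages
> -- the last argument to evaluate() is that sampling parameter:
>
>   DataModel model = new FileDataModel(new File("ratings.csv"));
>   RecommenderEvaluator evaluator =
>       new AverageAbsoluteDifferenceRecommenderEvaluator();
>   // 'builder' is whichever RecommenderBuilder you're comparing.
>   // Train on 70% of each user's prefs, test on the rest, but only
>   // use 10% of the users -- smaller is faster, and noisier.
>   double aad = evaluator.evaluate(builder, null, model, 0.7, 0.1);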
>
> Why are you saying these can't be compared by average absolute
> difference? If it's the same input data, and the data has ratings, you
> certainly can. AAD is not valid for comparison when you're not using
> ratings in your *recommender*. It's nothing to do with your similarity
> metric. These two don't use ratings, but they're just figuring out
> weights for you -- it doesn't affect the logic of estimating prefs.
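>
> To make that concrete, here's a sketch of a user-based recommender
> built on Tanimoto similarity (the neighborhood size and classes are
> just the usual Taste ones, picked for illustration):
>
>   RecommenderBuilder builder = new RecommenderBuilder() {
>     public Recommender buildRecommender(DataModel model) throws TasteException {
>       // Tanimoto ignores rating values when weighting neighbors...
>       UserSimilarity similarity = new TanimotoCoefficientSimilarity(model);
>       UserNeighborhood neighborhood =
>           new NearestNUserNeighborhood(2, similarity, model);
>       // ...but estimated preferences still come from the neighbors'
>       // actual ratings, so AAD remains a valid comparison.
>       return new GenericUserBasedRecommender(model, neighborhood, similarity);
>     }
>   };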
>
> Yes, precision/recall, F-measure, and fall-out depend on a notion of
> "relevant" or "correct" results and this is a bit problematic in this
> context.
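>
> For completeness, the IR-style evaluation looks something like this
> (the at-8 and 10% values are made up; CHOOSE_THRESHOLD lets the
> framework pick the relevance cutoff per user):
>
>   RecommenderIRStatsEvaluator irEvaluator =
>       new GenericRecommenderIRStatsEvaluator();
>   // "Relevant" items are drawn from each test user's own top prefs,
>   // which is exactly the problematic notion mentioned above.
>   IRStatistics stats = irEvaluator.evaluate(
>       builder, null, model, null, 8,
>       GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 0.1);
>   double f1 = stats.getF1Measure();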
>
> A/B testing is the ultimate test, yes. But these evaluations you're
> running here do have value.
>
> On Wed, Oct 26, 2011 at 11:46 AM, lee carroll
> <[email protected]> wrote:
>> (Apologies if the ascii art fails)
>>
>> below is a made-up table similar to what is presented in the Mahout in
>> Action book, but with an IR evaluation stat added
>> (it's made up as my machine takes too long to actually do the evaluations :-)
>>
>> --------------------------------------------------------------------
>>            |     N=1     |     N=2     |     N=4     |     N=8     |
>> Similarity | AAD  | F1   | AAD  | F1   | AAD  | F1   | AAD  | F1   |
>> --------------------------------------------------------------------
>> Euclidean  | 1.17 | 0.75 | 1.12*| 0.8+ | 1.23 | 0.67 | 1.25 | 0.7  |
>> Tanimoto   | 1.32*| 0.6  | 1.33 | 0.56 | 1.43 | 0.51 | 1.32*| 0.69 |
>> --------------------------------------------------------------------
>>
>> So the *'s mark the best-performing recommender within each set of
>> recommender results, and the + marks the best-performing recommender
>> across all recommenders. We use the F1 measure as the recommenders in
>> question can't be compared with AAD directly.
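>>
>> (For reference, F1 is just the harmonic mean of precision and recall,
>>
>>   F1 = 2 * precision * recall / (precision + recall)
>>
>> so one F1 column summarises both IR stats for each recommender.)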
>>
>> The use of F1 comes with the caveats that the relevance threshold chosen
>> also impacts evaluation effectiveness (sigh), and that "good" recommendations
>> used to calculate precision and recall can only come from items the
>> user already has knowledge of.
>>
>> I think what I'm slowly crawling towards is: A/B testing on the live site
>> is still needed to confirm recommender choices. This is a great shame
>> as A/B testing on a large site is such a pain and leaves the code /
>> content of a site in version hell. (It also involves a wide selection
>> of stakeholders and potential metrics, which in my experience
>> guarantees the results will be gerrymandered.) Anyway, I digress.
>>
>> Thanks for everyone's help.
>>
>> Cheers Lee C
>>
>
