http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376





------- Additional Comments From [EMAIL PROTECTED]  2007-07-06 13:43 -------
Re: Bayes immutability

All I'm really trying to say is that during scoring runs, we should be changing
the BAYES_ scores. We can manually make them "sane" if necessary, but we need to
change the values periodically to reflect the changing importance of the Bayes
rules relative to other rules. (In our LR research, we would have given BAYES_99
a score of >6 if we could have, so realistically a score of 4.5 would be fair
for BAYES_99. The best way to determine what it should be is with a scoring
mechanism.)

I can agree to disagree.

Re: TCR

TCR = number of spam / (number of fns + lambda * number of fps)

If you remember what TCR represents, it makes sense that TCR depends on the
relative ham/spam ratio of the corpus. (If you don't, see
http://wiki.spamassassin.org/TotalCostRatio -- though the wiki page really
overcomplicates the calculation.) It's still fine for ranking different
algorithms on the same corpus, but if you're upset about it, I propose the
following new measurement. Let's call it the Findlay measurement:

F(lambda) = 1 / (FN% + lambda * FP%)

(Strictly speaking, as defined above it's a function of lambda.)

This is exactly equal to TCR on a balanced (50/50) corpus and doesn't have the
"undesirable" properties you mentioned.

(Ok, don't call it the Findlay measurement... it's a stupid name, and it's a
fairly trivial derivation...)
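The equivalence on a balanced corpus is easy to check numerically. Here is a minimal Python sketch; the error counts and lambda value are hypothetical, not taken from any real corpus run:

```python
def tcr(n_spam, fn, fp, lam):
    """TCR = number of spam / (number of fns + lambda * number of fps)."""
    return n_spam / (fn + lam * fp)

def findlay(fn_pct, fp_pct, lam):
    """F(lambda) = 1 / (FN% + lambda * FP%)."""
    return 1.0 / (fn_pct + lam * fp_pct)

# Hypothetical balanced (50/50) corpus: 1000 spam, 1000 ham.
n_spam = n_ham = 1000
fn, fp, lam = 20, 5, 9.0   # made-up error counts, lambda = 9

t = tcr(n_spam, fn, fp, lam)
f = findlay(fn / n_spam, fp / n_ham, lam)
print(t, f)  # the two values agree when n_spam == n_ham
```

With an unbalanced corpus (say n_ham = 2000) the two values diverge, which is exactly the corpus-ratio dependence of TCR that F(lambda) avoids.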


