http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5686
------- Additional Comments From [EMAIL PROTECTED] 2007-10-15 05:51 -------
Let's have some graphs!
Here's a graph of scores from SVN trunk's version of Bayes, measured using
10-fold cross validation on a corpus of ~2000 recent spam and ~2000 recent ham
from my collection (I'm using this corpus to measure results as I develop
this):
http://taint.org/x/2007/graph_trunk.png
And here's a graph on the same corpus, classified using osbf-lua:
http://taint.org/x/2007/graph_osbflua.png
You can see several things:
- Current trunk's Bayes tends to put a fair bit of spam into the
"unsure" middle ground, BAYES_50, where it gets no score.
- osbf-lua is better at separating more of the samples into their correct
class, with a more or less clear dividing line around -15. (I'm not sure
what their score figure represents.)
This demonstrates that the algorithms used in osbf-lua are pretty effective, in
my opinion (and gives us an idea of what osbf can do, something to aim for with
our implementation).
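For reference, the 10-fold cross-validation here is nothing exotic. A rough
Python sketch of the split, with illustrative names rather than the real
mass-check tooling, assuming each corpus is just a list of messages:

import random

def ten_fold_scores(ham, spam, train_and_score, folds=10):
    # Shuffle once so the folds are random slices of each corpus.
    random.shuffle(ham)
    random.shuffle(spam)
    results = []
    for i in range(folds):
        # Fold i is held out for scoring...
        test_ham = ham[i::folds]
        test_spam = spam[i::folds]
        # ...and the classifier trains on the other nine folds.
        train_ham = [m for j, m in enumerate(ham) if j % folds != i]
        train_spam = [m for j, m in enumerate(spam) if j % folds != i]
        results.append(train_and_score(train_ham, train_spam,
                                       test_ham, test_spam))
    return results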
Now for the Winnow/OSBF implementation checked in as r584432,
compared to SVN trunk. Here's a score histogram from trunk:
SCORE NUMHIT DETAIL OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (99.914%) ..........|.......................................................
0.000 ( 0.761%) ######### |
0.040 ( 0.020%) |
0.040 ( 0.028%) |
0.080 ( 0.050%) # |
0.120 ( 0.015%) |
0.120 ( 0.039%) |
0.160 ( 0.005%) |
0.160 ( 0.011%) |
0.200 ( 0.017%) |
0.240 ( 0.005%) |
0.240 ( 0.022%) |
0.280 ( 0.017%) |
0.320 ( 0.011%) |
0.360 ( 0.028%) |
0.400 ( 0.010%) |
0.400 ( 0.017%) |
0.440 ( 0.005%) |
0.440 ( 0.083%) # |
0.480 ( 0.025%) |
0.480 ( 2.122%) ##########|#
0.520 ( 0.231%) ### |
0.560 ( 0.138%) ## |
0.600 ( 0.088%) # |
0.640 ( 0.127%) # |
0.680 ( 0.121%) # |
0.720 ( 0.182%) ## |
0.760 ( 0.193%) ## |
0.800 ( 0.187%) ## |
0.840 ( 0.116%) # |
0.880 ( 0.215%) ## |
0.920 ( 0.375%) #### |
0.960 (94.825%) ##########|#######################################################
(Hopefully that pastes OK.) The thing we want to see is: all "."s at 0.000, all
"#"s at 0.960-1.0, no "."s between 0.5 and 1.0 (false positives), and no "#"s
between 0.0 and 0.5 (false negatives).
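Put another way: given the per-message scores in [0,1], the failure counts
fall straight out of the distribution. A trivial sketch, with hypothetical
names rather than anything in our tree:

def error_counts(ham_scores, spam_scores, cutoff=0.5):
    # Ham scoring at or above the cutoff is a false positive;
    # spam scoring below it is a false negative.
    fp = sum(1 for s in ham_scores if s >= cutoff)
    fn = sum(1 for s in spam_scores if s < cutoff)
    return fp, fn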
Here's the histogram for r584432:
SCORE NUMHIT DETAIL OVERALL HISTOGRAM (. = ham, # = spam)
0.000 (94.728%) ..........|.......................................................
0.000 ( 0.077%) # |
0.960 ( 5.272%) ..........|...
0.960 (99.923%) ##########|#######################################################
That's very good, except for the 5.272% of false positives :( We need to avoid
that, since a 5% FP rate is serious.
the "thresholds" cost figure (in "results/thresholds.static"), which comes up
with a single-figure metric based on the score distribution, looks like this:
trunk:
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$222.30
Total ham:spam: 19764:18144
FP: 0 0.000% FN: 168 0.926%
Unsure: 543 1.432% (ham: 8 0.040% spam: 535 2.949%)
TCRs: l=1 25.809 l=5 25.809 l=9 25.809
SUMMARY: 0.30/0.70 fp 0 fn 168 uh 8 us 535 c 222.30
r584432:
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$10434.00
Total ham:spam: 19764:18144
FP: 1042 5.272% FN: 14 0.077%
Unsure: 0 0.000% (ham: 0 0.000% spam: 0 0.000%)
TCRs: l=1 17.182 l=5 3.473 l=9 1.932
SUMMARY: 0.30/0.70 fp 1042 fn 14 uh 0 us 0 c 10434.00
That cost metric penalises the 5% FP rate very heavily.
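For clarity, the weights fall out of the numbers above: $10 per FP, $1 per FN,
$0.10 per unsure (1042 x $10 + 14 x $1 = $10434.00; 168 x $1 + 543 x $0.10 =
$222.30), and the TCR lines are consistent with counting unsure spam alongside
the false negatives. A quick Python sketch (function names are mine, not the
actual thresholds code):

def cost(fp, fn, unsure_ham, unsure_spam):
    # Matches the SUMMARY lines above: $10/FP, $1/FN, $0.10/unsure.
    return 10.0 * fp + 1.0 * fn + 0.10 * (unsure_ham + unsure_spam)

def tcr(nspam, fp, fn, unsure_spam, lam):
    # Total Cost Ratio; unsure spam counted with the false negatives,
    # which is consistent with the figures above.
    return nspam / (lam * fp + fn + unsure_spam)

# Reproduces the r584432 line: l=1 17.182  l=5 3.473  l=9 1.932
for lam in (1, 5, 9):
    print(lam, round(tcr(18144, 1042, 14, 0, lam), 3))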