http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5686
------- Additional Comments From [EMAIL PROTECTED]  2007-10-15 05:51 -------
Let's have some graphs!

Here's a graph of scores from SVN trunk's version of Bayes, measured using
10-fold cross-validation on a corpus of ~2000 recent spam and ~2000 recent ham
from my collection (I'm using this corpus to measure results as I develop
this):

    http://taint.org/x/2007/graph_trunk.png

And here's a graph on the same corpus, classified using osbf-lua:

    http://taint.org/x/2007/graph_osbflua.png
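
For reference, the 10-fold cross-validation behind both graphs is the usual
loop: split the corpus into ten buckets, train on nine, score the held-out
bucket, rotate.  A minimal sketch, with the train() and classify() hooks left
as placeholders rather than my actual harness:

    import random

    # Sketch of the 10-fold cross-validation loop; train() and classify()
    # stand in for whichever learner is being measured.
    def ten_fold(messages, train, classify, folds=10):
        # messages: list of (text, is_spam) pairs
        # returns:  list of (score, is_spam) pairs for plotting
        random.shuffle(messages)
        buckets = [messages[i::folds] for i in range(folds)]
        results = []
        for i in range(folds):
            training = [m for j, b in enumerate(buckets) if j != i for m in b]
            model = train(training)
            results += [(classify(model, text), is_spam)
                        for text, is_spam in buckets[i]]
        return results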

You can see several things:

- current trunk's Bayes has a tendency to put a fair bit of spam into the
  "unsure" middle ground, BAYES_50, where it gets no score (see the sketch
  after this list).

- osbf-lua is better at separating the samples into their correct classes,
  with a more or less clear dividing line around -15.  (I'm not sure what
  their score figure represents.)
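
To make the BAYES_50 point concrete: the Bayes probability gets bucketed into
the BAYES_* rules, and only the outer buckets carry much score, so anything
landing in the middle band contributes essentially nothing.  A rough sketch,
with illustrative cutoffs (not necessarily trunk's exact boundaries):

    # Illustrative only: map a Bayes probability to a BAYES_* bucket.
    # These cutoffs are approximate, not necessarily trunk's exact values;
    # the point is that the 0.4-0.6 "unsure" band scores next to nothing.
    def bayes_bucket(prob):
        for upper, rule in [(0.01, "BAYES_00"), (0.20, "BAYES_20"),
                            (0.40, "BAYES_40"), (0.60, "BAYES_50"),
                            (0.80, "BAYES_60"), (0.95, "BAYES_80"),
                            (1.01, "BAYES_99")]:
            if prob < upper:
                return rule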

In my opinion this demonstrates that the algorithms used in osbf-lua are
pretty effective (and it gives us an idea of what osbf can do: something to
aim for with our own implementation).


Now for the Winnow/OSBF implementation checked in as r584432, compared to
SVN trunk.  Here's a score histogram from trunk:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (99.914%) ..........|.......................................................
0.000 ( 0.761%) ######### |
0.040 ( 0.020%)           |
0.040 ( 0.028%)           |
0.080 ( 0.050%) #         |
0.120 ( 0.015%)           |
0.120 ( 0.039%)           |
0.160 ( 0.005%)           |
0.160 ( 0.011%)           |
0.200 ( 0.017%)           |
0.240 ( 0.005%)           |
0.240 ( 0.022%)           |
0.280 ( 0.017%)           |
0.320 ( 0.011%)           |
0.360 ( 0.028%)           |
0.400 ( 0.010%)           |
0.400 ( 0.017%)           |
0.440 ( 0.005%)           |
0.440 ( 0.083%) #         |
0.480 ( 0.025%)           |
0.480 ( 2.122%) ##########|#
0.520 ( 0.231%) ###       |
0.560 ( 0.138%) ##        |
0.600 ( 0.088%) #         |
0.640 ( 0.127%) #         |
0.680 ( 0.121%) #         |
0.720 ( 0.182%) ##        |
0.760 ( 0.193%) ##        |
0.800 ( 0.187%) ##        |
0.840 ( 0.116%) #         |
0.880 ( 0.215%) ##        |
0.920 ( 0.375%) ####      |
0.960 (94.825%) ##########|#######################################################

(Hopefully that pastes OK.)  What we want to see is all "."s at 0.000, all
"#"s at 0.960-1.0, no "."s between 0.5 and 1.0 (false positives), and no "#"s
between 0.0 and 0.5 (false negatives).
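
Here's roughly how those figures fall out of the raw (score, is_spam) pairs;
a sketch of the idea only, not the actual masses tooling:

    # Sketch: reduce (score, is_spam) pairs to FP/FN counts plus per-bin
    # ham/spam tallies like the histogram above.
    def summarise(results, cutoff=0.5, bins=25):
        fp = sum(1 for s, spam in results if not spam and s >= cutoff)
        fn = sum(1 for s, spam in results if spam and s < cutoff)
        hist = [[0, 0] for _ in range(bins)]   # [ham, spam] per 0.04-wide bin
        for s, spam in results:
            hist[min(int(s * bins), bins - 1)][1 if spam else 0] += 1
        return fp, fn, hist

Each hist row then just gets drawn as a run of "."s or "#"s.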


Here's the histogram for r584432:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (94.728%) ..........|.......................................................
0.000 ( 0.077%) #         |
0.960 ( 5.272%) ..........|...
0.960 (99.923%) ##########|#######################################################

That's very good, except for the 5.272% false positive rate :(  We need to
avoid that, since a 5% FP rate is serious.

the "thresholds" cost figure (in "results/thresholds.static"), which comes up
with a single-figure metric based on the score distribution, looks like this:

trunk:
  Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$222.30
  Total ham:spam:   19764:18144
  FP:     0 0.000%    FN:   168 0.926%
  Unsure:   543 1.432%     (ham:     8 0.040%    spam:   535 2.949%)
  TCRs:              l=1 25.809    l=5 25.809    l=9 25.809
  SUMMARY: 0.30/0.70  fp     0 fn   168 uh     8 us   535    c 222.30

r584432:
  Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$10434.00
  Total ham:spam:   19764:18144
  FP:  1042 5.272%    FN:    14 0.077%
  Unsure:     0 0.000%     (ham:     0 0.000%    spam:     0 0.000%)
  TCRs:              l=1 17.182    l=5 3.473    l=9 1.932
  SUMMARY: 0.30/0.70  fp  1042 fn    14 uh     0 us     0    c 10434.00

That cost metric penalised the 5% FP rate very heavily.
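
For anyone wondering where the dollar figures come from: they're consistent
with a weighted sum of roughly $10 per false positive, $1 per false negative
and $0.10 per unsure (10*1042 + 14 = 10434.00; 168 + 0.1*543 = 222.30), and
the TCR lines match Nspam / (lambda*FP + spam that got through).  A sketch of
that arithmetic; the weights are inferred from the output above, so treat the
thresholds tooling itself as authoritative:

    # Reverse-engineered from the figures above: FP = $10, FN = $1,
    # unsure = $0.10.  Treat these weights as assumptions, not gospel.
    def cost(fp, fn, unsure_ham, unsure_spam,
             fp_cost=10.0, fn_cost=1.0, unsure_cost=0.10):
        return fp * fp_cost + fn * fn_cost + (unsure_ham + unsure_spam) * unsure_cost

    def tcr(n_spam, fp, missed_spam, lam):
        # TCR = Nspam / (lambda * FP + spam that got through)
        return n_spam / (lam * fp + missed_spam)

    print(cost(0, 168, 8, 535))       # ~222.30, the trunk figure above
    print(cost(1042, 14, 0, 0))       # 10434.00, the r584432 figure
    print(tcr(18144, 1042, 14, 5))    # ~3.473, r584432's l=5 TCR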




