http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5686





------- Additional Comments From [EMAIL PROTECTED]  2007-10-25 12:11 -------
I've been trying some tokenizer tweaks, but none of them is helping much. So
at this point it would be handy to restate the current "baseline": the best
results so far, from r585992.

The full 10-fold cross-validation's histogram is the last graph in comment 6 --
I'll paste it here:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (25.415%) ..........|.......................................................
0.040 ( 9.831%) ..........|.....................
0.080 (22.571%) ..........|.................................................
0.120 (21.716%) ..........|...............................................
0.160 ( 8.435%) ..........|..................
0.200 ( 5.444%) ..........|............
0.200 ( 0.028%) #         |
0.240 ( 3.916%) ..........|........
0.240 ( 0.022%) #         |
0.280 ( 1.801%) ..........|....
0.280 ( 0.022%) #         |
0.320 ( 0.491%) ..........|.
0.320 ( 0.226%) #####     |
0.360 ( 0.116%) .....     |
0.360 ( 0.231%) ######    |
0.400 ( 0.040%) ..        |
0.400 ( 0.193%) #####     |
0.440 ( 0.132%) ###       |
0.480 ( 0.223%) ..........|
0.480 ( 1.334%) ##########|##
0.520 ( 0.110%) ###       |
0.560 ( 0.419%) ##########|#
0.600 ( 0.832%) ##########|#
0.640 ( 1.769%) ##########|##
0.680 ( 8.813%) ##########|###########
0.720 (36.767%) ##########|############################################
0.760 (45.712%) ##########|#######################################################
0.800 ( 3.279%) ##########|####
0.840 ( 0.006%)           |
0.880 ( 0.011%)           |
0.920 ( 0.022%) #         |
0.960 ( 0.072%) ##        |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$206.30
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:     9 0.050%
Unsure:  1973 5.205%     (ham:   528 2.672%    spam:  1445 7.964%)
TCRs:              l=1 12.479    l=5 12.479    l=9 12.479
SUMMARY: 0.30/0.70  fp     0 fn     9 uh   528 us  1445    c 206.30
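
The cost and TCR figures above can be reproduced with a simple weighting. The
weights below are inferred from the numbers in this comment, not taken from the
script that printed them: $1 per false negative, $0.10 per "unsure" message,
and TCR = spam / (l*FP + FN + unsure spam). The false-positive weight cannot be
checked here because FP is 0, so the $10 default is purely an assumption, and
the function names are hypothetical. A minimal sketch:

```python
def threshold_cost(fp, fn, unsure, fp_cost=10.0, fn_cost=1.0, unsure_cost=0.1):
    """Dollar cost of a threshold pair.

    fn_cost and unsure_cost are inferred from the figures in this comment;
    fp_cost is an assumption (FP is 0 in both runs, so it cannot be verified).
    """
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost


def tcr(n_spam, fp, fn, unsure_spam, lam=1):
    """Total cost ratio, with unsure spam counted as missed (inferred form)."""
    denom = lam * fp + fn + unsure_spam
    return float("inf") if denom == 0 else n_spam / denom


# Full 10-fold run: fp=0, fn=9, unsure ham=528, unsure spam=1445, spam=18144
print(round(threshold_cost(0, 9, 528 + 1445), 2))   # 206.3
print(round(tcr(18144, 0, 9, 1445), 3))             # 12.479
```

With FP at 0, the l*FP term vanishes, which is why the l=1, l=5 and l=9 TCRs
are identical.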

Conveniently, I've noticed that fold 1 is pretty representative of that graph
and those numbers:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (27.277%) ..........|.......................................................
0.040 (10.020%) ..........|....................
0.080 (21.356%) ..........|...........................................
0.120 (24.190%) ..........|.................................................
0.160 ( 8.654%) ..........|.................
0.200 ( 5.061%) ..........|..........
0.200 ( 0.055%) #         |
0.240 ( 2.379%) ..........|.....
0.280 ( 0.709%) ..........|.
0.280 ( 0.055%) #         |
0.320 ( 0.152%) ......    |
0.320 ( 0.386%) ##########|#
0.360 ( 0.051%) ..        |
0.360 ( 0.165%) ####      |
0.400 ( 0.110%) ###       |
0.440 ( 0.662%) ##########|#
0.480 ( 0.152%) ......    |
0.480 ( 0.937%) ##########|#
0.520 ( 0.276%) #######   |
0.560 ( 0.827%) ##########|#
0.600 ( 1.213%) ##########|##
0.640 ( 1.985%) ##########|###
0.680 (11.025%) ##########|###############
0.720 (39.802%) ##########|######################################################
0.760 (40.463%) ##########|#######################################################
0.800 ( 1.985%) ##########|###
0.960 ( 0.055%) #         |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$20.50
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:     1 0.055%
Unsure:   195 5.145%     (ham:    21 1.063%    spam:   174 9.592%)
TCRs:              l=1 10.366    l=5 10.366    l=9 10.366
SUMMARY: 0.30/0.70  fp     0 fn     1 uh    21 us   174    c 20.50

This is handy because a single fold takes 1/10th of the time to run. ;)

(By the way, note that you have to scale the single-fold "threshold
optimization" cost figure by 10x to compare it with the full run, because the
corpus sizes differ. I should have normalized it but didn't.)
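
One way to sidestep the manual scaling is to express the cost per 1000
messages, so that runs over different corpus sizes compare directly. This is
only a sketch of that normalization; the helper name is hypothetical and is not
part of the test scripts:

```python
def cost_per_1000(cost, n_ham, n_spam):
    """Normalize a threshold-optimization cost to a per-1000-messages figure."""
    return 1000.0 * cost / (n_ham + n_spam)


# Full 10-fold run vs. fold 1, using the totals reported in this comment
print(round(cost_per_1000(206.30, 19764, 18144), 2))  # 5.44
print(round(cost_per_1000(20.50, 1976, 1814), 2))     # 5.41
```

The two normalized figures are close, which matches the observation that fold 1
is representative of the full run.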

Anyway, I've checked it in as r588315.  This is the new baseline for further 
tests.


