http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5686





------- Additional Comments From [EMAIL PROTECTED]  2007-10-22 07:10 -------
(In reply to comment #7)
> 2. the effect of less training data, which is the real issue -- can OSBF do a
> better job with tiny amounts of training, than our existing Bayes impl?

Results from the weekend's testing of this: I ran the 10-fold cross-validation
driver with "--learnprob 0.1 --randseed 23" (i.e. training on only 10% of the
messages) and got these histograms:

SVN trunk:

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$252.30
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:   155 0.854%
Unsure:   973 2.567%     (ham:    24 0.121%    spam:   949 5.230%)
TCRs:              l=1 16.435    l=5 16.435    l=9 16.435
SUMMARY: 0.30/0.70  fp     0 fn   155 uh    24 us   949    c 252.30

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (99.676%) ..........|.......................................................
0.000 ( 0.645%) ########  |
0.040 ( 0.040%)           |
0.040 ( 0.055%) #         |
0.080 ( 0.040%)           |
0.080 ( 0.022%)           |
0.120 ( 0.030%)           |
0.120 ( 0.050%) #         |
0.160 ( 0.035%)           |
0.160 ( 0.022%)           |
0.200 ( 0.040%)           |
0.200 ( 0.028%)           |
0.240 ( 0.015%)           |
0.240 ( 0.033%)           |
0.280 ( 0.020%)           |
0.280 ( 0.077%) #         |
0.320 ( 0.015%)           |
0.320 ( 0.061%) #         |
0.360 ( 0.015%)           |
0.360 ( 0.044%) #         |
0.400 ( 0.015%)           |
0.400 ( 0.121%) #         |
0.440 ( 0.035%)           |
0.440 ( 0.198%) ##        |
0.480 ( 0.020%)           |
0.480 ( 3.919%) ##########|##
0.520 ( 0.314%) ####      |
0.560 ( 0.165%) ##        |
0.600 ( 0.149%) ##        |
0.640 ( 0.077%) #         |
0.680 ( 0.215%) ###       |
0.720 ( 0.116%) #         |
0.760 ( 0.116%) #         |
0.800 ( 0.171%) ##        |
0.840 ( 0.121%) #         |
0.880 ( 0.193%) ##        |
0.920 ( 0.336%) ####      |
0.960 (92.752%) ##########|#######################################################


OSBF with EDDC:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 ( 4.007%) ..........|........
0.040 ( 3.177%) ..........|......
0.080 (18.787%) ..........|....................................
0.120 (28.415%) ..........|.......................................................
0.160 (17.588%) ..........|..................................
0.160 ( 0.006%)           |
0.200 (11.369%) ..........|......................
0.200 ( 0.011%)           |
0.240 ( 7.357%) ..........|..............
0.240 ( 0.022%) #         |
0.280 ( 4.574%) ..........|.........
0.280 ( 0.033%) #         |
0.320 ( 3.046%) ..........|......
0.320 ( 0.127%) ####      |
0.360 ( 1.184%) ..........|..
0.360 ( 0.303%) ######### |
0.400 ( 0.233%) ......... |
0.400 ( 0.733%) ##########|#
0.440 ( 0.046%) ..        |
0.440 ( 0.424%) ##########|#
0.480 ( 0.207%) ........  |
0.480 ( 1.560%) ##########|##
0.520 ( 0.010%)           |
0.520 ( 1.036%) ##########|##
0.560 ( 1.565%) ##########|##
0.600 ( 1.984%) ##########|###
0.640 ( 5.958%) ##########|#########
0.680 (20.993%) ##########|###############################
0.720 (36.795%) ##########|#######################################################
0.760 (25.143%) ##########|######################################
0.800 ( 3.213%) ##########|#####
0.840 ( 0.083%) ##        |
0.960 ( 0.011%)           |

The thresholds report looks like this:

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$583.00
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:     7 0.039%
Unsure:  5760 15.195%     (ham:  1838 9.300%    spam:  3922 21.616%)
TCRs:              l=1 4.618    l=5 4.618    l=9 4.618
SUMMARY: 0.30/0.70  fp     0 fn     7 uh  1838 us  3922    c 583.00

But that's unfair, because 0.70 (as you can see from the histogram)
is right in the middle of most of the spam.  0.56 would be better:

Threshold optimization for hamcutoff=0.38, spamcutoff=0.56: cost=$234.80
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:    55 0.303%
Unsure:   899 2.372%     (ham:   182 0.921%    spam:   717 3.952%)
TCRs:              l=1 23.503    l=5 23.503    l=9 23.503
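
As a sanity check on the TCR lines, here is a minimal sketch of the arithmetic
that reproduces all three reports.  It assumes (inferred from the numbers above,
not checked against the driver's code) that unsure spam is counted as missed,
i.e. TCR = nspam / (l*fp + fn + us).  The function name `tcr` and the dict
layout are illustrative only.  With fp = 0 the l weight drops out, which would
explain why the l=1, l=5 and l=9 values are identical in every report.

```python
N_SPAM = 18144  # total spam, from the "Total ham:spam" lines


def tcr(n_spam, fp, fn, unsure_spam, lam=1):
    """Total cost ratio, ASSUMING unsure spam is treated as missed spam."""
    return n_spam / (lam * fp + fn + unsure_spam)


# (fp, fn, us) copied from the three threshold reports
reports = {
    "SVN trunk 0.30/0.70": (0, 155, 949),
    "OSBF+EDDC 0.30/0.70": (0, 7, 3922),
    "OSBF+EDDC 0.38/0.56": (0, 55, 717),
}

for name, (fp, fn, us) in reports.items():
    # lam only matters when fp > 0, so every l gives the same value here
    print(name, [round(tcr(N_SPAM, fp, fn, us, lam), 3) for lam in (1, 5, 9)])
```

Under that assumption, the better TCR at 0.38/0.56 comes almost entirely from
fewer unsure spam (717 vs. 949), at the price of more unsure ham (182 vs. 24).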

I guess it's good, but it's not stellar :(


