http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5686





------- Additional Comments From [EMAIL PROTECTED]  2007-10-30 11:24 -------
More tests.  Setting N_SIGNIFICANT_TOKENS to infinity (i.e. using all
tokens instead of the N most significant/strong ones) is bad:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 ( 0.506%) ..........|.
0.040 (25.658%) ..........|.................................
0.080 (43.067%) ..........|.......................................................
0.120 (22.166%) ..........|............................
0.120 ( 0.055%) #         |
0.160 ( 6.275%) ..........|........
0.200 ( 1.569%) ..........|..
0.200 ( 0.055%) #         |
0.240 ( 0.607%) ..........|.
0.240 ( 0.717%) ##########|#
0.280 ( 0.051%) .         |
0.280 ( 0.276%) ####      |
0.320 ( 0.101%) ...       |
0.320 ( 0.276%) ####      |
0.360 ( 0.276%) ####      |
0.400 ( 0.221%) ###       |
0.440 ( 0.441%) #######   |
0.480 ( 0.662%) ##########|#
0.520 ( 1.323%) ##########|#
0.560 ( 0.882%) ##########|#
0.600 ( 0.827%) ##########|#
0.640 ( 0.882%) ##########|#
0.680 ( 1.047%) ##########|#
0.720 ( 8.379%) ##########|######
0.760 (70.948%) ##########|#######################################################
0.800 (12.679%) ##########|##########
0.880 ( 0.055%) #         |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$27.60
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:    15 0.827%
Unsure:   126 3.325%     (ham:     3 0.152%    spam:   123 6.781%)
TCRs:              l=1 13.145    l=5 13.145    l=9 13.145
SUMMARY: 0.30/0.70  fp     0 fn    15 uh     3 us   123    c 27.60
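
For readers unfamiliar with the setting, here is a minimal Python sketch of what "using the N most significant tokens" can mean. It is not SpamAssassin's Perl code: the function name, the sample probabilities, and the choice of "distance from 0.5" as the strength measure are all assumptions made for illustration.

```python
def most_significant(token_probs, n=None):
    """Return (token, probability) pairs, strongest first.

    token_probs maps token -> estimated spam probability.
    Strength is taken to be the distance from the neutral value 0.5
    (an assumption for this sketch). n=None keeps every token, which
    corresponds to the "infinite" setting tested above.
    """
    ranked = sorted(token_probs.items(),
                    key=lambda kv: abs(kv[1] - 0.5),
                    reverse=True)
    return ranked if n is None else ranked[:n]


# Hypothetical per-token probabilities, not taken from the test corpus.
probs = {"viagra": 0.99, "meeting": 0.03, "the": 0.51, "click": 0.80}
top2 = most_significant(probs, n=2)
everything = most_significant(probs)
```

With no limit, weak tokens such as "the" also feed into the combined score. That may be part of why the unlimited run does worse, though the comment itself does not say so.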


N_SIGNIFICANT_TOKENS=999 is still on the wrong side of the baseline:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (24.747%) ..........|............................................
0.040 (18.522%) ..........|.................................
0.080 (31.123%) ..........|.......................................................
0.120 (13.057%) ..........|.......................
0.160 ( 5.820%) ..........|..........
0.160 ( 0.055%) #         |
0.200 ( 4.251%) ..........|........
0.240 ( 1.822%) ..........|...
0.280 ( 0.405%) ..........|.
0.280 ( 0.110%) ###       |
0.320 ( 0.152%) .....     |
0.320 ( 0.331%) ########  |
0.360 ( 0.101%) ....      |
0.360 ( 0.110%) ###       |
0.400 ( 0.772%) ##########|#
0.440 ( 0.165%) ####      |
0.480 ( 0.717%) ##########|#
0.520 ( 0.606%) ##########|#
0.560 ( 0.992%) ##########|#
0.600 ( 1.268%) ##########|##
0.640 ( 1.985%) ##########|##
0.680 ( 7.166%) ##########|########
0.720 (24.862%) ##########|#############################
0.760 (46.472%) ##########|#######################################################
0.800 (13.671%) ##########|################
0.840 ( 0.662%) ##########|#
0.920 ( 0.055%) #         |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$18.80
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:     1 0.055%
Unsure:   178 4.697%     (ham:    13 0.658%    spam:   165 9.096%)
TCRs:              l=1 10.928    l=5 10.928    l=9 10.928
SUMMARY: 0.30/0.70  fp     0 fn     1 uh    13 us   165    c 18.80
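
The cost and TCR figures in these summaries can be reproduced from the counts. The Python sketch below uses weights inferred from the numbers in this comment: each FN costs 1 and each unsure message costs 0.1, and TCR is total spam divided by (lambda * FP + FN + unsure spam). The FP weight of 10 is an assumption. No run here has a false positive, so it cannot be checked against these results. The function names are hypothetical.

```python
def cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.1):
    """Weighted cost of a run. fp_weight is an assumed value; the other
    two weights are inferred from the summaries in this comment."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


def tcr(n_spam, fp, fn, unsure_spam, lam=1):
    """Total cost ratio as it appears to be computed here: unsure spam
    counts as missed, unsure ham is ignored. Raises ZeroDivisionError
    when nothing at all is missed."""
    return n_spam / (lam * fp + fn + unsure_spam)


# Counts from the N_SIGNIFICANT_TOKENS=999 run above.
print(f"cost={cost(0, 1, 178):.2f}")        # cost=18.80
print(f"tcr={tcr(1814, 0, 1, 165):.3f}")    # tcr=10.928
```

Because FP is 0 in every run, the lambda term vanishes, which is why the l=1, l=5 and l=9 TCR values are identical on each line.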


N_SIGNIFICANT_TOKENS=150, ditto:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (24.747%) ..........|............................................
0.040 (18.522%) ..........|.................................
0.080 (31.123%) ..........|.......................................................
0.120 (13.057%) ..........|.......................
0.160 ( 5.820%) ..........|..........
0.160 ( 0.055%) #         |
0.200 ( 4.251%) ..........|........
0.240 ( 1.822%) ..........|...
0.280 ( 0.405%) ..........|.
0.280 ( 0.110%) ###       |
0.320 ( 0.152%) .....     |
0.320 ( 0.331%) ########  |
0.360 ( 0.101%) ....      |
0.360 ( 0.110%) ###       |
0.400 ( 0.772%) ##########|#
0.440 ( 0.165%) ####      |
0.480 ( 0.717%) ##########|#
0.520 ( 0.606%) ##########|#
0.560 ( 0.992%) ##########|#
0.600 ( 1.268%) ##########|##
0.640 ( 1.985%) ##########|##
0.680 ( 7.166%) ##########|########
0.720 (24.862%) ##########|#############################
0.760 (46.472%) ##########|#######################################################
0.800 (13.671%) ##########|################
0.840 ( 0.662%) ##########|#
0.920 ( 0.055%) #         |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$18.80
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:     1 0.055%
Unsure:   178 4.697%     (ham:    13 0.658%    spam:   165 9.096%)
TCRs:              l=1 10.928    l=5 10.928    l=9 10.928
SUMMARY: 0.30/0.70  fp     0 fn     1 uh    13 us   165    c 18.80
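
As a reading aid for the fp/fn/uh/us columns, here is a rough Python sketch of how scored messages could be tallied against the two cutoffs. The function name and sample data are hypothetical, and the handling of scores exactly on a cutoff is an assumption; the comment does not say which side the real script puts them on.

```python
def tally(scored, ham_cutoff=0.30, spam_cutoff=0.70):
    """Count fp, fn, unsure ham (uh) and unsure spam (us).

    scored is an iterable of (score, is_spam) pairs.
    Assumed boundaries: score < ham_cutoff is called ham,
    score >= spam_cutoff is called spam, anything between is unsure.
    """
    counts = {"fp": 0, "fn": 0, "uh": 0, "us": 0}
    for score, is_spam in scored:
        if score >= spam_cutoff:
            if not is_spam:
                counts["fp"] += 1      # ham classified as spam
        elif score < ham_cutoff:
            if is_spam:
                counts["fn"] += 1      # spam classified as ham
        else:
            counts["us" if is_spam else "uh"] += 1
    return counts


# Made-up scores, one message per outcome plus two correct ones.
sample = [(0.10, False), (0.10, True), (0.50, False),
          (0.50, True), (0.90, True), (0.90, False)]
result = tally(sample)
```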


Trying out a new tokenization, where the header and URIs are simply
"split on whitespace", but the body still uses the full OSBF tokenization,
is pretty bad compared to baseline:


SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 ( 4.706%) ..........|..........
0.040 (11.285%) ..........|........................
0.080 (11.842%) ..........|.........................
0.120 (25.860%) ..........|.......................................................
0.160 (25.607%) ..........|......................................................
0.200 (11.437%) ..........|........................
0.200 ( 0.055%) #         |
0.240 ( 6.174%) ..........|.............
0.280 ( 2.429%) ..........|.....
0.280 ( 0.165%) ###       |
0.320 ( 0.506%) ..........|.
0.320 ( 0.276%) #####     |
0.360 ( 0.051%) ..        |
0.360 ( 0.221%) ####      |
0.400 ( 0.772%) ##########|#
0.440 ( 0.221%) ####      |
0.480 ( 0.101%) ....      |
0.480 ( 1.433%) ##########|#
0.520 ( 0.606%) ##########|#
0.560 ( 1.433%) ##########|#
0.600 ( 2.150%) ##########|##
0.640 (16.869%) ##########|################
0.680 (58.545%) ##########|#######################################################
0.720 (17.089%) ##########|################
0.760 ( 0.110%) ##        |
0.840 ( 0.055%) #         |
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$101.10
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:     1 0.055%
Unsure:  1001 26.412%     (ham:    61 3.087%    spam:   940 51.819%)
TCRs:              l=1 1.928    l=5 1.928    l=9 1.928
SUMMARY: 0.30/0.70  fp     0 fn     1 uh    61 us   940    c 101.10
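
To make the two tokenization styles concrete, here is a small Python sketch contrasting a plain whitespace split with a generic orthogonal-sparse-bigram (OSB) scheme. It is only an illustration: the window size of 5, the "_" gap marker, and the function names are assumptions, and the tokenizer actually under test may differ in its details.

```python
def split_tokens(text):
    """Plain whitespace tokenization: one feature per word."""
    return text.split()


def osb_tokens(text, window=5):
    """Generic OSB sketch: pair each word with each of the next
    (window - 1) words, marking every skipped position with "_"."""
    words = text.split()
    features = []
    for i, first in enumerate(words):
        for gap in range(1, window):
            j = i + gap
            if j >= len(words):
                break
            # gap - 1 placeholders stand in for the skipped words
            features.append(f"{first} {'_ ' * (gap - 1)}{words[j]}")
    return features


# Hypothetical header value used only for illustration.
header = "buy cheap pills"
print(split_tokens(header))   # ['buy', 'cheap', 'pills']
print(osb_tokens(header))     # ['buy cheap', 'buy _ pills', 'cheap pills']
```

The whitespace split discards all word-order context. That would fit with the headers and URIs becoming much less decisive in this run, but the comment does not give a cause.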



split(' ') for just headers is also not an improvement:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (11.184%) ..........|...............
0.040 (34.615%) ..........|..............................................
0.080 (41.346%) ..........|.......................................................
0.120 (10.273%) ..........|..............
0.120 ( 0.055%) #         |
0.160 ( 1.569%) ..........|..
0.200 ( 0.709%) ..........|.
0.200 ( 0.055%) #         |
0.240 ( 0.304%) ........  |
0.240 ( 0.827%) ##########|#
0.280 ( 0.165%) ###       |
0.320 ( 0.221%) ###       |
0.360 ( 0.386%) ######    |
0.400 ( 0.551%) ######### |
0.440 ( 0.551%) ######### |
0.480 ( 1.268%) ##########|#
0.520 ( 0.992%) ##########|#
0.560 ( 1.764%) ##########|#
0.600 ( 1.488%) ##########|#
0.640 ( 5.347%) ##########|####
0.680 (70.232%) ##########|#######################################################
0.720 (14.939%) ##########|############
0.760 ( 1.103%) ##########|#
0.840 ( 0.055%) #         |
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$102.30
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:    17 0.937%
Unsure:   853 22.507%     (ham:     0 0.000%    spam:   853 47.023%)
TCRs:              l=1 2.085    l=5 2.085    l=9 2.085
SUMMARY: 0.30/0.70  fp     0 fn    17 uh     0 us   853    c 102.30


Tokenizing just URLs this way is even worse (see that false negative
creeping closer to 0.0):

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (10.374%) ..........|.............
0.040 (34.109%) ..........|...........................................
0.080 (43.168%) ..........|.......................................................
0.080 ( 0.055%) #         |
0.120 ( 9.818%) ..........|.............
0.160 ( 1.518%) ..........|..
0.200 ( 0.759%) ..........|.
0.200 ( 0.055%) #         |
0.240 ( 0.253%) ......    |
0.240 ( 0.827%) ##########|#
0.280 ( 0.165%) ###       |
0.320 ( 0.221%) ###       |
0.360 ( 0.386%) ######    |
0.400 ( 0.551%) ######### |
0.440 ( 0.551%) ######### |
0.480 ( 1.323%) ##########|#
0.520 ( 0.937%) ##########|#
0.560 ( 1.985%) ##########|##
0.600 ( 1.213%) ##########|#
0.640 ( 5.788%) ##########|#####
0.680 (69.901%) ##########|#######################################################
0.720 (14.939%) ##########|############
0.760 ( 1.047%) ##########|#
0.840 ( 0.055%) #         |
Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$102.80
Total ham:spam:   1976:1814
FP:     0 0.000%    FN:    17 0.937%
Unsure:   858 22.639%     (ham:     0 0.000%    spam:   858 47.299%)
TCRs:              l=1 2.073    l=5 2.073    l=9 2.073
SUMMARY: 0.30/0.70  fp     0 fn    17 uh     0 us   858    c 102.80

Interesting!  These were all tweaks I thought might help, but they
really don't -- the graphs and figures don't lie.  The baseline
tokenization just works better in all my testing...




