http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5686





------- Additional Comments From [EMAIL PROTECTED]  2007-10-23 05:08 -------
(In reply to comment #7)
> 3. different tokenization

so I tried some of this out last night; I took one of the persistent FNs that
keeps showing up around the 0.2 mark, and examined the tokens being generated
during tokenization.  It turned out that some of the OSBF tokenization didn't
cope well with some of *our* tokens.

1. The decomposed address tokens, like "UD*jmason.org" for an email addr
containing the domain "jmason.org", were being split up into two tokens "UD*" and
"jmason.org" -- not useful -- so I fixed that;

2. the "key=value" metadata in the X-Spam-Relays headers was similarly being
broken up into "key=", "value".  fixed.
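
To illustrate the two fixes above, here is a minimal Python sketch of the
difference between splitting at "*" and "=" and keeping those tokens whole.
This is not the code checked in to SpamAssassin; the regexes, the function
names and the sample relay text are all assumptions made up for illustration.

```python
import re

# Hypothetical "naive" splitter: a run of word-ish characters, with a
# trailing "*" or "=" left attached to the piece before it.
NAIVE = re.compile(r"[A-Za-z0-9.\-]+[*=]?")

# Hypothetical "atomic" splitter: try the composite forms first so they
# survive as single tokens, then fall back to plain word-ish runs.
ATOMIC = re.compile(
    r"[A-Z]+\*[A-Za-z0-9.\-]+"    # prefixed address parts, e.g. UD*jmason.org
    r"|[A-Za-z0-9\-]+=[^\s\]]+"   # key=value metadata pairs
    r"|[A-Za-z0-9.\-]+"           # everything else
)


def naive_tokens(text):
    """Split text the way that loses the prefix/key association."""
    return NAIVE.findall(text)


def atomic_tokens(text):
    """Split text while keeping prefixed and key=value tokens intact."""
    return ATOMIC.findall(text)


if __name__ == "__main__":
    sample = "UD*jmason.org helo=example.com"  # made-up sample input
    print(naive_tokens(sample))   # ['UD*', 'jmason.org', 'helo=', 'example.com']
    print(atomic_tokens(sample))  # ['UD*jmason.org', 'helo=example.com']
```

The point of keeping the composite form is that "jmason.org" on its own says
nothing about where in the message it was seen, while "UD*jmason.org" does.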

this is checked in as r587469.  here's a histogram:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 (21.949%) ..........|............................................
0.040 (21.620%) ..........|...........................................
0.080 (27.737%) ..........|.......................................................
0.120 (12.351%) ..........|........................
0.160 ( 6.993%) ..........|..............
0.160 ( 0.044%) #         |
0.200 ( 4.802%) ..........|..........
0.200 ( 0.006%)           |
0.240 ( 2.656%) ..........|.....
0.280 ( 1.169%) ..........|..
0.280 ( 0.055%) #         |
0.320 ( 0.400%) ..........|.
0.320 ( 0.215%) #####     |
0.360 ( 0.172%) .......   |
0.360 ( 0.287%) #######   |
0.400 ( 0.056%) ..        |
0.400 ( 0.287%) #######   |
0.440 ( 0.083%) ##        |
0.480 ( 0.096%) ....      |
0.480 ( 1.075%) ##########|#
0.520 ( 0.276%) #######   |
0.560 ( 0.573%) ##########|#
0.600 ( 0.843%) ##########|#
0.640 ( 1.725%) ##########|##
0.680 ( 5.545%) ##########|#######
0.720 (20.387%) ##########|########################
0.760 (46.555%) ##########|#######################################################
0.800 (20.800%) ##########|#########################
0.840 ( 1.141%) ##########|#
0.880 ( 0.017%)           |
0.920 ( 0.017%)           |
0.960 ( 0.072%) ##        |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$178.60
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:     9 0.050%
Unsure:  1696 4.474%     (ham:   374 1.892%    spam:  1322 7.286%)
TCRs:              l=1 13.632    l=5 13.632    l=9 13.632

Threshold optimization for hamcutoff=0.30, spamcutoff=0.54: cost=$130.40
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:    11 0.061%
Unsure:   597 1.575%     (ham:   220 1.113%    spam:   377 2.078%)
TCRs:              l=1 46.763    l=5 46.763    l=9 46.763
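
For anyone wanting to check the numbers: the TCR and percentage figures in
both blocks can be reproduced with the Python sketch below. It assumes TCR is
computed as nspam / (lambda * FP + FN + unsure spam), i.e. that an "unsure"
spam counts as a miss. That formula is inferred from the figures above, not
taken from the script that produced them, and the function names are made up.

```python
def tcr(n_spam, fp, fn, unsure_spam, lam=1):
    """Total cost ratio, assuming unsure spam counts as missed spam.

    Raises ZeroDivisionError if nothing was missed and there were no FPs.
    """
    return n_spam / (lam * fp + fn + unsure_spam)


def pct(part, whole):
    """Percentage rounded to three decimals, as in the report."""
    return round(100.0 * part / whole, 3)


if __name__ == "__main__":
    n_ham, n_spam = 19764, 18144

    # spamcutoff=0.70: FP=0, FN=9, unsure ham=374, unsure spam=1322
    print(round(tcr(n_spam, 0, 9, 1322), 3))      # 13.632
    print(pct(9, n_spam))                         # 0.05
    print(pct(374 + 1322, n_ham + n_spam))        # 4.474

    # spamcutoff=0.54: FP=0, FN=11, unsure ham=220, unsure spam=377
    print(round(tcr(n_spam, 0, 11, 377), 3))      # 46.763
    print(pct(220 + 377, n_ham + n_spam))         # 1.575
```

With FP at zero the lambda weight has nothing to multiply, which would
explain why l=1, l=5 and l=9 all show the same value in each block.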


looking quite a bit better!


