http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5686
------- Additional Comments From [EMAIL PROTECTED]  2007-10-24 03:04 -------

More meddling with tokenization. r587841 is an experiment to discard OSBF-style tokenization and just use the simpler SpamAssassin "split on whitespace" tokenization with the OSBF bigram format:

SCORE  NUMHIT   DETAIL     OVERALL HISTOGRAM  (. = ham, # = spam)
0.000 ( 9.173%) ..........|...........
0.040 (21.726%) ..........|.........................
0.040 ( 0.011%)           |
0.080 (47.814%) ..........|.......................................................
0.080 ( 0.017%)           |
0.120 (15.204%) ..........|.................
0.120 ( 0.017%)           |
0.160 ( 3.527%) ..........|....
0.160 ( 0.006%)           |
0.200 ( 1.331%) ..........|..
0.200 ( 0.022%)           |
0.240 ( 0.653%) ..........|.
0.240 ( 0.143%) ##        |
0.280 ( 0.263%) ......    |
0.280 ( 0.397%) ######    |
0.320 ( 0.126%) ...       |
0.320 ( 0.171%) ###       |
0.360 ( 0.121%) ...       |
0.360 ( 0.243%) ####      |
0.400 ( 0.040%) .         |
0.400 ( 0.303%) #####     |
0.440 ( 0.020%)           |
0.440 ( 0.353%) ######    |
0.480 ( 0.496%) ########  |
0.520 ( 0.623%) ##########|
0.560 ( 0.579%) ######### |
0.600 ( 0.882%) ##########|#
0.640 ( 1.295%) ##########|#
0.680 ( 1.554%) ##########|#
0.720 (11.001%) ##########|#########
0.760 (69.604%) ##########|#######################################################
0.800 (11.436%) ##########|#########
0.840 ( 0.777%) ##########|#
0.880 ( 0.011%)           |
0.960 ( 0.061%) #         |

Threshold optimization for hamcutoff=0.30, spamcutoff=0.70: cost=$160.00
Total ham:spam:   19764:18144
FP:     0 0.000%    FN:    39 0.215%
Unsure:  1210 3.192%     (ham:   113 0.572%    spam:  1097 6.046%)
TCRs:              l=1 15.972    l=5 15.972    l=9 15.972
SUMMARY: 0.30/0.70  fp 0 fn 39 uh 113 us 1097 c 160.00

So I think that basically doesn't work too well. There are a high number of one-off spam FNs scattered around the 0.040-0.440 range, and ham FP at 0.880, which the more complex OSBF tokenization style avoids.
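
For readers unfamiliar with the idea, here is a rough Python sketch of what "split on whitespace" tokens combined into OSBF-style sparse bigrams could look like. It is not the code from r587841: the function names, the window size of 5, and the `<skip>` placeholder are all assumptions made for illustration.

```python
def whitespace_tokens(text):
    """Plain 'split on whitespace' tokenization."""
    return text.split()


def sparse_bigrams(tokens, window=5):
    """Pair each token with every earlier token inside a sliding window.

    Skipped positions are marked with '<skip>' so that pairs at
    different distances stay distinct. Window size and marker are
    illustrative assumptions, not taken from the actual commit.
    """
    features = []
    for i, tok in enumerate(tokens):
        for dist in range(1, window):
            j = i - dist
            if j < 0:
                break
            gap = "<skip> " * (dist - 1)
            features.append(f"{tokens[j]} {gap}{tok}")
    return features


if __name__ == "__main__":
    for feature in sparse_bigrams(whitespace_tokens("buy cheap pills now")):
        print(feature)
```

Each token is paired with up to `window - 1` predecessors, so the feature count grows roughly linearly with message length.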

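
The cost and TCR figures in the summary can be reproduced from the raw counts. The sketch below assumes weights of 10 per FP, 1 per FN, and 0.1 per unsure message, and a TCR that counts unsure spam as missed. These formulas are inferred from the reported numbers rather than taken from the script that produced them, and since FP is 0 here the FP weight cannot be checked against this run.

```python
# Counts copied from the threshold report in this comment.
HAM_TOTAL, SPAM_TOTAL = 19764, 18144
FP, FN = 0, 39
UNSURE_HAM, UNSURE_SPAM = 113, 1097


def cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.1):
    """Weighted cost; the three weights are assumptions (see lead-in)."""
    return fp_weight * fp + fn_weight * fn + unsure_weight * unsure


def tcr(lam, fp=FP, fn=FN, unsure_spam=UNSURE_SPAM, spam_total=SPAM_TOTAL):
    """Total cost ratio, treating unsure spam as missed (assumed)."""
    return spam_total / (lam * fp + fn + unsure_spam)


if __name__ == "__main__":
    print(f"cost = {cost(FP, FN, UNSURE_HAM + UNSURE_SPAM):.2f}")
    for lam in (1, 5, 9):
        print(f"TCR l={lam}: {tcr(lam):.3f}")
    print(f"FN rate: {100 * FN / SPAM_TOTAL:.3f}%")
```

Because FP is 0, the lambda weighting has no effect, which is why the three TCR values in the report are identical.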