http://bugzilla.spamassassin.org/show_bug.cgi?id=2419
------- Additional Comments From [EMAIL PROTECTED] 2004-05-06 01:21 -------
Subject: Re: Increased Bayes Score Breakdown Near Extremes
> Dan -- are you suggesting merging some of the intermediate ranges into
> a smaller number of rules? care to make a concrete suggestion so we
> can close the bug? ;)
I'm suggesting recalibrating around the hits a bit more closely. I
needed to do a bit more work to make a concrete suggestion...
> do these ranges make sense:
>
> 0.00, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 0.8, ... (mirror image on other end)
Here's a histogram of my Bayes probabilities (ham and spam all lumped
together), using autolearning, from my current nightly corpus of 15k
spam and 15k ham. The second column is the percentage of all messages
that fall in each range.
0.00-0.01 46.530
0.01-0.02 0.310
0.02-0.03 0.157
0.03-0.04 0.136
0.04-0.05 0.078
0.05-0.06 0.085
0.06-0.07 0.102
0.07-0.08 0.055
0.08-0.09 0.082
0.09-0.10 0.051
0.10-0.11 0.044
0.11-0.12 0.041
0.12-0.13 0.041
0.13-0.14 0.044
0.14-0.15 0.031
0.15-0.16 0.034
0.16-0.17 0.041
0.17-0.18 0.031
0.18-0.19 0.048
0.19-0.20 0.034
0.20-0.21 0.034
0.21-0.22 0.041
0.22-0.23 0.031
0.23-0.24 0.027
0.24-0.25 0.014
0.25-0.26 0.031
0.26-0.27 0.027
0.27-0.28 0.031
0.28-0.29 0.027
0.29-0.30 0.027
0.30-0.31 0.007
0.31-0.32 0.007
0.32-0.33 0.041
0.33-0.34 0.041
0.34-0.35 0.024
0.35-0.36 0.024
0.36-0.37 0.034
0.37-0.38 0.048
0.38-0.39 0.027
0.39-0.40 0.041
0.40-0.41 0.027
0.41-0.42 0.020
0.42-0.43 0.034
0.43-0.44 0.048
0.44-0.45 0.041
0.45-0.46 0.027
0.46-0.47 0.058
0.47-0.48 0.082
0.48-0.49 0.106
0.49-0.50 0.733
0.50-0.51 0.746
0.51-0.52 0.089
0.52-0.53 0.048
0.53-0.54 0.061
0.54-0.55 0.055
0.55-0.56 0.048
0.56-0.57 0.044
0.57-0.58 0.027
0.58-0.59 0.041
0.59-0.60 0.034
0.60-0.61 0.051
0.61-0.62 0.027
0.62-0.63 0.024
0.63-0.64 0.034
0.64-0.65 0.048
0.65-0.66 0.051
0.66-0.67 0.068
0.67-0.68 0.055
0.68-0.69 0.034
0.69-0.70 0.051
0.70-0.71 0.014
0.71-0.72 0.017
0.72-0.73 0.034
0.73-0.74 0.041
0.74-0.75 0.024
0.75-0.76 0.041
0.76-0.77 0.034
0.77-0.78 0.041
0.78-0.79 0.041
0.79-0.80 0.037
0.80-0.81 0.027
0.81-0.82 0.024
0.82-0.83 0.048
0.83-0.84 0.055
0.84-0.85 0.034
0.85-0.86 0.068
0.86-0.87 0.048
0.87-0.88 0.034
0.88-0.89 0.068
0.89-0.90 0.055
0.90-0.91 0.044
0.91-0.92 0.065
0.92-0.93 0.102
0.93-0.94 0.075
0.94-0.95 0.092
0.95-0.96 0.133
0.96-0.97 0.174
0.97-0.98 0.204
0.98-0.99 0.378
0.99-1.00 46.581
I think the outside 1% can have their own range no problem.
If you take the next 9% it's 1.267% on the high side and 1.056% on the
low side, maybe enough.
The next 10% is only 0.461% on the high side and 0.389% on the low side,
probably not enough, and not much bigger for 20%, so I'd make the next
slice 30% wide: 1.228% on the high side and 0.973% on the low side.
The middle is a bit busier, so I'd go back to 10% slices on either side
of 50%: 1.193% on the high side and 1.176% on the low side.
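The slice totals above are just sums over the 1%-wide histogram buckets. As a minimal sketch in Python (the helper name slice_total is made up for illustration), the two "next 9%" figures can be re-derived like this:

```python
# Percent of all messages per 1%-wide bucket, copied from the histogram above.
# Only the buckets for the two "next 9%" slices are included here.
LOW_NEXT_9 = [0.310, 0.157, 0.136, 0.078, 0.085, 0.102, 0.055, 0.082, 0.051]   # 0.01-0.10
HIGH_NEXT_9 = [0.044, 0.065, 0.102, 0.075, 0.092, 0.133, 0.174, 0.204, 0.378]  # 0.90-0.99


def slice_total(buckets):
    """Hypothetical helper: total percentage of messages in a merged slice."""
    return round(sum(buckets), 3)


print(slice_total(LOW_NEXT_9))   # 1.056
print(slice_total(HIGH_NEXT_9))  # 1.267
```

The same helper works for any other slice if you paste in the matching histogram rows.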
So, my guess is something like one of these rule sets (names are the
average score in the range, rounded off to the nearest .05):
body BAYES_00 eval:check_bayes('0.00', '0.01')
body BAYES_05 eval:check_bayes('0.01', '0.10')
body BAYES_25 eval:check_bayes('0.10', '0.40')
body BAYES_45 eval:check_bayes('0.40', '0.50')
body BAYES_55 eval:check_bayes('0.50', '0.60')
body BAYES_75 eval:check_bayes('0.60', '0.90')
body BAYES_95 eval:check_bayes('0.90', '0.99')
body BAYES_99 eval:check_bayes('0.99', '1.00')
or perhaps have a really indeterminate middle range:
body BAYES_00 eval:check_bayes('0.00', '0.01')
body BAYES_05 eval:check_bayes('0.01', '0.10')
body BAYES_30 eval:check_bayes('0.10', '0.49')
body BAYES_50 eval:check_bayes('0.49', '0.51')
body BAYES_70 eval:check_bayes('0.51', '0.90')
body BAYES_95 eval:check_bayes('0.90', '0.99')
body BAYES_99 eval:check_bayes('0.99', '1.00')
But the middle 10% is pretty darn useless:
range spam% ham%
0.00-0.05 0.020 94.457
0.05-0.10 0.014 0.736
0.10-0.15 0.014 0.389
0.15-0.20 0.000 0.375
0.20-0.25 0.000 0.293
0.25-0.30 0.007 0.280
0.30-0.35 0.000 0.239
0.35-0.40 0.000 0.348
0.40-0.45 0.007 0.334
0.45-0.50 0.252 1.602 <- not great
0.50-0.55 1.526 0.627 <- not great
0.55-0.60 0.347 0.041
0.60-0.65 0.327 0.041
0.65-0.70 0.470 0.048
0.70-0.75 0.252 0.007
0.75-0.80 0.375 0.014
0.80-0.85 0.341 0.034
0.85-0.90 0.531 0.014
0.90-0.95 0.722 0.034
0.95-1.00 94.797 0.089
now, what if we break down the middle a bit more...
range spam% ham% S/O ratio
0.40-0.41 0.007 0.048 0.127273
0.41-0.42 0.000 0.041 0
0.42-0.43 0.000 0.068 0
0.43-0.44 0.000 0.095 0
0.44-0.45 0.000 0.082 0
0.45-0.46 0.000 0.055 0
0.46-0.47 0.000 0.116 0
0.47-0.48 0.014 0.150 0.0853659
0.48-0.49 0.014 0.198 0.0660377
0.49-0.50 0.286 1.180 0.195089 <- yuck
0.50-0.51 1.008 0.484 0.675603 <- yuck
0.51-0.52 0.143 0.034 0.80791 <- yuck
0.52-0.53 0.095 0.000 1
0.53-0.54 0.116 0.007 0.943089
0.54-0.55 0.102 0.007 0.93578
0.55-0.56 0.095 0.000 1
0.56-0.57 0.082 0.007 0.921348
0.57-0.58 0.041 0.014 0.745455
0.58-0.59 0.082 0.000 1
0.59-0.60 0.048 0.020 0.705882
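The S/O column in this table is consistent with spam% / (spam% + ham%). That formula is inferred from the numbers here rather than taken from the mass-check tools, so treat this Python sketch (the function name so_ratio is hypothetical) as a sanity check only:

```python
def so_ratio(spam_pct, ham_pct):
    """Fraction of hits in a range that are spam (assumed S/O definition)."""
    total = spam_pct + ham_pct
    if total == 0:
        return None  # no hits in the range, so the ratio is undefined
    return spam_pct / total


# Rows copied from the table above: (range, spam%, ham%)
ROWS = [
    ("0.40-0.41", 0.007, 0.048),
    ("0.49-0.50", 0.286, 1.180),
    ("0.50-0.51", 1.008, 0.484),
]

for name, spam, ham in ROWS:
    print(name, round(so_ratio(spam, ham), 6))
```

The printed values should agree with the S/O column to within rounding.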
In the end, I tried a bunch of middle ranges and it probably won't
matter much: even making the middle 40% wide doesn't really do all that
much for S/O ratios (which ultimately correlate well with the score).
Here are some possible ranges:
20% wide in middle:
RANGE SPAM% HAM% S/O ratio
0.00-0.01 0.014 93.100 0.000150353
0.01-0.10 0.020 2.093 0.00946522
0.10-0.40 0.020 1.923 0.0102934
0.40-0.60 2.132 2.605 0.450074
0.60-0.90 2.295 0.157 0.935971
0.90-0.99 2.465 0.068 0.973154
0.99-1.00 93.053 0.055 0.999409
10% wide in middle:
RANGE SPAM% HAM% S/O ratio
0.00-0.01 0.014 93.100 0.000150353
0.01-0.10 0.020 2.093 0.00946522
0.10-0.45 0.027 2.257 0.0118214
0.45-0.55 1.778 2.230 0.443613
0.55-0.90 2.643 0.198 0.930306
0.90-0.99 2.465 0.068 0.973154
0.99-1.00 93.053 0.055 0.999409
4% wide in middle:
RANGE SPAM% HAM% S/O ratio
0.00-0.01 0.014 93.100 0.000150353
0.01-0.10 0.020 2.093 0.00946522
0.10-0.48 0.041 2.577 0.0156608
0.48-0.52 1.451 1.896 0.433523
0.52-0.90 2.956 0.211 0.933375
0.90-0.99 2.465 0.068 0.973154
0.99-1.00 93.053 0.055 0.999409
2% wide in middle:
RANGE SPAM% HAM% S/O ratio
0.00-0.01 0.014 93.100 0.000150353
0.01-0.10 0.020 2.093 0.00946522
0.10-0.48 0.041 2.577 0.0156608
0.48-0.52 1.451 1.896 0.433523
0.52-0.90 2.956 0.211 0.933375
0.90-0.99 2.465 0.068 0.973154
0.99-1.00 93.053 0.055 0.999409
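Each candidate row above is a merge of finer-grained bins. Here is a rough Python sketch of that merge, using the 1% breakdown of the middle from earlier in this comment. The merge helper is an illustrative name, and S/O is assumed to be spam% / (spam% + ham%):

```python
# (low edge, spam%, ham%) for the 1%-wide bins between 0.48 and 0.52,
# copied from the 1% breakdown of the middle earlier in this comment.
FINE_BINS = [
    (0.48, 0.014, 0.198),
    (0.49, 0.286, 1.180),
    (0.50, 1.008, 0.484),
    (0.51, 0.143, 0.034),
]


def merge(bins):
    """Sum spam% and ham% over the bins and recompute S/O for the merged range."""
    spam = sum(b[1] for b in bins)
    ham = sum(b[2] for b in bins)
    return round(spam, 3), round(ham, 3), spam / (spam + ham)


spam, ham, so = merge(FINE_BINS)
print(f"0.48-0.52 {spam:.3f} {ham:.3f} {so:.6f}")
```

This should reproduce the 0.48-0.52 row of the 4% wide table. Merging the coarser 5% bins instead can be off by 0.001 because those figures are already rounded.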
I like the 10% wide option the best, since the .10 to .45 range is more
usable than the 4% wide option and not quite as wide as the 20% option.
But since the middle 20% got a score of zero in 2.6x, there's not much
reason to go any narrower than that.
Yes, the middle is USELESS, so I don't want to waste rules on it.
I did a bit more tweaking. Before I go on too much longer, here's what
I liked the best:
RANGE SPAM% HAM% S/O ratio
0.00-0.01 0.014 93.100 0.000150353
0.01-0.05 0.007 1.357 0.00513196
0.05-0.40 0.034 2.659 0.0126253
0.40-0.60 2.132 2.605 0.450074
0.60-0.95 3.017 0.191 0.940461
0.95-0.99 1.744 0.034 0.980877
0.99-1.00 93.053 0.055 0.999409
body BAYES_00 eval:check_bayes('0.00', '0.01')
body BAYES_05 eval:check_bayes('0.01', '0.05')
body BAYES_25 eval:check_bayes('0.05', '0.40')
body BAYES_50 eval:check_bayes('0.40', '0.60')
body BAYES_75 eval:check_bayes('0.60', '0.95')
body BAYES_95 eval:check_bayes('0.95', '0.99')
body BAYES_99 eval:check_bayes('0.99', '1.00')
or perhaps if you want to break it down a bit more:
RANGE SPAM% HAM% S/O ratio
0.00-0.01 0.014 93.100 0.000150353
0.01-0.05 0.007 1.357 0.00513196
0.05-0.20 0.027 1.500 0.0176817
0.20-0.40 0.007 1.159 0.00600343
0.40-0.60 2.132 2.605 0.450074
0.60-0.80 1.423 0.109 0.928851
0.80-0.95 1.594 0.082 0.951074
0.95-0.99 1.744 0.034 0.980877
0.99-1.00 93.053 0.055 0.999409
body BAYES_00 eval:check_bayes('0.00', '0.01')
body BAYES_05 eval:check_bayes('0.01', '0.05')
body BAYES_10 eval:check_bayes('0.05', '0.20')
body BAYES_25 eval:check_bayes('0.20', '0.40')
body BAYES_50 eval:check_bayes('0.40', '0.60')
body BAYES_75 eval:check_bayes('0.60', '0.80')
body BAYES_90 eval:check_bayes('0.80', '0.95')
body BAYES_95 eval:check_bayes('0.95', '0.99')
body BAYES_99 eval:check_bayes('0.99', '1.00')
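When hand-editing range lists like these it is easy to leave a gap or an overlap. The following Python sketch checks that the 9-rule variant tiles 0.00 to 1.00 and regenerates the rule lines. The names check_contiguous and render are hypothetical helpers, not part of SpamAssassin, and the sketch says nothing about how check_bayes treats the boundary values themselves:

```python
# (rule name, low bound, high bound), copied from the 9-rule variant above.
RULES = [
    ("BAYES_00", "0.00", "0.01"),
    ("BAYES_05", "0.01", "0.05"),
    ("BAYES_10", "0.05", "0.20"),
    ("BAYES_25", "0.20", "0.40"),
    ("BAYES_50", "0.40", "0.60"),
    ("BAYES_75", "0.60", "0.80"),
    ("BAYES_90", "0.80", "0.95"),
    ("BAYES_95", "0.95", "0.99"),
    ("BAYES_99", "0.99", "1.00"),
]


def check_contiguous(rules):
    """Raise ValueError unless the ranges run from 0.00 to 1.00 with no gaps."""
    prev_hi = "0.00"
    for name, lo, hi in rules:
        if lo != prev_hi:
            raise ValueError(f"{name}: starts at {lo}, expected {prev_hi}")
        if float(lo) >= float(hi):
            raise ValueError(f"{name}: empty or reversed range {lo}-{hi}")
        prev_hi = hi
    if prev_hi != "1.00":
        raise ValueError(f"ranges stop at {prev_hi}, expected 1.00")


def render(rules):
    """Format each entry as a rule line in the style used in this comment."""
    return [f"body {name} eval:check_bayes('{lo}', '{hi}')" for name, lo, hi in rules]


check_contiguous(RULES)
print("\n".join(render(RULES)))
```

The bounds are kept as strings so the comparison between one rule's upper bound and the next rule's lower bound is exact.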
For a quick comparison, here are the current BAYES rules (same
mass-check). It's obvious there are too many ranges, because some
sections are really thin and the RANK doesn't change much for long
stretches (denoted by the character at the end of each line):
OVERALL% SPAM% HAM% S/O RANK SCORE NAME
45.552 0.0133 91.1482 0.000 1.00 0.00 BAYES_00 ***
0.304 0.0067 0.6008 0.011 0.56 0.00 BAYES_01 X
0.364 0.0000 0.7276 0.000 0.58 0.00 BAYES_02 X
0.367 0.0133 0.7210 0.018 0.58 0.00 BAYES_05 X
0.380 0.0133 0.7477 0.018 0.58 0.00 BAYES_10 X
0.284 0.0067 0.5607 0.012 0.56 0.00 BAYES_20 X
0.287 0.0000 0.5741 0.000 0.56 0.00 BAYES_30 X
0.127 0.0067 0.2470 0.026 0.50 0.00 BAYES_40 -
1.014 0.2934 1.7356 0.145 0.54 0.00 BAYES_44 -
1.034 1.5401 0.5274 0.745 0.48 0.00 BAYES_50 -
0.143 0.2467 0.0401 0.860 0.48 0.00 BAYES_56 -
0.434 0.7801 0.0868 0.900 0.54 0.00 BAYES_60 X
0.317 0.6134 0.0200 0.968 0.56 0.00 BAYES_70 X
0.450 0.8534 0.0467 0.948 0.58 0.00 BAYES_80 X
0.370 0.7067 0.0334 0.955 0.56 0.00 BAYES_90 X
0.500 0.9734 0.0267 0.973 0.61 0.00 BAYES_95 *
0.370 0.7334 0.0067 0.991 0.58 0.00 BAYES_98 **
45.602 91.0927 0.0534 0.999 0.98 0.00 BAYES_99 ***
If you had to make me pick, I'd go for the 9-rule variant above.
Daniel