[Bug 2419] Increased Bayes Score Breakdown Near Extremes

bugzilla-daemon 6 May 2004 15:53:54 -0000

http://bugzilla.spamassassin.org/show_bug.cgi?id=2419






------- Additional Comments From [EMAIL PROTECTED]  2004-05-06 01:21 -------
Subject: Re:  Increased Bayes Score Breakdown Near Extremes


> Dan -- are you suggesting merging some of the intermediate ranges into
> a smaller number of rules?  care to make a concrete suggestion so we
> can close the bug? ;)

I'm suggesting recalibrating around the hits a bit more closely.  I
needed to do a bit more work to make a concrete suggestion...

> do these ranges make sense:
>
> 0.00, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 0.8, ... (mirror image on other end)

Here's a histogram of my Bayes percentages (ham and spam all lumped
together), with autolearning, on my current nightly corpus of 15k spam
and 15k ham.

 0.00-0.01 46.530
 0.01-0.02  0.310
 0.02-0.03  0.157
 0.03-0.04  0.136
 0.04-0.05  0.078
 0.05-0.06  0.085
 0.06-0.07  0.102
 0.07-0.08  0.055
 0.08-0.09  0.082
 0.09-0.10  0.051

 0.10-0.11  0.044
 0.11-0.12  0.041
 0.12-0.13  0.041
 0.13-0.14  0.044
 0.14-0.15  0.031
 0.15-0.16  0.034
 0.16-0.17  0.041
 0.17-0.18  0.031
 0.18-0.19  0.048
 0.19-0.20  0.034

 0.20-0.21  0.034
 0.21-0.22  0.041
 0.22-0.23  0.031
 0.23-0.24  0.027
 0.24-0.25  0.014
 0.25-0.26  0.031
 0.26-0.27  0.027
 0.27-0.28  0.031
 0.28-0.29  0.027
 0.29-0.30  0.027

 0.30-0.31  0.007
 0.31-0.32  0.007
 0.32-0.33  0.041
 0.33-0.34  0.041
 0.34-0.35  0.024
 0.35-0.36  0.024
 0.36-0.37  0.034
 0.37-0.38  0.048
 0.38-0.39  0.027
 0.39-0.40  0.041

 0.40-0.41  0.027
 0.41-0.42  0.020
 0.42-0.43  0.034
 0.43-0.44  0.048
 0.44-0.45  0.041
 0.45-0.46  0.027
 0.46-0.47  0.058
 0.47-0.48  0.082
 0.48-0.49  0.106
 0.49-0.50  0.733

 0.50-0.51  0.746
 0.51-0.52  0.089
 0.52-0.53  0.048
 0.53-0.54  0.061
 0.54-0.55  0.055
 0.55-0.56  0.048
 0.56-0.57  0.044
 0.57-0.58  0.027
 0.58-0.59  0.041
 0.59-0.60  0.034

 0.60-0.61  0.051
 0.61-0.62  0.027
 0.62-0.63  0.024
 0.63-0.64  0.034
 0.64-0.65  0.048
 0.65-0.66  0.051
 0.66-0.67  0.068
 0.67-0.68  0.055
 0.68-0.69  0.034
 0.69-0.70  0.051

 0.70-0.71  0.014
 0.71-0.72  0.017
 0.72-0.73  0.034
 0.73-0.74  0.041
 0.74-0.75  0.024
 0.75-0.76  0.041
 0.76-0.77  0.034
 0.77-0.78  0.041
 0.78-0.79  0.041
 0.79-0.80  0.037

 0.80-0.81  0.027
 0.81-0.82  0.024
 0.82-0.83  0.048
 0.83-0.84  0.055
 0.84-0.85  0.034
 0.85-0.86  0.068
 0.86-0.87  0.048
 0.87-0.88  0.034
 0.88-0.89  0.068
 0.89-0.90  0.055

 0.90-0.91  0.044
 0.91-0.92  0.065
 0.92-0.93  0.102
 0.93-0.94  0.075
 0.94-0.95  0.092
 0.95-0.96  0.133
 0.96-0.97  0.174
 0.97-0.98  0.204
 0.98-0.99  0.378
 0.99-1.00 46.581

I think the outside 1% can have their own range no problem.

If you take the next 9% it's 1.267% on the high side and 1.056% on the
low side, maybe enough.

The next 10% is only 0.461% on the high side and 0.389% on the low side,
probably not enough, and it's not much bigger for 20%, so I'd make the
next slice 30% wide: 1.228% on the high side and 0.973% on the low side.

The middle is a bit more busy, I'd go back to 10% slices on either side
of 50%.  1.193% on the high side and 1.176% on the low side.
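
For anyone wanting to check the slice totals above, here is a minimal
Python sketch that sums the 1%-wide histogram bins into wider ranges.
The HIST list is copied from the histogram in this comment; the mass()
helper and its integer-percent arguments are made up for illustration
and are not part of SpamAssassin.

```python
# Percentage of all messages per 1%-wide Bayes bin, copied from the
# histogram above. HIST[i] covers probabilities i/100 .. (i+1)/100.
HIST = [
    46.530, 0.310, 0.157, 0.136, 0.078, 0.085, 0.102, 0.055, 0.082, 0.051,
    0.044, 0.041, 0.041, 0.044, 0.031, 0.034, 0.041, 0.031, 0.048, 0.034,
    0.034, 0.041, 0.031, 0.027, 0.014, 0.031, 0.027, 0.031, 0.027, 0.027,
    0.007, 0.007, 0.041, 0.041, 0.024, 0.024, 0.034, 0.048, 0.027, 0.041,
    0.027, 0.020, 0.034, 0.048, 0.041, 0.027, 0.058, 0.082, 0.106, 0.733,
    0.746, 0.089, 0.048, 0.061, 0.055, 0.048, 0.044, 0.027, 0.041, 0.034,
    0.051, 0.027, 0.024, 0.034, 0.048, 0.051, 0.068, 0.055, 0.034, 0.051,
    0.014, 0.017, 0.034, 0.041, 0.024, 0.041, 0.034, 0.041, 0.041, 0.037,
    0.027, 0.024, 0.048, 0.055, 0.034, 0.068, 0.048, 0.034, 0.068, 0.055,
    0.044, 0.065, 0.102, 0.075, 0.092, 0.133, 0.174, 0.204, 0.378, 46.581,
]


def mass(lo_pct, hi_pct):
    """Total percentage of messages scoring in [lo_pct/100, hi_pct/100).

    Bounds are whole percents (hypothetical convention for this sketch)
    so that no floating-point bin lookup is needed.
    """
    if not 0 <= lo_pct < hi_pct <= 100:
        raise ValueError("need 0 <= lo_pct < hi_pct <= 100")
    return round(sum(HIST[lo_pct:hi_pct]), 3)


if __name__ == "__main__":
    # Candidate slices on the low side, then their mirror images.
    for lo, hi in [(1, 10), (10, 40), (40, 50), (50, 60), (60, 90), (90, 99)]:
        print(f"0.{lo:02d}-0.{hi:02d}  {mass(lo, hi):.3f}")
```

Working in whole percents keeps the bin boundaries exact; the totals
are rounded to three places to match the precision of the histogram.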

So, my guess is something like one of these rule sets (names are the
average score in the range, rounded off to the nearest .05):

body BAYES_00           eval:check_bayes('0.00', '0.01')
body BAYES_05           eval:check_bayes('0.01', '0.10')
body BAYES_25           eval:check_bayes('0.10', '0.40')
body BAYES_45           eval:check_bayes('0.40', '0.50')
body BAYES_55           eval:check_bayes('0.50', '0.60')
body BAYES_75           eval:check_bayes('0.60', '0.90')
body BAYES_95           eval:check_bayes('0.90', '0.99')
body BAYES_99           eval:check_bayes('0.99', '1.00')

or perhaps have a really indeterminate middle range:

body BAYES_00           eval:check_bayes('0.00', '0.01')
body BAYES_05           eval:check_bayes('0.01', '0.10')
body BAYES_30           eval:check_bayes('0.10', '0.49')
body BAYES_50           eval:check_bayes('0.49', '0.51')
body BAYES_70           eval:check_bayes('0.51', '0.90')
body BAYES_95           eval:check_bayes('0.90', '0.99')
body BAYES_99           eval:check_bayes('0.99', '1.00')

but, the middle 10% is pretty darn useless:

range      spam%   ham%
0.00-0.05  0.020 94.457
0.05-0.10  0.014  0.736
0.10-0.15  0.014  0.389
0.15-0.20  0.000  0.375
0.20-0.25  0.000  0.293
0.25-0.30  0.007  0.280
0.30-0.35  0.000  0.239
0.35-0.40  0.000  0.348
0.40-0.45  0.007  0.334
0.45-0.50  0.252  1.602 <- not great
0.50-0.55  1.526  0.627 <- not great
0.55-0.60  0.347  0.041
0.60-0.65  0.327  0.041
0.65-0.70  0.470  0.048
0.70-0.75  0.252  0.007
0.75-0.80  0.375  0.014
0.80-0.85  0.341  0.034
0.85-0.90  0.531  0.014
0.90-0.95  0.722  0.034
0.95-1.00 94.797  0.089

now, what if we break down the middle a bit more...

range     spam%  ham% S/O ratio
0.40-0.41 0.007 0.048 0.127273
0.41-0.42 0.000 0.041 0
0.42-0.43 0.000 0.068 0
0.43-0.44 0.000 0.095 0
0.44-0.45 0.000 0.082 0
0.45-0.46 0.000 0.055 0
0.46-0.47 0.000 0.116 0
0.47-0.48 0.014 0.150 0.0853659
0.48-0.49 0.014 0.198 0.0660377
0.49-0.50 0.286 1.180 0.195089  <- yuck
0.50-0.51 1.008 0.484 0.675603  <- yuck
0.51-0.52 0.143 0.034 0.80791   <- yuck
0.52-0.53 0.095 0.000 1
0.53-0.54 0.116 0.007 0.943089
0.54-0.55 0.102 0.007 0.93578
0.55-0.56 0.095 0.000 1
0.56-0.57 0.082 0.007 0.921348
0.57-0.58 0.041 0.014 0.745455
0.58-0.59 0.082 0.000 1
0.59-0.60 0.048 0.020 0.705882

In the end, I tried a bunch of middle ranges and it probably won't
matter much.  Even making the middle 40% wide doesn't really do all that
much for S/O ratios (which ultimately correlate well with the score).
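
The S/O column in these tables appears to be spam% / (spam% + ham%);
the values in the breakdown above are consistent with that.  A small
Python sketch, where the so_ratio() name is only illustrative and the
0-for-empty-range convention is an assumption:

```python
def so_ratio(spam_pct, ham_pct):
    """Spam/overall ratio for one range: spam% / (spam% + ham%).

    Returns 0.0 when the range has no hits at all (an assumption made
    for this sketch; the tables in this comment have no such row).
    """
    total = spam_pct + ham_pct
    if total == 0:
        return 0.0
    return spam_pct / total


if __name__ == "__main__":
    # (range, spam%, ham%) rows taken from the 0.40-0.60 breakdown.
    rows = [
        ("0.40-0.41", 0.007, 0.048),
        ("0.49-0.50", 0.286, 1.180),
        ("0.50-0.51", 1.008, 0.484),
        ("0.52-0.53", 0.095, 0.000),
    ]
    for name, spam, ham in rows:
        print(f"{name} {spam:.3f} {ham:.3f} {so_ratio(spam, ham):.6g}")
```

A ratio near 0 or 1 means the range separates ham from spam well; a
ratio near 0.5 means a hit in that range says very little.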

Here are some possible ranges:

20% wide in middle:

RANGE      SPAM%   HAM% S/O ratio
0.00-0.01  0.014 93.100 0.000150353
0.01-0.10  0.020  2.093 0.00946522
0.10-0.40  0.020  1.923 0.0102934
0.40-0.60  2.132  2.605 0.450074
0.60-0.90  2.295  0.157 0.935971
0.90-0.99  2.465  0.068 0.973154
0.99-1.00 93.053  0.055 0.999409

10% wide in middle:

RANGE      SPAM%   HAM% S/O ratio
0.00-0.01  0.014 93.100 0.000150353
0.01-0.10  0.020  2.093 0.00946522
0.10-0.45  0.027  2.257 0.0118214
0.45-0.55  1.778  2.230 0.443613
0.55-0.90  2.643  0.198 0.930306
0.90-0.99  2.465  0.068 0.973154
0.99-1.00 93.053  0.055 0.999409

4% wide in middle:

RANGE      SPAM%   HAM% S/O ratio
0.00-0.01  0.014 93.100 0.000150353
0.01-0.10  0.020  2.093 0.00946522
0.10-0.48  0.041  2.577 0.0156608
0.48-0.52  1.451  1.896 0.433523
0.52-0.90  2.956  0.211 0.933375
0.90-0.99  2.465  0.068 0.973154
0.99-1.00 93.053  0.055 0.999409

2% wide in middle:

RANGE      SPAM%   HAM% S/O ratio
0.00-0.01  0.014 93.100 0.000150353
0.01-0.10  0.020  2.093 0.00946522
0.10-0.48  0.041  2.577 0.0156608
0.48-0.52  1.451  1.896 0.433523
0.52-0.90  2.956  0.211 0.933375
0.90-0.99  2.465  0.068 0.973154
0.99-1.00 93.053  0.055 0.999409

I like the 10% wide option the best since the .10 to .45 range is more
usable than in the 4% wide option and it's not quite as wide as the 20%
option.  But, since the middle 20% got a score of zero in 2.6x, there's
not much reason to go any narrower than that.

Yes, the middle is USELESS, so I don't want to waste rules on it.

I did a bit more tweaking.  Before I go on too much longer, here's what
I liked the best:

RANGE      SPAM%   HAM% S/O ratio
0.00-0.01  0.014 93.100 0.000150353
0.01-0.05  0.007  1.357 0.00513196
0.05-0.40  0.034  2.659 0.0126253
0.40-0.60  2.132  2.605 0.450074
0.60-0.95  3.017  0.191 0.940461
0.95-0.99  1.744  0.034 0.980877
0.99-1.00 93.053  0.055 0.999409

body BAYES_00           eval:check_bayes('0.00', '0.01')
body BAYES_05           eval:check_bayes('0.01', '0.05')
body BAYES_25           eval:check_bayes('0.05', '0.40')
body BAYES_50           eval:check_bayes('0.40', '0.60')
body BAYES_75           eval:check_bayes('0.60', '0.95')
body BAYES_95           eval:check_bayes('0.95', '0.99')
body BAYES_99           eval:check_bayes('0.99', '1.00')

or perhaps if you want to break it down a bit more:

RANGE      SPAM%   HAM% S/O ratio
0.00-0.01  0.014 93.100 0.000150353
0.01-0.05  0.007  1.357 0.00513196
0.05-0.20  0.027  1.500 0.0176817
0.20-0.40  0.007  1.159 0.00600343
0.40-0.60  2.132  2.605 0.450074
0.60-0.80  1.423  0.109 0.928851
0.80-0.95  1.594  0.082 0.951074
0.95-0.99  1.744  0.034 0.980877
0.99-1.00 93.053  0.055 0.999409

body BAYES_00           eval:check_bayes('0.00', '0.01')
body BAYES_05           eval:check_bayes('0.01', '0.05')
body BAYES_10           eval:check_bayes('0.05', '0.20')
body BAYES_25           eval:check_bayes('0.20', '0.40')
body BAYES_50           eval:check_bayes('0.40', '0.60')
body BAYES_75           eval:check_bayes('0.60', '0.80')
body BAYES_90           eval:check_bayes('0.80', '0.95')
body BAYES_95           eval:check_bayes('0.95', '0.99')
body BAYES_99           eval:check_bayes('0.99', '1.00')
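
As a rough illustration of how the 9 rule variant partitions the
probability space, here is a Python sketch mapping a Bayes probability
to the rule that would fire.  The boundary handling (lower bound
exclusive, upper bound inclusive, with 0.00 falling into BAYES_00) is
an assumption about how check_bayes treats its arguments, not something
confirmed in this thread, and rule_for() is a hypothetical helper.

```python
from bisect import bisect_left

# Upper bounds and names of the 9 rule variant listed above.
UPPER_BOUNDS = [0.01, 0.05, 0.20, 0.40, 0.60, 0.80, 0.95, 0.99, 1.00]
RULE_NAMES = [
    "BAYES_00", "BAYES_05", "BAYES_10", "BAYES_25", "BAYES_50",
    "BAYES_75", "BAYES_90", "BAYES_95", "BAYES_99",
]


def rule_for(probability):
    """Return the rule name whose range contains the given probability.

    Assumed semantics: each range is (lower, upper], except that 0.0
    itself is treated as part of the first range.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # bisect_left finds the first upper bound >= probability.
    return RULE_NAMES[bisect_left(UPPER_BOUNDS, probability)]


if __name__ == "__main__":
    for p in (0.0, 0.03, 0.15, 0.5, 0.7, 0.97, 0.999):
        print(p, rule_for(p))
```

Every probability lands in exactly one range, so no two of the nine
rules can fire on the same message under this assumed semantics.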

Here's a quick comparison against the current BAYES rules (same
mass-check).  It's obvious there are too many ranges, because some
sections are really thin and the RANK doesn't change much for long
stretches (denoted by the char at the end of each line):

OVERALL%   SPAM%     HAM%      S/O   RANK   SCORE  NAME
 45.552   0.0133  91.1482    0.000   1.00    0.00  BAYES_00  ***
  0.304   0.0067   0.6008    0.011   0.56    0.00  BAYES_01  X
  0.364   0.0000   0.7276    0.000   0.58    0.00  BAYES_02  X
  0.367   0.0133   0.7210    0.018   0.58    0.00  BAYES_05  X
  0.380   0.0133   0.7477    0.018   0.58    0.00  BAYES_10  X
  0.284   0.0067   0.5607    0.012   0.56    0.00  BAYES_20  X
  0.287   0.0000   0.5741    0.000   0.56    0.00  BAYES_30  X
  0.127   0.0067   0.2470    0.026   0.50    0.00  BAYES_40  -
  1.014   0.2934   1.7356    0.145   0.54    0.00  BAYES_44  -
  1.034   1.5401   0.5274    0.745   0.48    0.00  BAYES_50  -
  0.143   0.2467   0.0401    0.860   0.48    0.00  BAYES_56  -
  0.434   0.7801   0.0868    0.900   0.54    0.00  BAYES_60  X
  0.317   0.6134   0.0200    0.968   0.56    0.00  BAYES_70  X
  0.450   0.8534   0.0467    0.948   0.58    0.00  BAYES_80  X
  0.370   0.7067   0.0334    0.955   0.56    0.00  BAYES_90  X
  0.500   0.9734   0.0267    0.973   0.61    0.00  BAYES_95  *
  0.370   0.7334   0.0067    0.991   0.58    0.00  BAYES_98  **
 45.602  91.0927   0.0534    0.999   0.98    0.00  BAYES_99  ***

If you had to make me pick, I'd go for the 9 rule variant above.

Daniel





------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
