Dear SA users, I've done an off-list comparison of Bayes DBs, and we found 
some interesting differences. We're trying to find out why Bayes on 
server #1 produces better scores:

Server #1 local.cf (SA 3.1.1):
bayes_expiry_max_db_size            2000000
bayes_auto_expire                   0
bayes_file_mode                     0777
bayes_auto_learn_threshold_spam     8.00
bayes_auto_learn_threshold_nonspam  1.0

Server #1 bayes files:
-rw-rw-rw-+  1 vscan      vscan  19738624 May 10 10:04 bayes_db_seen
-rw-rw-rw-+  1 vscan      vscan  41697280 May 10 10:04 bayes_db_toks

Server #1 bayes dump:
0.000          0      93053          0  non-token data: nspam
0.000          0      53428          0  non-token data: nham
0.000          0    1261864          0  non-token data: ntokens

Server #2 local.cf:
bayes_auto_learn    1
bayes_learn_to_journal  1
bayes_auto_expire   1
ok_languages        de en es
ok_locales          en

Server #2 bayes files:
  21M 2006-05-10 10:20 bayes_seen
 5,3M 2006-05-10 10:20 bayes_toks

Server #2 bayes dump:
0.000          0     155791          0  non-token data: nspam
0.000          0      80523          0  non-token data: nham
0.000          0     129852          0  non-token data: ntokens

From the numbers I would say that server #2 has learned more spam+ham, 
but has only about 1/10th of the tokens. That server is also far less 
accurate with Bayes than server #1. Could the token count be the reason? 
With the new spam of the last few weeks that tries to poison Bayes, the 
poisoning could perhaps be effective against the default limit of 
150,000 tokens?
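
If the default 150,000-token limit really is what holds server #2 back, a 
quick test could be to raise it in that server's local.cf and flush the 
journal. Just a sketch; the 1,000,000 below is an arbitrary example value, 
not a recommendation:

# local.cf on server #2 -- example value only, size it to your RAM/disk
bayes_expiry_max_db_size    1000000

# flush the journal; the new limit applies to future expiry runs
sa-learn --sync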


Another tip for everyone: with server #1's setting of 
bayes_auto_learn_threshold_spam     8.00
you would expect this message to be autolearned:

> X-Spam-Status: Yes, hits=8.7 required=5.0 tests=BAYES_99=3.5, 
> HTML_MESSAGE=0.001,HTML_MIME_NO_HTML_TAG=0,HTML_TAG_EXIST_TBODY=0.282, 
> MIME_HTML_ONLY=0.389,RELAY_DE=0.01,REPLY_TO_EMPTY=0.512, 
> SARE_FORGED_EBAY=4 autolearn=no bayes=1.0000

But it is autolearn=no. (As far as I know, the autolearn decision is made 
on a score computed without the Bayes rules, so the 3.5 points from 
BAYES_99 don't count towards the 8.0 threshold here.) This shows that 
manually re-feeding spam can be effective for your Bayes, because this 
sure-is-spam would not have been learned automatically. Since it's already 
BAYES_99, you could say "don't bother, I'll be fine" *g*, but Bayes needs 
continuous training, because tokens time out...
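
If you want to re-feed such messages by hand, sa-learn does the job. 
Just a sketch; the mbox paths are placeholders for your own corpus:

# feed confirmed spam and ham back into Bayes (paths are examples)
sa-learn --spam --mbox /path/to/missed-spam.mbox
sa-learn --ham  --mbox /path/to/ham.mbox

# if you use bayes_learn_to_journal 1 (as on server #2),
# flush the journal afterwards
sa-learn --sync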

And why was SARE_FORGED_EBAY lowered to 4? It was so nice at 100+...



Also, we set bayes_expiry_max_db_size to 50000 and ran
sa-learn --force-expire --sync
but the dump still shows these numbers:
>  0.000          0     242424          0  non-token data: nspam
>  0.000          0     313252          0  non-token data: nham
>  0.000          0     134001          0  non-token data: ntokens

Why are there still 134k tokens?
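
As far as I understand it, expiry works on token access times: it estimates 
an atime cutoff and only drops tokens older than that, rather than 
truncating the DB down to the configured size, so the count can stay well 
above bayes_expiry_max_db_size if most tokens were touched recently. The 
magic tokens show what the last expiry run actually did:

# inspect the bayes "magic" tokens
sa-learn --dump magic
# look at "last expiry atime", "last expire atime delta" and
# "last expire reduction count" to see whether anything was removed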

Kind regards, zmi
-- 
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0660/4156531                          .network.your.ideas.
// PGP Key:   "lynx -source http://zmi.at/zmi3.asc | gpg --import"
// Fingerprint: 44A3 C1EC B71E C71A B4C2  9AA6 C818 847C 55CB A4EE
// Keyserver: www.keyserver.net                 Key-ID: 0x55CBA4EE
