Re: Bayes spam and ham out of proportion

RW Thu, 29 Apr 2010 08:20:42 -0700

On Thu, 29 Apr 2010 08:25:29 -0400
Frank Bures <lisfr...@chem.toronto.edu> wrote:


> I've been running spamassassin for years.  I am using auto-learn with
> very conservative thresholds.  However, after several years of usage
> my spam database is about three times larger than my ham database and
> I am starting to see false positives.
> 
> Is there a way to "shrink" the spam database?


Yes, there is.

If you run sa-learn --backup, the database is dumped as a flat text
file that you can process, e.g.

v       3       db_version # this must be the first line!!!
v       1716    num_spam
v       1281    num_nonspam
t       1       0       1244129543      152a127dd2
t       1       0       1244399507      f6796cae57
t       1       0       1244336958      d585b2c212
t       1       0       1244458917      ff5612c891
t       1       0       1244842267      1414bea872
...
s       h       d4303932c1106a6b39161c7c7166db0bd295a...@sa_generated
s       h       09302920ab81b2442c9f95751d3f6aa3e76d9...@sa_generated
s       s       43e88c7bc1f9c7645a07e044f6714044a6014...@sa_generated
s       h       870687a6361e32521d0438c0a0ac494801caf...@sa_generated

where 

v = metadata
t = tokens
s = signatures  (for retraining)

and where the format for tokens is: 
t Nspam Nham epoch-time hash
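
Before changing anything it can help to check how lopsided the counts
really are. The following is only a sketch: bayes_summary is a made-up
helper name, and it assumes the backup has whitespace-separated fields
laid out as in the sample above.

```shell
#!/bin/sh
# Hypothetical helper: summarise an sa-learn backup read from stdin.
# Assumes whitespace-separated fields, as in the sample dump.
bayes_summary() {
    awk '
        $1 == "v" && $3 == "num_spam"    { nspam = $2 }   # spam messages learned
        $1 == "v" && $3 == "num_nonspam" { nham  = $2 }   # ham messages learned
        $1 == "t"                        { tokens++ }     # one line per token
        $1 == "s"                        { seen++ }       # one line per signature
        END {
            printf "spam=%d ham=%d tokens=%d seen=%d\n", nspam, nham, tokens, seen
            if (nham > 0) printf "spam/ham ratio=%.2f\n", nspam / nham
        }
    '
}
```

Feed it the backup on stdin, for example: bayes_summary < oldfile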


What you need to do is write a script that divides the metadata
num_spam value and all the token Nspam counts by 3. The updated database
can then be loaded back in with --restore.

e.g. 
awk '/^(t|v.*num_spam)/    { $2 = int($2/3) }
     /^(v|s)/ || ($2+$3)>0 { print }' <oldfile >newfile
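
If the ratio is not exactly 3, the same idea can be wrapped in a small
function that takes the divisor as an argument. This is a sketch, not a
tested recipe: scale_spam_counts is a made-up name, and setting OFS to
a tab assumes the backup file is tab-separated (drop that option if
yours is not).

```shell
#!/bin/sh
# Hypothetical helper: divide num_spam and every token's spam count by a
# divisor (default 3). Reads a backup on stdin, writes the result to stdout.
scale_spam_counts() {
    # OFS is set to a tab on the assumption that the backup is tab-separated;
    # awk rebuilds any modified line using OFS.
    awk -v d="${1:-3}" -v OFS='\t' '
        # Scale the spam message count and each token spam count; int() truncates.
        /^(t|v.*num_spam)/    { $2 = int($2 / d) }
        # Keep metadata and signatures; drop tokens whose counts are now both 0.
        /^(v|s)/ || ($2+$3)>0 { print }
    '
}
```

The whole round trip would then look something like this (file names
are placeholders; keep the unmodified backup so you can restore it if
the result is worse):

    sa-learn --backup > oldfile
    scale_spam_counts 3 < oldfile > newfile
    sa-learn --restore newfile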
