On Thu, 29 Apr 2010 08:25:29 -0400 Frank Bures <lisfr...@chem.toronto.edu> wrote:
> I've been running spamassassin for years.  I am using auto-learn with
> very conservative thresholds.  However, after several years of usage
> my spam database is about three times larger than my ham database and
> I am starting to see false positives.
>
> Is there a way to "shrink" the spam database?

Yes there is.  If you run sa-learn --backup you can process the database
as a flat text file, e.g.

v 3 db_version # this must be the first line!!!
v 1716 num_spam
v 1281 num_nonspam
t 1 0 1244129543 152a127dd2
t 1 0 1244399507 f6796cae57
t 1 0 1244336958 d585b2c212
t 1 0 1244458917 ff5612c891
t 1 0 1244842267 1414bea872
...
s h d4303932c1106a6b39161c7c7166db0bd295a...@sa_generated
s h 09302920ab81b2442c9f95751d3f6aa3e76d9...@sa_generated
s s 43e88c7bc1f9c7645a07e044f6714044a6014...@sa_generated
s h 870687a6361e32521d0438c0a0ac494801caf...@sa_generated

where

v = metadata
t = tokens
s = signatures (for retraining)

and where the format for token lines is:

t Nspam Nham epoch-time hash

What you need to do is write a script that divides the metadata num_spam
value and all the token Nspam counts by 3.  The updated database can then
be loaded back in with --restore.  e.g.

awk '/^(t|v.*num_spam)/ {$2=int($2/3)} /^(v|s)/ || ($2+$3)>0 {print}' <oldfile >newfile
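In case the full round trip is useful, something like the following should
do it.  The file names here are just placeholders, the divide-by-3 factor is
the same one baked into the awk above, and the --clear step is shown as a
precaution (your version of --restore may already replace the existing
database).  Stop anything that writes to the Bayes DB (spamd, amavis, cron
jobs running sa-learn) while you do this.

# dump the current Bayes database to a flat text file
sa-learn --backup > bayes_old.txt

# scale num_spam and the per-token spam counts down by 3,
# dropping any token whose spam and ham counts both end up at zero
awk '/^(t|v.*num_spam)/ {$2=int($2/3)} /^(v|s)/ || ($2+$3)>0 {print}' \
    < bayes_old.txt > bayes_new.txt

# wipe the existing database, then load the rescaled copy
sa-learn --clear
sa-learn --restore bayes_new.txt

Afterwards sa-learn --dump magic should show num_spam at roughly a third of
its old value, with num_nonspam unchanged.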