> There are times when the Bayes database begins to misbehave, scoring
> significant ham with BAYES_99 or significant spam with BAYES_00.
> Whenever that happens, for whatever reason, wipe the database and
> retrain (a good reason to keep 2-3k spam and 2-3k ham around, for a
> quick retrain).
>
Hi,

my bayes looks like this:

0.000          0          2          0  non-token data: bayes db version
0.000          0       4588          0  non-token data: nspam
0.000          0      15006          0  non-token data: nham
0.000          0     148621          0  non-token data: ntokens
0.000          0 1088644104          0  non-token data: oldest atime
0.000          0 1089366749          0  non-token data: newest atime
0.000          0 1089366089          0  non-token data: last journal sync atime
0.000          0 1089335321          0  non-token data: last expiry atime
0.000          0     691200          0  non-token data: last expire atime delta
0.000          0       7297          0  non-token data: last expire reduction count

I have been using it for a long time with SA's autolearn only, and
recently I started training it manually. Basically I train it only on
false positives and false negatives (mistake-based learning). It seems
to work fine, properly classifying spam and ham messages. Is my whole
approach incorrect?

Also, given the ham and spam counts above, and considering that
sa-learn's man page says there is no significant improvement above
5,000 messages, how much more should I let it grow?

However, my experience says that, with a large number of SA rules,
emptying it would not be a problem, as the rules will most probably
identify the spam. All I have to do is keep training at the same
frequency as I do now (i.e. it doesn't really matter if the spam and ham
I already 'learned' manually are lost - my work remains the same!).
It's a strange approach, but it works for me (I get about 4,000 messages
per day, of which about 40% are spam).

I would appreciate any comments.

Regards,
Alty
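
P.S. For anyone who wants to read the counters above more easily, here is a minimal Python sketch that parses the dump and derives the spam share and the time span covered by the tokens. It assumes the text has the layout shown above (it looks like what `sa-learn --dump magic` prints, but that is an assumption), that the value of each "non-token data" line sits in the third column, and that the atime values are Unix timestamps in seconds. The names `DUMP` and `parse_magic` are made up for this example.

```python
from datetime import datetime, timezone

# Dump text copied from the message above (only the lines used below).
DUMP = """\
0.000 0 4588 0 non-token data: nspam
0.000 0 15006 0 non-token data: nham
0.000 0 148621 0 non-token data: ntokens
0.000 0 1088644104 0 non-token data: oldest atime
0.000 0 1089366749 0 non-token data: newest atime
0.000 0 691200 0 non-token data: last expire atime delta
"""


def parse_magic(text):
    """Return {name: int value} for every 'non-token data' line."""
    values = {}
    for line in text.splitlines():
        parts = line.split(None, 4)  # four numeric columns, then the label
        if len(parts) != 5 or not parts[4].startswith("non-token data:"):
            continue
        name = parts[4].split(":", 1)[1].strip()
        values[name] = int(parts[2])  # assumption: value is in column 3
    return values


magic = parse_magic(DUMP)

# Share of learned messages that were spam.
total = magic["nspam"] + magic["nham"]
spam_share = magic["nspam"] / total

# Time span between the oldest and newest token access, in days.
span_days = (magic["newest atime"] - magic["oldest atime"]) / 86400

# Expiry window expressed in days instead of seconds.
expire_days = magic["last expire atime delta"] / 86400

oldest = datetime.fromtimestamp(magic["oldest atime"], tz=timezone.utc)

print(f"learned messages: {total} ({spam_share:.1%} spam)")
print(f"token atime span: {span_days:.1f} days, oldest {oldest:%Y-%m-%d}")
print(f"last expire atime delta: {expire_days:.0f} days")
```

With the numbers above, the database holds tokens touched over a little more than eight days, which is in line with the 691200-second (8-day) expire delta.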
