I have been using it for a long time with only SA's autolearn, and recently I started training it manually. Basically I train it only on false positives and false negatives (mistake-based learning). It seems to work fine, properly classifying spam and ham messages. Is my whole approach incorrect?
Incorrect, no; sub-optimal, yes.
It's a common misconception that there's no point in training messages that bayes already classifies correctly. Since bayes does token analysis, it's quite possible for a BAYES_99 message to still contain multiple spam tokens never before seen by your bayes DB. Those new tokens could help avoid future false negatives.
Don't restrict yourself to mistake-based training unless training messages is overly difficult in the first place. Clearly it's more important to train mistakes, but that doesn't make it pointless to train non-mistakes.
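For instance (the Maildir paths here are hypothetical; point sa-learn at wherever your classified mail actually lives), you can periodically feed it everything rather than just the corrections:

    # train on all known spam, including messages bayes already caught
    sa-learn --spam ~/Maildir/.Spam/cur

    # train on known-good mail the same way
    sa-learn --ham ~/Maildir/cur

sa-learn remembers which message IDs it has already learned and skips them, so re-running this over the same folders is cheap and harmless.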
Also, based on the above numbers of ham and spam, and considering that sa-learn's man page says there is no significant improvement above 5,000 messages, how much more should I let it grow?
Don't worry about the number of messages. Train forever and let the message counts climb indefinitely.
The "no point in training more than 5,000 messages" really refers to "no point in training 5,000 messages at one time", because the token-count limits will wind up expiring almost as many tokens as you train after that. However, if you come back a week later and train 1,000 more, SA will have had some time to observe which tokens didn't get hit again, and will expire the least useful tokens from the old batch to make room for the new tokens.
SA automatically purges old tokens on an LRU basis, so the size of the bayes DB is self-limiting regardless of how many messages were trained in the past.
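The ceiling comes from the bayes_expiry_max_db_size and bayes_auto_expire settings. A minimal local.cf sketch, just making the stock defaults explicit:

    # these are the stock defaults; shown here only for illustration
    bayes_auto_expire        1        # expire tokens opportunistically
    bayes_expiry_max_db_size 150000   # max token count before expiry kicks in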
Since SA expires tokens individually rather than on a whole-message basis (which would be stupid anyway), there's no way to know exactly how many emails still have tokens in the database. For that reason, SA doesn't decrement the nspam/nham counters during expiry.
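You can check the counters on your own DB at any time:

    # print the bayes DB's internal statistics (nspam, nham, ntokens, ...)
    sa-learn --dump magic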
For example, look at mine:
    0.000          0     232809          0  non-token data: nspam
    0.000          0       9644          0  non-token data: nham
    0.000          0     183131          0  non-token data: ntokens
That's a vastly larger total number of messages trained than yours, yet my token count is not much higher, because many of the tokens I trained in the past have since been expired.
