I did some more research and I think I should report my new findings so that this thread can be useful to other readers.
First:

0.000 0 5232 0 non-token data: nspam
0.000 0 70408 0 non-token data: nham
0.000 0 388070 0 non-token data: ntokens

nspam and nham are definitely the numbers of messages learned.

Second: I saw that nham increased every few seconds, and discovered that bayes_auto_learn was enabled!

My situation yesterday:

0.000 0 1042011 0 non-token data: nspam
0.000 0 66472 0 non-token data: nham
0.000 0 663479 0 non-token data: ntokens

My situation now:

0.000 0 1042049 0 non-token data: nspam
0.000 0 71228 0 non-token data: nham
0.000 0 1040661 0 non-token data: ntokens

So at least I now know that the system is feeding the Bayes engine with new data, and that this is how the results can change.

Third: in 72_active.cf there are a lot of bayes_ignore_header directives, but they don't include the headers added by my commercial antivirus. Should I create a patch?

Fourth: I added a dbg statement to Bayes.pm, sub tokenize, to print the tokens it extracts from the message. I agree with some of them, but not with others. Is there any documentation explaining why tokens are extracted this way? (Some notes are in the source code.) I also discovered that some words should probably be added to the stopwords list, but there is no way to do that from a configuration file; I would have to modify the SpamAssassin code directly...

To end: I think the only way to proceed now is to nuke the Bayes db and start from scratch:
- set up the Bayes configuration correctly
- double-check that the corpus is correctly classified
- run sa-learn

For the "set up the Bayes configuration correctly" step I'd welcome your contributions :-) I have already excluded all the headers added by my antivirus and configured the internal/external/trusted networks.

Thanks
Francesco
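For the "set up the Bayes configuration correctly" step, this is roughly what I plan to put in local.cf. The X-MyAV-* header names are only placeholders for whatever your antivirus actually adds; check a raw delivered message to get the real names:

```
# Stop unattended feeding of the Bayes db while rebuilding the corpus
bayes_auto_learn 0

# Keep locally-added antivirus headers out of the token stream
# (X-MyAV-Status / X-MyAV-Report are placeholder names)
bayes_ignore_header X-MyAV-Status
bayes_ignore_header X-MyAV-Report
```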
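And the "nuke and start from scratch" plan as a command sketch, not meant to be run blindly: the corpus paths are placeholders, and --clear really does wipe the existing db:

```shell
sa-learn --clear                       # wipe the current Bayes db
sa-learn --spam /path/to/spam-corpus   # learn the verified spam corpus
sa-learn --ham /path/to/ham-corpus     # learn the verified ham corpus
sa-learn --dump magic                  # nspam/nham should now match the corpus sizes
```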
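P.S. for other readers: the counter lines above come from `sa-learn --dump magic`. A small sketch of how I pull out just the three numbers; the heredoc stands in for the live output (with the counts from my current db), in real use you would pipe `sa-learn --dump magic` into the same awk:

```shell
# Sketch: extract nspam/nham/ntokens from `sa-learn --dump magic` output.
# Real use:  sa-learn --dump magic | awk '/nspam|nham|ntokens/ { print $7 ": " $3 }'
# Field 3 is the counter value, field 7 is the counter name.
awk '/nspam|nham|ntokens/ { print $7 ": " $3 }' <<'EOF'
0.000 0 1042049 0 non-token data: nspam
0.000 0 71228 0 non-token data: nham
0.000 0 1040661 0 non-token data: ntokens
EOF
```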