On 28.05.19 15:34, hg user wrote:
I did some more research and want to report what I found, so that the
thread can be useful to other readers.
First:
0.000 0 5232 0 non-token data: nspam
0.000 0 70408 0 non-token data: nham
0.000 0 388070 0 non-token data: ntokens
The nspam and nham values are definitely the number of messages learnt.
Second:
I saw that nham increased every few seconds, and discovered that
bayes_auto_learn was enabled!
My situation yesterday:
0.000 0 1042011 0 non-token data: nspam
0.000 0 66472 0 non-token data: nham
0.000 0 663479 0 non-token data: ntokens
My situation now:
0.000 0 1042049 0 non-token data: nspam
0.000 0 71228 0 non-token data: nham
0.000 0 1040661 0 non-token data: ntokens
So at least I now know that the system keeps feeding the Bayes engine with
new data, which means the results can change over time.
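
To make the change between the two dumps explicit, here is a minimal sketch
in Python. It assumes the lines above have the shape of `sa-learn --dump magic`
output (third column is the value, the name follows "non-token data:"); the
function names parse_magic and counter_deltas are made up for illustration.

```python
# Sketch only: compares two Bayes "magic" dumps as quoted in this thread.
# Assumed line shape: 0.000 0 <value> 0 non-token data: <name>

YESTERDAY = """\
0.000 0 1042011 0 non-token data: nspam
0.000 0 66472 0 non-token data: nham
0.000 0 663479 0 non-token data: ntokens
"""

NOW = """\
0.000 0 1042049 0 non-token data: nspam
0.000 0 71228 0 non-token data: nham
0.000 0 1040661 0 non-token data: ntokens
"""


def parse_magic(dump: str) -> dict[str, int]:
    """Return {counter name: value} for every 'non-token data' line."""
    counters = {}
    for line in dump.splitlines():
        fields = line.split()
        if len(fields) >= 7 and fields[4:6] == ["non-token", "data:"]:
            # Names may contain spaces, so join everything after "data:".
            counters[" ".join(fields[6:])] = int(fields[2])
    return counters


def counter_deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return after - before for counters present in both dumps."""
    return {name: after[name] - before[name] for name in before if name in after}


if __name__ == "__main__":
    for name, delta in counter_deltas(parse_magic(YESTERDAY), parse_magic(NOW)).items():
        print(f"{name}: {delta:+d}")
```

With the numbers quoted above, nham grew by 4756 and nspam by only 38, which
fits the observation that auto-learning was mostly adding ham.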
Third:
In 72_active.cf there are a lot of bayes_ignore_header directives, but they
don't include the headers added by my commercial antivirus. Should I create
a patch?
Fourth:
I added a dbg statement to Bayes.pm, sub tokenize, to print the tokens it
extracts from the message.
I agree with some of them but not with others. I'd like to know if there is
any documentation that explains why tokens are extracted this way (some
notes are in the source code).
I also found that some words should probably be added to the stopwords
list, but there is no way to do that in a configuration file; I would have
to modify the SpamAssassin code directly...
To end:
I think that the only way to proceed now is to nuke the bayes db and start
from scratch:
- set up the bayes configuration correctly
- double-check that the corpus is correctly classified
- run sa-learn
Do you still use Zimbra? If so, have you configured Zimbra?
Did you consult your Zimbra-man?
For the "set up the bayes configuration correctly" step I welcome your
contributions :-) I excluded all the headers of my antivirus and
internal/external/trusted.
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Spam is for losers who can't get business any other way.