On 28.05.19 15:34, hg user wrote:
I did some more research and want to report what I found, so that the
thread can be useful to other readers.

First:
0.000          0       5232          0  non-token data: nspam
0.000          0      70408          0  non-token data: nham
0.000          0     388070          0  non-token data: ntokens
The nspam and nham values are definitely the number of messages learnt.

Second:
I saw that nham increased every few seconds, and discovered that
bayes_auto_learn was enabled!
My situation yesterday:
0.000          0    1042011          0  non-token data: nspam
0.000          0      66472          0  non-token data: nham
0.000          0     663479          0  non-token data: ntokens
My situation now:
0.000          0    1042049          0  non-token data: nspam
0.000          0      71228          0  non-token data: nham
0.000          0    1040661          0  non-token data: ntokens

So at least I now know that the system is feeding the Bayes engine new
data, and that the results can change because of it.
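
The growth between the two snapshots above can be quantified with a short script. This is only a sketch: it assumes the rows come from `sa-learn --dump magic`, that the counter value is the third column, and that only single-word counter names (nspam, nham, ntokens) are of interest. The function names are made up for illustration.

```python
import re

# One "non-token data" row, e.g.
# "0.000          0       5232          0  non-token data: nspam"
# Assumption: the third column holds the counter value.
MAGIC_ROW = re.compile(
    r"^\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data:\s+(\w+)\s*$"
)


def parse_magic(dump):
    """Map counter name -> value for every single-word counter row."""
    counters = {}
    for line in dump.splitlines():
        match = MAGIC_ROW.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
    return counters


def diff_counters(before, after):
    """Return after - before for counters present in both snapshots."""
    return {name: after[name] - before[name] for name in before if name in after}


yesterday = parse_magic("""\
0.000          0    1042011          0  non-token data: nspam
0.000          0      66472          0  non-token data: nham
0.000          0     663479          0  non-token data: ntokens
""")
now = parse_magic("""\
0.000          0    1042049          0  non-token data: nspam
0.000          0      71228          0  non-token data: nham
0.000          0    1040661          0  non-token data: ntokens
""")

delta = diff_counters(yesterday, now)
print(delta)  # {'nspam': 38, 'nham': 4756, 'ntokens': 377182}
```

With the figures quoted here, autolearning added 4756 ham messages but only 38 spam messages in roughly a day.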

Third:
72_active.cf contains a lot of bayes_ignore_header directives, but they
don't include the headers added by my commercial antivirus. Should I create
a patch?
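
Independently of any patch, such headers can be excluded in the local configuration, since bayes_ignore_header may be repeated there. A minimal sketch, assuming the site configuration lives in local.cf; the header names below are hypothetical placeholders, not the ones any particular antivirus adds:

```
# local.cf (sketch; header names are hypothetical placeholders)
bayes_ignore_header X-Example-AV-Status
bayes_ignore_header X-Example-AV-Scanned-By
```

The names would need to be replaced with the headers that actually appear in delivered messages.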

Fourth:
I added a dbg statement to sub tokenize in bayes.pm to print the tokens it
extracts from the message.
I agree with some of them but not with others. Is there any documentation
that explains why tokens are extracted this way? (There are some notes in
the source code.)
I also found that some words should probably be added to the stopword
list, but there is no way to do that in a configuration file; I would have
to modify the SpamAssassin code directly...



To end:
I think that the only way to proceed now is to nuke the bayes db and start
from scratch:
- setup bayes configuration correctly
- double-check that the corpus is correctly classified
- run sa-learn
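
The retraining part of these steps could be scripted roughly as follows. This is a sketch under assumptions: sa-learn is on the PATH, the script runs as the user that owns the Bayes database, and the corpus directories are passed in by the caller. The function names are invented for illustration.

```python
import subprocess


def relearn_commands(ham_dir, spam_dir):
    """Return the sa-learn invocations for a from-scratch retrain, in order."""
    return [
        ["sa-learn", "--clear"],           # wipe the existing Bayes database
        ["sa-learn", "--ham", ham_dir],    # learn the verified ham corpus
        ["sa-learn", "--spam", spam_dir],  # learn the verified spam corpus
        ["sa-learn", "--dump", "magic"],   # print nspam/nham/ntokens to check
    ]


def run_relearn(ham_dir, spam_dir):
    """Run each step in order, stopping at the first one that fails."""
    for cmd in relearn_commands(ham_dir, spam_dir):
        subprocess.run(cmd, check=True)
```

Calling run_relearn with the two corpus directories would execute the whole sequence. Setting bayes_auto_learn 0 beforehand keeps autolearned messages from mixing with the hand-checked corpus while the result is verified.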

Do you still use Zimbra? If so, have you configured Zimbra?
Did you consult your Zimbra-man?


For the "setup bayes configuration correctly" step, your suggestions are
welcome :-) So far I have excluded all the headers of my antivirus and
internal/external/trusted.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Spam is for losers who can't get business any other way.
