On 13 Feb 2018, at 9:33, Horváth Szabolcs wrote:
> This is a production mail gateway serving since 2015. I saw that a few
> messages (both hams and spams) automatically learned by
> amavisd/spamassassin. Today's statistics:
> 3616 autolearn=ham
> 10076 autolearn=no
> 2817 autolearn=spam
> 134 autolearn=unavailable

That's quite high for spam, ham, AND "unavailable" (which indicates
something wrong with the Bayes subsystem, usually transient). This seems
like a recipe for a mis-learning disaster. For comparison, my 2018
autolearn counts:
spam: 418
ham: 15018
unavailable: 166
no: 129555
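
Tallies like the ones above can be produced by counting the "autolearn="
field wherever the SpamAssassin status line ends up (logs or saved
headers). A minimal sketch, assuming the input is plain text lines that
contain "autolearn=<value>"; the function name and the input source are
illustrative, not part of SpamAssassin or amavisd:

```python
import re
from collections import Counter

# Matches the value after "autolearn=" (ham, spam, no, unavailable, ...).
# "autolearn_force=no" is not matched because "_force" sits before the "=".
AUTOLEARN_RE = re.compile(r"autolearn=([a-z_]+)")


def tally_autolearn(lines):
    """Count autolearn outcomes found in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for match in AUTOLEARN_RE.finditer(line):
            counts[match.group(1)] += 1
    return counts
```

Calling tally_autolearn(open(path)) on a day's log (the path is whatever
your site uses) gives per-outcome counts that can be compared from day
to day.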
I also manually train any spam that gets through to me (the biggest spam
target), a small number of spams reported by others, and 'trap' hits. A
wide variety of ham is harder to get for training, but I have found it
useful to give users a well-documented and simple way to help. One way
is to look at what happens to mail AFTER delivery, which can indicate
that a message is ham without needing an admin to try to make a
determination based on content. The simplest one is to learn anything
users mark as $NotJunk as ham. Another is to create an "Archive" mailbox
for every user and learn anything moved there as ham one day after it
was moved. The most important factor (especially in jurisdictions where
human examination of email is a problem) is to tell users how to protect
their email and then do exactly what you told them you would do,
robotically. In the US, Canada, and *SOME* of the EU, this is not risky.
However, I have been told by people in *SOME* EU countries that they
can't even robotically scan ANY mail content, so you shouldn't take my
advice as authoritative: I'm not even a lawyer in the US, much less
Hungary...
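
The "Archive" mailbox approach described above could be sketched roughly
as follows. This is only an illustration under stated assumptions: the
mailbox is a directory with one file per message (Maildir-style), the
file's ctime is used as a stand-in for "when it was moved there" (a
rename updates ctime on most Unix filesystems, but verify that on
yours), and the helper names are hypothetical. The only SpamAssassin
piece used is the "sa-learn --ham" command.

```python
import os
import subprocess
import time

ONE_DAY = 24 * 60 * 60  # seconds


def ready_to_learn(mailbox_dir, now=None, min_age=ONE_DAY):
    """Return message files that have been in mailbox_dir for >= min_age seconds."""
    now = time.time() if now is None else now
    ready = []
    for entry in os.scandir(mailbox_dir):
        # Assumption: ctime approximates the time the message was moved here.
        if entry.is_file() and now - entry.stat().st_ctime >= min_age:
            ready.append(entry.path)
    return sorted(ready)


def learn_as_ham(paths):
    """Feed the given message files to sa-learn as ham (no-op for an empty list)."""
    if paths:
        subprocess.run(["sa-learn", "--ham", *paths], check=True)
```

A daily cron job could call learn_as_ham(ready_to_learn(path)) for each
user's Archive directory, running as whichever user owns the Bayes
database.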

> I think I have no control over what is learnt automatically.

Yes, you do. Run "perldoc
Mail::SpamAssassin::Plugin::AutoLearnThreshold" for details.
You can set the learning thresholds, which control what gets learned.
The defaults (0.1 for ham and 12 for spam) learn far too much spam as
ham and too little spam as spam. I use -0.2 and 6, which means I don't
autolearn a lot but everything I autolearn as ham has at least one hit
on a substantial "nice" rule or 2 hits on weak ones.
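
For illustration, the thresholds mentioned above would look something
like this in the SpamAssassin configuration. The option names are the
ones documented by the AutoLearnThreshold plugin; the file location is
an assumption and varies by installation:

```
# e.g. /etc/mail/spamassassin/local.cf (path is an assumption)
bayes_auto_learn 1
# Learn as ham only below this score (stock default: 0.1)
bayes_auto_learn_threshold_nonspam -0.2
# Learn as spam only above this score (stock default: 12.0)
bayes_auto_learn_threshold_spam 6.0
```

Restart or reload whatever embeds SpamAssassin (amavisd here) so the
new values take effect.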
There's a lot of vehemence against autolearn expressed here but not a
lot of evidence that it operates poorly when configured wisely. The
defaults are NOT wise.

> Let's just assume for a moment that 1.4M ham-samples are valid.

Bad assumption. Your Bayes checks are uncertain about mail you've told
SA is definitely spam. That's broken. It's a sort of breakage that
cannot exist if you do not have a large quantity of spam that has been
learned as ham.

> Is there a ham:spam ratio I should stick to it?

No.

> I presume if we have a 1:1 ratio then future messages won't be
> considered as spam as well.

The ham:spam ratio in the Bayes DB or its autolearning is not a
generally useful metric. 1:1 is not magically good and neither is any
other ratio, even with reference to a single site's mailstream. A very
large ratio *on either side* indicates a likely problem in what is being
learned, but you can't correlate the ratio to any particularly wrong
bias in Bayes scoring. It is an inherently chaotic relationship. Factors
that actually matter are correctness of learning, sample quality, and
currency. You can control how current your Bayes DB is (USE AUTO-EXPIRE)
but the other two factors are never going to be perfect.
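
As a hedged illustration of the auto-expire point: in stock
SpamAssassin the relevant setting is bayes_auto_expire, which is on by
default, so the main thing is to make sure nothing in your site
configuration has turned it off. A config sketch (file location is
site-specific):

```
# Let SpamAssassin expire old Bayes tokens opportunistically
bayes_auto_expire 1
```

If automatic expiry is disabled on purpose, for example to keep it out
of the mail-scanning path, "sa-learn --force-expire" can be run from
cron instead.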