On 13 Feb 2018, at 9:33, Horváth Szabolcs wrote:

> This is a production mail gateway, in service since 2015. I saw that a few messages (both ham and spam) were automatically learned by amavisd/spamassassin. Today's statistics:
>
>    3616 autolearn=ham
>   10076 autolearn=no
>    2817 autolearn=spam
>     134 autolearn=unavailable

That's quite high for spam, ham, AND "unavailable" (which indicates something wrong with the Bayes subsystem, usually transient). This seems like a recipe for a mis-learning disaster. For comparison, my 2018 autolearn counts:

spam: 418
ham: 15018
unavailable: 166
no: 129555
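(Counts in that shape typically come straight out of the mail log; on a spamd setup something like the line below produces them, though an amavisd log may need a different pattern and path:)

    # tally autolearn outcomes from the mail log; path and log format vary by site
    grep -ho 'autolearn=[a-z]*' /var/log/maillog | sort | uniq -c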

I also manually train any spam that gets through to me (the biggest spam target), a small number of spams reported by others, and 'trap' hits.

A wide variety of ham is harder to get for training, but I have found it useful to give users a well-documented and simple way to help. One approach is to look at what happens to mail AFTER delivery, which can indicate that a message is ham without an admin ever needing to make a determination based on its content. The simplest is to learn as ham anything users mark as $NotJunk. Another is to create an "Archive" mailbox for every user and, a day after a message is moved there, learn it as ham (a rough sketch follows below).

The most important factor (especially in jurisdictions where human examination of email is a problem) is to tell users how to protect their email and then do what you tell them, robotically. In the US, Canada, and *SOME* of the EU, this is not risky. However, I have been told by people in *SOME* EU countries that they can't even robotically scan ANY mail content, so you shouldn't take my advice as authoritative: I'm not even a lawyer in the US, much less Hungary...
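As a rough sketch of that Archive idea, assuming Maildir storage under a path like /var/vmail (purely illustrative; every site's layout differs), a nightly cron job can hand anything that has sat in an Archive folder for more than a day to sa-learn:

    # Learn as ham anything sitting in a user's Archive folder for more
    # than a day. Paths are illustrative; run as the user owning the Bayes DB.
    find /var/vmail/*/Maildir/.Archive/cur -type f -mtime +1 -print0 \
      | xargs -0 -r sa-learn --ham

sa-learn remembers the message IDs it has already learned, so feeding the same messages back every night does no harm.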

> I think I have no control over what is learnt automatically.

Yes, you do. Run "perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold" for details.

You can set the learning thresholds, which control what gets learned. The defaults (0.1 and 12) mis-learn far too much spam as ham and learn too little real spam as spam. I use -0.2 and 6, which means I don't autolearn much, but everything autolearned as ham has hit at least one substantial "nice" rule or a couple of weak ones.
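In local.cf that comes out to something like the following; the values are mine, not a universal recommendation:

    # AutoLearnThreshold plugin settings; see its perldoc for the details
    bayes_auto_learn                    1
    # autolearn as ham only when a message scores at or below -0.2
    bayes_auto_learn_threshold_nonspam  -0.2
    # autolearn as spam only when a message scores at or above 6
    bayes_auto_learn_threshold_spam     6

Keep in mind that the score used for the autolearn decision is computed without the Bayes rules themselves, so it won't match the score in the delivered headers exactly.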

A lot of vehemence against autolearn gets expressed here, but not a lot of evidence that it operates poorly when configured wisely. The defaults are NOT wise.

> Let's just assume for a moment that 1.4M ham samples are valid.

Bad assumption. Your Bayes checks are uncertain about mail you've told SA is definitely spam. That's broken, and it's a sort of breakage that can only exist if a large quantity of spam has been learned as ham.
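A quick way to see how lopsided the database has become is the magic dump, and any message that was learned wrongly can simply be fed back with the correct flag, which reverses the earlier learning:

    # message counts (nspam/nham), token counts and ages for the Bayes DB
    sa-learn --dump magic
    # relearn a message that was wrongly learned as ham (placeholder path)
    sa-learn --spam /path/to/misclassified/message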

> Is there a ham:spam ratio I should stick to?

No.

> I presume that if we have a 1:1 ratio, then future messages won't be considered spam as well.

The ham:spam ratio in the Bayes DB or its autolearning is not a generally useful metric. 1:1 is not magically good, and neither is any other ratio, even with reference to a single site's mailstream. A very large ratio *on either side* indicates a likely problem in what is being learned, but you can't correlate the ratio to any particular wrong bias in Bayes scoring; it is an inherently chaotic relationship.

The factors that actually matter are correctness of learning, sample quality, and currency. You can control how current your Bayes DB is (USE AUTO-EXPIRE), but the other two factors are never going to be perfect.
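For completeness, the expiry knobs live in local.cf; these are the stock values, shown mostly as a reminder that auto-expire should stay enabled:

    # let SpamAssassin expire old tokens on its own
    bayes_auto_expire        1
    # rough token-count ceiling that triggers expiry (this is the default)
    bayes_expiry_max_db_size 150000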
