Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam

Bill Cole Tue, 13 Feb 2018 20:48:43 -0800

On 13 Feb 2018, at 9:33, Horváth Szabolcs wrote:

This is a production mail gateway serving since 2015. I saw that a few messages (both hams and spams) automatically learned by amavisd/spamassassin. Today's statistics:
   3616 autolearn=ham
  10076 autolearn=no
   2817 autolearn=spam
    134 autolearn=unavailable

That's quite high for spam, ham, AND "unavailable" (which indicates something wrong with the Bayes subsystem, usually transient). This seems like a recipe for a mis-learning disaster. For comparison, my 2018 autolearn counts:


spam: 418
ham: 15018
unavailable: 166
no: 129555
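
Tallies like the ones above can be produced from any text that carries SpamAssassin's `autolearn=` tag, such as X-Spam-Status headers. The following is only a minimal sketch: it assumes each relevant line contains `autolearn=<value>`, and the function name `count_autolearn` is made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches the tag SpamAssassin writes into X-Spam-Status, e.g. "autolearn=ham".
AUTOLEARN_RE = re.compile(r"autolearn=([A-Za-z_]+)")


def count_autolearn(lines):
    """Count how often each autolearn value appears in an iterable of lines."""
    counts = Counter()
    for line in lines:
        for value in AUTOLEARN_RE.findall(line):
            counts[value] += 1
    return counts


if __name__ == "__main__":
    # Usage (assumed): feed log or header lines on standard input.
    for value, n in count_autolearn(sys.stdin).most_common():
        print(f"{n:8d} autolearn={value}")
```

Lines without the tag are ignored, so the script can be pointed at a whole log file without pre-filtering.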

I also manually train any spam that gets through to me (the biggest spam target), a small number of spams reported by others, and 'trap' hits. A wide variety of ham is harder to get for training but I have found it useful to give users a well-documented and simple way to help. One way is to look at what happens to mail AFTER delivery, which can indicate that a message is ham without needing an admin to try to make a determination based on content. The simplest one is to learn anything users mark as $NotJunk as ham. Another is to create an "Archive" mailbox for every user and learn anything as ham that has been moved there a day after it is moved.

The most important factor (especially in jurisdictions where human examination of email is a problem) is to tell users how to protect their email and then do what you tell them, robotically. In the US, Canada, and *SOME* of the EU, this is not risky. However, I have been told by people in *SOME* EU countries that they can't even robotically scan ANY mail content, so you shouldn't take my advice as authoritative: I'm not even a lawyer in the US, much less Hungary...
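
The "Archive" mailbox idea above could be automated roughly as follows. This is a sketch under stated assumptions, not the setup described in this message: how the time of the move is recorded is left open, so the `moved_at` mapping and both function names are hypothetical. The only real tool referenced is `sa-learn --ham`.

```python
from datetime import datetime, timedelta


def ham_candidates(moved_at, now, min_age=timedelta(days=1)):
    """Return paths of messages that have sat in Archive for at least min_age.

    moved_at is a hypothetical mapping of message path -> datetime of the move;
    how that timestamp is obtained depends on the mail store.
    """
    return sorted(path for path, when in moved_at.items() if now - when >= min_age)


def sa_learn_ham_command(paths):
    """Build (but do not run) the sa-learn invocation for the given files."""
    return ["sa-learn", "--ham", *paths]


if __name__ == "__main__":
    now = datetime(2018, 2, 14, 12, 0)
    moved = {
        "/example/Archive/msg1": datetime(2018, 2, 13, 11, 0),  # over a day old
        "/example/Archive/msg2": datetime(2018, 2, 14, 9, 0),   # too recent
    }
    print(sa_learn_ham_command(ham_candidates(moved, now)))
```

Waiting a day gives the user time to change their mind before the message is learned, and building the command separately from running it makes the selection easy to check first.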

I think I have no control over what is learnt automatically.

Yes, you do. Run "perldoc Mail::SpamAssassin::Plugin::AutoLearnThreshold" for details.

You can set the learning thresholds, which control what gets learned. The defaults (0.1 and 12) mis-learn far too much spam as ham and not enough spam. I use -0.2 and 6, which means I don't autolearn a lot but everything I autolearn as ham has at least one hit on a substantial "nice" rule or 2 hits on weak ones.
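
In SpamAssassin configuration the two thresholds are the options `bayes_auto_learn_threshold_nonspam` and `bayes_auto_learn_threshold_spam`. To show why the choice matters, here is a deliberately simplified sketch of the threshold test. It is not the plugin's code: the function name is made up, the exact boundary handling is an assumption, and the real plugin applies further conditions that are omitted here (for example, it scores the message without the Bayes rules and requires minimum header and body points before learning as spam).

```python
def autolearn_decision(score, ham_below=0.1, spam_at_or_above=12.0):
    """Simplified autolearn decision for a message's (non-Bayes) score.

    Defaults mirror the stock thresholds of 0.1 and 12.
    Boundary handling (< and >=) is an assumption of this sketch.
    """
    if score < ham_below:
        return "ham"
    if score >= spam_at_or_above:
        return "spam"
    return "no"


if __name__ == "__main__":
    for score in (-1.0, 0.0, 8.0, 15.0):
        stock = autolearn_decision(score)
        tuned = autolearn_decision(score, ham_below=-0.2, spam_at_or_above=6.0)
        print(f"score {score:5.1f}: defaults -> {stock:4s}  -0.2/6 -> {tuned}")
```

With the defaults, a message that hits no rules at all (score 0.0) is learned as ham; with -0.2 it is not, because some negative-scoring rule has to fire first.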

There's a lot of vehemence against autolearn expressed here but not a lot of evidence that it operates poorly when configured wisely. The defaults are NOT wise.

Let's just assume for a moment that 1.4M ham-samples are valid.

Bad assumption. Your Bayes checks are uncertain about mail you've told SA is definitely spam. That's broken. It's a sort of breakage that cannot exist if you do not have a large quantity of spam that has been learned as ham.

Is there a ham:spam ratio I should stick to?

No.

I presume if we have a 1:1 ratio then future messages won't be considered as spam as well.

The ham:spam ratio in the Bayes DB or its autolearning is not a generally useful metric. 1:1 is not magically good and neither is any other ratio, even with reference to a single site's mailstream. A very large ratio *on either side* indicates a likely problem in what is being learned, but you can't correlate the ratio to any particularly wrong bias in Bayes scoring. It is an inherently chaotic relationship. Factors that actually matter are correctness of learning, sample quality, and currency. You can control how current your Bayes DB is (USE AUTO-EXPIRE) but the other two factors are never going to be perfect.
