[Resending because it looks like my first send went into a black hole...]

On 7 Nov 2018, at 14:33, Amir Caspi wrote:

Hi all,

In the past couple of weeks I've gotten a number of clearly-spam messages that slipped past SA, and the only reason was that they were getting low Bayes scores (BAYES_50 or even down to BAYES_00 or BAYES_05). I do my Bayes training manually on both ham and spam so there should not be any mis-categorizations... and things worked fine until a few weeks ago, so I don't know what's going on now.

Here's the magic dump:

-bash-3.2$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0     253112          0  non-token data: nspam
0.000          0     106767          0  non-token data: nham
0.000          0     150434          0  non-token data: ntokens
0.000          0 1536087614          0  non-token data: oldest atime
0.000          0 1541617125          0  non-token data: newest atime
0.000          0 1541614751          0  non-token data: last journal sync atime
0.000          0 1541614749          0  non-token data: last expiry atime
0.000          0    5529600          0  non-token data: last expire atime delta
0.000          0       1173          0  non-token data: last expire reduction count


I don't see any obvious problem but I'm not an expert at interpreting these...

The only useful info is that the numbers of spams and hams learned (nspam and nham) are well above the usage threshold, and that the various timestamps (other than 'oldest atime') are reasonably recent. If you happen not to live in Unix epoch time, the conversion is not hard:

   # date -j -f %s 1541617125
   Wed Nov  7 13:58:45 EST 2018
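
The `date -j -f %s` form above is the BSD/macOS syntax; GNU date takes `date -d @1541617125` instead. If neither is handy, a minimal Python sketch does the same conversion for all of the atime fields at once. The numbers are copied from the dump above; the names `MAGIC`, `EXPIRE_ATIME_DELTA` and `to_utc` are only illustrative and not part of SpamAssassin.

```python
from datetime import datetime, timezone

# Values copied from the `sa-learn --dump magic` output quoted above.
MAGIC = {
    "oldest atime": 1536087614,
    "newest atime": 1541617125,
    "last journal sync atime": 1541614751,
    "last expiry atime": 1541614749,
}
EXPIRE_ATIME_DELTA = 5529600  # "last expire atime delta", in seconds


def to_utc(epoch_seconds):
    """Format a Unix epoch timestamp as a UTC date string."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S UTC")


for name, value in MAGIC.items():
    print(f"{name:26s} {to_utc(value)}")

# Convert the expiry delta from seconds to days for readability.
print(f"expire atime delta: {EXPIRE_ATIME_DELTA / 86400:.1f} days")
```

The output is in UTC, so 'newest atime' shows as 18:58:45 rather than the 13:58:45 EST printed by date(1) above.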


Do I need to completely trash and rebuild my DB, or am I missing something obvious?

No and no.

Although it is perhaps helpful to recognize that Bayes is inherently imperfect and will always be wrong about some messages.

In many cases, it would appear that these spams have either very little (real) text (besides the usual attempt at Bayes poisoning) and/or are using HTML-entity encoding to try to bypass Bayes. Here are a couple of spamples:

https://pastebin.com/peiXZivJ
https://pastebin.com/3h3r7r7j

Those both have broken MIME structure, so SA can't treat the HTML part as HTML. No MUA would render and display them correctly.
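
One rough way to check whether a saved sample still has intact MIME structure before posting it is to run it through a parser and see which parts come out. The sketch below uses Python's standard `email` package, which is NOT SpamAssassin's parser and may be more or less tolerant than SA is; the `GOOD` and `BROKEN` messages and the `mime_report` helper are made up for illustration.

```python
from email import message_from_string, policy


def mime_report(raw):
    """Return (content types of all parts, names of parser defects)."""
    msg = message_from_string(raw, policy=policy.default)
    types = [part.get_content_type() for part in msg.walk()]
    defects = [type(d).__name__ for part in msg.walk() for d in part.defects]
    return types, defects


# A well-formed two-part message: the boundary lines are all present.
GOOD = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b1"\n'
    "\n"
    "--b1\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "hello\n"
    "--b1\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<p>hello</p>\n"
    "--b1--\n"
)

# Same headers, but the boundary lines and part headers were stripped,
# so the parser never finds a text/html part.
BROKEN = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b1"\n'
    "\n"
    "<p>hello</p>\n"
)

for label, raw in (("good", GOOD), ("broken", BROKEN)):
    types, defects = mime_report(raw)
    print(label, types, defects)
```

If the report for a sample shows no text/html part, or lists defects, the copy being shared is probably not the message as it arrived.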

Assuming that you did that breakage yourself, intentionally: Stop doing that. It is pointless and hampers any attempt to assist you. The only things that could ever be private about spam are the target address and internally-added headers.

Does SA decode HTML entities as part of normalize_charset? If not ... can this be added?

I'm not entirely certain, but the documentation of bayes_token_sources in Mail::SpamAssassin::Conf implies that HTML is rendered to text to the point where SA can tell whether it is visible, which makes me suspect that the entities get decoded. But that IS just a guess: I haven't traced the code.

Empirically, I had SA learn a message with regular text in an HTML part encoded as entities and then scanned a message with the same text as text, and I got a 1.000 Bayes score (BAYES_999) for the second one. YMMV
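
To illustrate why that result is what you would expect if the entities are decoded before tokenizing, here is a small Python sketch. It is not SA's code: the whitespace tokenizer and the sample phrase are stand-ins I made up, and `html.unescape` is Python's standard-library decoder, not anything SA calls.

```python
import html


def entity_encode(text):
    """Encode every character as a numeric HTML entity, e.g. 'a' -> '&#97;'."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def tokens(text):
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


plain = "Cheap pills shipped overnight"
encoded = entity_encode(plain)

# Without decoding, the encoded form is one opaque blob of entities.
print(tokens(encoded))

# After decoding, it yields exactly the same tokens as the plain text.
print(tokens(html.unescape(encoded)) == tokens(plain))
```

If the decoding happens before Bayes sees the text, entity encoding by itself buys the spammer nothing, which matches what I saw in the test above.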

--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Available For Hire: https://linkedin.com/in/billcole
