On 7 Nov 2018, at 14:33, Amir Caspi wrote:
Hi all,
In the past couple of weeks I've gotten a number of clearly-spam
messages that slipped past SA, and the only reason was that they
were getting low Bayes scores (BAYES_50 or even down to BAYES_00 or
BAYES_05). I do my Bayes training manually on both ham and spam so
there should not be any mis-categorizations... and things worked fine
until a few weeks ago, so I don't know what's going on now.
Here's the magic dump:
-bash-3.2$ sa-learn --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 253112 0 non-token data: nspam
0.000 0 106767 0 non-token data: nham
0.000 0 150434 0 non-token data: ntokens
0.000 0 1536087614 0 non-token data: oldest atime
0.000 0 1541617125 0 non-token data: newest atime
0.000 0 1541614751 0 non-token data: last journal sync atime
0.000 0 1541614749 0 non-token data: last expiry atime
0.000 0 5529600 0 non-token data: last expire atime delta
0.000 0 1173 0 non-token data: last expire reduction count
I don't see any obvious problem but I'm not an expert at interpreting
these...
The only useful info is that the number of spams and hams learned (nspam
and nham) is well above the usage threshold and that the various
timestamps (other than 'oldest atime') are reasonably recent. If you
happen not to live in Unix epoch time, the conversion is not hard:
# date -j -f %s 1541617125
Wed Nov 7 13:58:45 EST 2018
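If you would rather check the whole dump in one pass, here is a minimal Python sketch of the same reading. It is not part of SpamAssassin; the helper names (parse_magic, to_utc) are made up for illustration, and the threshold of 200 is my assumption based on the documented defaults of bayes_min_spam_num and bayes_min_ham_num, so adjust it if your config overrides them.

```python
from datetime import datetime, timezone

# Values copied from the "sa-learn --dump magic" output quoted above.
DUMP = """\
0.000 0 3 0 non-token data: bayes db version
0.000 0 253112 0 non-token data: nspam
0.000 0 106767 0 non-token data: nham
0.000 0 150434 0 non-token data: ntokens
0.000 0 1536087614 0 non-token data: oldest atime
0.000 0 1541617125 0 non-token data: newest atime
0.000 0 1541614751 0 non-token data: last journal sync atime
0.000 0 1541614749 0 non-token data: last expiry atime
0.000 0 5529600 0 non-token data: last expire atime delta
0.000 0 1173 0 non-token data: last expire reduction count
"""

# Assumption: default value of bayes_min_spam_num / bayes_min_ham_num.
MIN_LEARNED = 200


def parse_magic(text):
    """Map each 'non-token data' label to its integer value (third column)."""
    values = {}
    for line in text.splitlines():
        fields, sep, name = line.partition(" non-token data: ")
        if not sep:
            continue  # not a magic line
        values[name.strip()] = int(fields.split()[2])
    return values


def to_utc(epoch):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


magic = parse_magic(DUMP)
enough = magic["nspam"] >= MIN_LEARNED and magic["nham"] >= MIN_LEARNED
print("enough learned messages for Bayes to be used:", enough)
for key in ("oldest atime", "newest atime",
            "last journal sync atime", "last expiry atime"):
    print(f"{key}: {to_utc(magic[key]):%Y-%m-%d %H:%M:%S} UTC")
print("last expire atime delta in days:",
      magic["last expire atime delta"] / 86400)
```

The times are printed in UTC rather than local time, so they will differ from the date output above by the local offset.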
Do I need to completely trash and rebuild my DB, or am I missing
something obvious?
No and no.
Although it is perhaps helpful to recognize that Bayes is inherently
imperfect and will always be wrong about some messages.
In many cases, it would appear that these spams either have very
little (real) text (besides the usual attempt at Bayes poisoning)
or are using HTML-entity encoding to try to bypass Bayes, or both.
Here are a couple of spamples:
https://pastebin.com/peiXZivJ
https://pastebin.com/3h3r7r7j
Those both have broken MIME structure, so SA can't treat the HTML part
as HTML. No MUA would render and display them correctly.
Assuming that you did that breakage yourself, intentionally: Stop doing
that. It is pointless and hampers any attempt to assist you. The only
things that could ever be private about spam are the target address and
internally-added headers.
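For anyone who wants to sanity-check a saved sample before posting it, here is a rough Python sketch using the standard email package. It only illustrates the general problem (a body that looks like HTML but is not declared as such is not treated as HTML by a MIME parser); it is not how SA parses mail, and I am not claiming the two pastebin samples are broken in exactly this way. The function name mime_summary and both sample messages are invented for the example.

```python
import email
from email import policy


def mime_summary(raw):
    """Return (content types of all parts, names of parser defects)."""
    msg = email.message_from_string(raw, policy=policy.default)
    types = [part.get_content_type() for part in msg.walk()]
    defects = [type(d).__name__ for part in msg.walk() for d in part.defects]
    return types, defects


# Hypothetical well-formed message: the HTML body is declared as text/html.
GOOD = (
    "From: a@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><body><p>hello</p></body></html>\n"
)

# Hypothetical damaged message: same body, but the MIME headers were stripped,
# so a parser falls back to the default type, text/plain.
BROKEN = (
    "From: a@example.com\n"
    "Subject: test\n"
    "\n"
    "<html><body><p>hello</p></body></html>\n"
)

good_types, good_defects = mime_summary(GOOD)
broken_types, broken_defects = mime_summary(BROKEN)
print("good:", good_types, good_defects)
print("broken:", broken_types, broken_defects)
```

If the list of types for your sample has no text/html entry, or the defects list is not empty, the message was probably damaged somewhere between your mailbox and the paste.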
Does SA decode HTML entities as part of normalize_charset? If not ...
can this be added?
I'm not entirely certain, but the documentation of bayes_token_sources
in Mail::SpamAssassin::Conf implies that HTML is rendered to text to the
point where SA can tell whether it is visible, which makes me suspect
that the entities get decoded. But that IS just a guess: I haven't
traced the code.
Empirically, I had SA learn a message with regular text in an HTML part
encoded as entities and then scanned a message with the same text as
plain text, and I got a 1.000 Bayes score (BAYES_999) for the second
one. YMMV.
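To show why decoding would matter for that experiment, here is a small Python sketch using the standard html module. It is only an analogy: SA is written in Perl and its Bayes tokenizer is far more involved, and the functions to_numeric_entities and tokens are invented for this example. The point is just that entity-encoded text and plain text produce the same words once the entities are decoded, and share nothing if they are not.

```python
import html


def to_numeric_entities(text):
    """Encode every character as a decimal HTML entity, e.g. 'C' -> '&#67;'."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def tokens(text):
    """Very naive tokenizer: decode entities, lowercase, split on whitespace."""
    return html.unescape(text).lower().split()


plain = "Cheap pills online"
encoded = to_numeric_entities(plain)

print("encoded form:", encoded)
print("tokens after decoding:", tokens(encoded))
print("same tokens as plain text:", tokens(encoded) == tokens(plain))
# Without decoding, the encoded form is one opaque string with no spaces.
print("whitespace-separated pieces without decoding:", len(encoded.split()))
```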
--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Available For Hire: https://linkedin.com/in/billcole