Re: Bayes underperforming, HTML entities?

Bill Cole Thu, 08 Nov 2018 13:51:03 -0800

On 7 Nov 2018, at 14:33, Amir Caspi wrote:

Hi all,
In the past couple of weeks I've gotten a number of clearly-spam messages that slipped past SA, and the only reason was because they were getting low Bayes scores (BAYES_50 or even down to BAYES_00 or BAYES_05). I do my Bayes training manually on both ham and spam so there should not be any mis-categorizations... and things worked fine until a few weeks ago, so I don't know what's going on now.
Here's the magic dump:

-bash-3.2$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0     253112          0  non-token data: nspam
0.000          0     106767          0  non-token data: nham
0.000          0     150434          0  non-token data: ntokens
0.000          0 1536087614          0  non-token data: oldest atime
0.000          0 1541617125          0  non-token data: newest atime
0.000          0 1541614751          0  non-token data: last journal sync atime
0.000          0 1541614749          0  non-token data: last expiry atime
0.000          0    5529600          0  non-token data: last expire atime delta
0.000          0       1173          0  non-token data: last expire reduction count

I don't see any obvious problem but I'm not an expert at interpreting these...

The only useful info is that the number of spams and hams scanned (nham and nspam) is well above the usage threshold, and that the various timestamps (other than 'oldest atime') are reasonably recent. If you happen not to live in Unix epoch time, the conversion is not hard:


   # date -j -f %s 1541617125
   Wed Nov  7 13:58:45 EST 2018
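
The `date -j -f %s` form above is the BSD one; with GNU date the equivalent is `date -d @1541617125`. For converting all the atimes in one go, here is a minimal Python sketch. The dictionary just copies the values from the dump quoted earlier, and `epoch_to_utc` is a name made up for this example, not anything from SpamAssassin. It prints UTC rather than local time, so the hours differ from the EST output above.

```python
from datetime import datetime, timezone

# Timestamps copied from the "sa-learn --dump magic" output quoted earlier.
ATIMES = {
    "oldest atime": 1536087614,
    "newest atime": 1541617125,
    "last journal sync atime": 1541614751,
    "last expiry atime": 1541614749,
}


def epoch_to_utc(seconds: int) -> str:
    """Format a Unix epoch timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    for label, seconds in ATIMES.items():
        print(f"{label:<26}{epoch_to_utc(seconds)}")
```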

Do I need to completely trash and rebuild my DB, or am I missing something obvious?


No and no.

Although it is perhaps helpful to recognize that Bayes is inherently imperfect and will always be wrong about some messages.

In many cases, it would appear that these spams have either very little (real) text (besides the usual attempt at Bayes poisoning) and/or are using HTML-entity encoding to try to bypass Bayes. Here are a couple of spamples:
https://pastebin.com/peiXZivJ
https://pastebin.com/3h3r7r7j

Those both have broken MIME structure, so SA can't treat the HTML part as HTML. No MUA would render and display them correctly.

Assuming that you did that breakage yourself, intentionally: Stop doing that. It is pointless and hampers any attempt to assist you. The only things that could ever be private about spam are the target address and internally-added headers.
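
One rough way to check whether a saved sample still has usable MIME structure before posting it is to run it through a tolerant parser and look at what it complains about. The sketch below uses Python's standard `email` package, not SpamAssassin's own parser, so it only approximates what SA would see. `mime_report` and the `BROKEN` sample are invented for illustration; the actual breakage in the pastebin samples may be of a different kind.

```python
from email import message_from_string, policy


def mime_report(raw: str):
    """Return (content_type, [defect names]) for every MIME part in raw."""
    msg = message_from_string(raw, policy=policy.default)
    return [
        (part.get_content_type(), [type(d).__name__ for d in part.defects])
        for part in msg.walk()
    ]


# Hypothetical sample: the header declares multipart/alternative, but the
# body never contains the declared boundary, so no HTML part can be found.
BROKEN = (
    "Subject: example\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "<html><body>&#72;&#101;llo</body></html>\n"
)

if __name__ == "__main__":
    for content_type, defects in mime_report(BROKEN):
        print(content_type, defects)
```

A message that parses cleanly yields an empty defect list for every part, with the HTML showing up as its own `text/html` part. In the sample above the parser never finds a `text/html` part at all, which is the situation where the HTML cannot be handled as HTML.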

Does SA decode HTML entities as part of normalize_charset? If not ... can this be added?

I'm not entirely certain, but the documentation of bayes_token_sources in Mail::SpamAssassin::Conf implies that HTML is rendered to text to the point where SA can tell whether it is visible, which makes me suspect that the entities get decoded. But that IS just a guess: I haven't traced the code.

Empirically, I had SA learn a message with regular text in an HTML part encoded as entities and then scanned a message with the same text as text, and I got a 1.000 Bayes score (BAYES_999) for the second one. YMMV
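
To illustrate why the decoding matters, here is a small Python sketch. It is not SA's tokenizer (SA is Perl, and its Bayes tokenization is considerably more involved); `naive_tokens` and `entity_encode` are stand-ins made up for this example. It only shows the general point: tokens taken from undecoded entities share nothing with the plain-text tokens, while tokens taken after decoding are identical to them, which is consistent with the experiment above.

```python
import html
import re


def naive_tokens(text: str) -> set:
    """Lowercased word tokens; a crude stand-in for a real Bayes tokenizer."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def entity_encode(text: str) -> str:
    """Encode every character as a decimal HTML entity, e.g. 'A' -> '&#65;'."""
    return "".join(f"&#{ord(ch)};" for ch in text)


PLAIN = "Cheap pills, act now"
ENCODED = entity_encode(PLAIN)

if __name__ == "__main__":
    print("plain:    ", sorted(naive_tokens(PLAIN)))
    print("undecoded:", sorted(naive_tokens(ENCODED)))
    print("decoded:  ", sorted(naive_tokens(html.unescape(ENCODED))))
```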


--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Available For Hire: https://linkedin.com/in/billcole
