On 2023-07-28 at 00:26:51 UTC-0400 (Thu, 27 Jul 2023 23:26:51 -0500 (CDT))
David B Funk <users@spamassassin.apache.org>
is rumored to have said:
On Fri, 28 Jul 2023, Jared Hall wrote:
On 7/27/2023 12:08 PM, Ken D'Ambrosio wrote:
Hey, all. I've recently started getting spam that's really hard to
deal with, and I'm open to suggestions as to how to approach it.
Superficially,
[snip..]
The damn body's been encoded! And there's so little in there that
it's not triggering on many rules (e.g., Bayesian doesn't go over
20%). If anyone has a bright idea -- maybe a way to decode the
attachments and run a regex against _that_? -- I'm all ears.
1. There are milters/content filters that decode Base64 message
parts (amavisd-new, mimedefang, etc.) for processing by SA.
2. There are still sufficiently unique items: First-Name-Only,
Mixed-Case word in the Subject (NLP modeling), and a Base-64 encoded
HTML attachment (w/ UTF-8 encoding no less). Combined in a Meta
rule, these innocuous items will likely hit with good accuracy even
without Base64 decoding.
Umm, unless I'm really missing something here, the usual SA processing
decodes such body stuff (QP, Base64, etc.) and feeds the "cleaned" text
to the rule processing engine.
Correct. It has nothing to do with the calling glue.
You have to work hard to get matches done on the raw stuff if you want
to do special rule matching on the un-decoded body.
Correct. That should only be needed in rare cases where you're looking
for a pattern in a non-text part.
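
To illustrate the decoded-versus-raw distinction outside of SA, here is a
minimal Python sketch using only the standard library's email package. It
is not SpamAssassin's own code. The sample message, the addresses, and the
function name decoded_text_parts are all made up for illustration. It only
shows that a pattern which is invisible in the raw Base64 payload is
plainly there once a 'text' part has been decoded.

```python
import base64
import re
from email import message_from_string, policy

# Hypothetical sample: a single text/html part sent with Base64 encoding.
html = "<html><body><p>Hello Ken, claim your prize now</p></body></html>"
encoded = base64.b64encode(html.encode("utf-8")).decode("ascii")
raw_message = (
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: Ken\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + encoded + "\n"
)


def decoded_text_parts(raw):
    """Return the decoded content of every part whose primary type is 'text'."""
    msg = message_from_string(raw, policy=policy.default)
    parts = []
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # get_content() undoes the transfer encoding (Base64, QP)
            # and decodes the bytes using the part's declared charset.
            parts.append(part.get_content())
    return parts


pattern = re.compile(r"claim your prize", re.IGNORECASE)
raw_hit = bool(pattern.search(raw_message))
decoded_hit = any(pattern.search(text) for text in decoded_text_parts(raw_message))
print("match on raw message:", raw_hit)
print("match on decoded text:", decoded_hit)
```

The pattern only matches after decoding, which is why a rule aimed at the
readable text should not need any special handling for Base64.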
I'm not sure why the OP's rule didn't match the target message, but it
is NOT because of the Base64 encoding of parts with the 'text' primary
MIME type. If I had to guess, I'd look for invisible characters hidden
in the text (e.g. Unicode "zero width non-joiner" marks and the like)
that break the pattern and for lookalike non-ASCII characters (often
Cyrillic or Greek) in the target string.
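
A rough way to check a sample for both tricks, sketched in Python with the
standard unicodedata module. The function names find_invisible and
scripts_used and the sample strings are hypothetical. Taking the first
word of a character's Unicode name as its "script" is a crude heuristic I
am assuming is good enough for eyeballing a message, not a proper script
lookup.

```python
import unicodedata


def find_invisible(text):
    """List (index, name) for format characters (category Cf),
    which include the zero-width joiner/non-joiner marks."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def scripts_used(text):
    """Collect a rough script label for each letter, taken from the
    first word of its Unicode name (e.g. LATIN, CYRILLIC, GREEK)."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])
    return scripts


# Hypothetical target strings as they might appear in decoded body text.
with_zwnj = "Pay\u200cPal"      # zero width non-joiner hidden inside the word
with_cyrillic = "P\u0430yPal"   # Cyrillic small a in place of Latin a

print(find_invisible(with_zwnj))
print(sorted(scripts_used(with_cyrillic)))
```

If either check turns something up, the rule's pattern has to allow for
those characters or it will keep missing.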
--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire