Hello, I thought of submitting a patch via Bugzilla, but decided first to ask and check that I have understood the general principles of body checks and SpamAssassin's current approach to Unicode. Apologies for the length of this message; I hope the main points make sense.

A fair number of webcam bitcoin 'sextortion' scams have evaded detection and worried recipients because they include relevant credentials. (Incidentally, I assume the credentials and addresses are mostly from the 2012 LinkedIn breach, but someone on the RIPE abuse list reports that Mailman passwords were also used.) BITCOIN_SPAM_05 is catching some of this spam, but on writing body regexes to catch the wave around 16 October, I noticed that my rules weren't matching because the source was liberally injected with invisible characters:

  Content preview: I a<U+200C>m a<U+200C>wa<U+200C>re blabla is one of your
  pa<U+200C>ss. L<U+200C>ets g<U+200C>et strai<U+200C>ght to<U+200C>
  po<U+200C>i<U+200C>nt. No<U+200C>t o<U+200C>n<U+200C>e

These characters are encoded as decimal HTML entities (&#8204;) in the HTML part and as UTF-8 byte sequences in the text/plain part. Without working these characters into a body rule pattern, that pattern will not match, yet such Unicode 'format' characters barely affect display or legibility, if at all.

This could be a more general concern about obfuscation: invisible characters could be used to evade all the ADVANCE_FEE* rules, for example. There are over 150 non-printing 'Format' characters in Unicode:

  https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Format:]

I find it counterintuitive that such non-printing characters match [:print:] and [:graph:] rather than [:cntrl:], but this is how the classes are defined at:

  https://www.unicode.org/reports/tr18/#Compatibility_Properties
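As a quick standalone illustration of that classification (my own check; run against a recent Perl, since older versions with stale Unicode tables may classify some code points differently):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # U+200C ZERO WIDTH NON-JOINER, General_Category=Cf ('Format')
  my $zwnj = "\x{200C}";

  printf "Format: %d  print: %d  graph: %d  cntrl: %d\n",
      ($zwnj =~ /\p{Format}/)  ? 1 : 0,
      ($zwnj =~ /[[:print:]]/) ? 1 : 0,
      ($zwnj =~ /[[:graph:]]/) ? 1 : 0,
      ($zwnj =~ /[[:cntrl:]]/) ? 1 : 0;
  # prints: Format: 1  print: 1  graph: 1  cntrl: 0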
As minor points, 'Format' excludes a couple of separator characters in the same range that instead match [:space:]:

  https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:subhead=Format%20character:]

Then there is the C1 [:cntrl:] set, which some MUAs may render silently, I think including the 0x9D matched by the recent __UNICODE_OBFU_ZW (what is the significance of UNICODE in that rule name?):

  https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Control:]

Finally, there may be a case for also including 'almost invisible' narrow blanks like U+200A and U+202F, and maybe U+205F.

The Perl Unicode database may not be completely up to date here: Perl 5.18 doesn't recognise the U+061C, U+2066 and U+1BCA1 ranges as \p{Format}, although 5.24 does.

I've also seen many format characters in legitimate email, including in the middle of 7-bit ASCII text. Google uses 0xFEFF (BOM) as a zero-width word joiner (a use deprecated since 2002), and U+200C apparently occurs in corporate sigs. So their mere presence isn't much evidence of obfuscation, but I presume they may prevent legitimate patterns from being matched, including by Bayes.

So my patch was going to be something to eliminate Format characters from get_rendered_body_text_array(), like:

--- lib/Mail/SpamAssassin/Message.pm    (revision 1844922)
+++ lib/Mail/SpamAssassin/Message.pm    (working copy)
@@ -1167,6 +1167,8 @@
     $text =~ s/\n+\s*\n+/\x00/gs;       # double newlines => null
 #   $text =~ tr/ \t\n\r\x0b\xa0/ /s;    # whitespace (incl. VT, NBSP) => space
 #   $text =~ tr/ \t\n\r\x0b/ /s;        # whitespace (incl. VT) => single space
+    # do not render zero-width Unicode characters used as obfuscation:
+    $text =~ s/[\p{Format}\N{U+200C}\N{U+2028}\N{U+2029}\N{U+061C}\N{U+180E}\N{U+2065}-\N{U+2069}]//gs;
     $text =~ s/\s+/ /gs;                # Unicode whitespace => single space
     $text =~ tr/\x00/\n/;               # null => newline
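To show the intended effect of the added substitution in isolation, here is a minimal sketch using just the \p{Format} part of the class, operating on a character string:

  #!/usr/bin/perl
  use strict;
  use warnings;

  # the obfuscated phrase from the spam sample, with U+200C injected:
  my $text = "I a\x{200C}m a\x{200C}wa\x{200C}re blabla is one of your pa\x{200C}ss.";

  print $text =~ /\baware\b/ ? "match\n" : "no match\n";   # no match

  $text =~ s/\p{Format}//g;    # strip the invisible characters

  print $text =~ /\baware\b/ ? "match\n" : "no match\n";   # match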
One problem here is that I'm not clear at this point whether $text is intended to be a character string (UTF8 flag set) or a byte string; the code immediately following tests this with `if utf8::is_utf8($text)`. \p{Format} includes U+00AD (SOFT HYPHEN), whose code point is also a continuation byte in UTF-8 encoding, for instance in the letter 'í' (LATIN SMALL LETTER I WITH ACUTE), so it might be incorrectly removed if $text is a byte string.

Prior to SA 3.4.1, it seems body rules would sometimes match against a character string and sometimes against a binary string. This is mentioned in bug 7490, where a single '.' was matching 'á' until 3.4.1. As a postscript to that bug, I suspect what was happening was that 'normalize_charset 1' was set and _normalize() attempted utf8::downgrade() but failed, perhaps because the message contained some non-Latin-1 text.
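To make the soft-hyphen hazard concrete, here is a minimal sketch of the two cases (my own test string, nothing SpamAssassin-specific):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Encode qw(encode decode);

  my $chars = "mar\x{ED}a";               # 'maría' as a character string
  my $bytes = encode('UTF-8', $chars);    # "mar\xC3\xADa" as a byte string

  # on the byte string, \xAD (the continuation byte of 'í') is seen as
  # the code point U+00AD SOFT HYPHEN, which is \p{Format}, and removed:
  (my $b = $bytes) =~ s/\p{Format}//g;
  print eval { decode('UTF-8', $b, Encode::FB_CROAK); 1 }
      ? "still valid UTF-8\n" : "corrupted UTF-8\n";       # corrupted UTF-8

  # on the character string the same substitution is harmless:
  (my $c = $chars) =~ s/\p{Format}//g;
  print $c eq $chars ? "unchanged\n" : "changed\n";        # unchanged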
On the other hand, will `s/\s+/ /gs` fail to normalise all Unicode [:blank:] characters correctly unless $text is marked as a character string? What are the design decisions here? Can I find them on this list, the wiki or elsewhere? Also, what is the approach to the 7-bit characters [\x00-\x1f\x7f]?

Here are some significant commits that seem to make the process of decoding and rendering more reliable, and more like email client display, but don't solve the format character issue:

  http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Message.pm?r1=1707582&r2=1707597
  http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Message/Node.pm?r1=1749286&r2=1749798

IMHO it would be nice if it were possible to change the related behaviour via a plugin, at the parsed_metadata() or start_rules() hook, but AFAICS there is no way for a plugin to alter the rendered message. You can use `replace_rules`/`replace_tag` to pre-process a rule (this fuzziness has the advantage that the same code point may obfuscate, say, both I and L, but it doesn't help much with invisible characters at the moment). However, there is nothing to pre-process and canonicalise the text being matched, which would simplify rule writing.

I have often been unclear on what I need to do to get a body rule to match accented or Cyrillic characters, sometimes checking the byte stream in different encodings and transcribing to hex by hand. 'rawbody' rules should no doubt match the encoded 'raw' data, but I wonder if 'body' rules would work better if they concentrated on the meaning of the words, without having to worry about multiple possible encodings and transmission systems.

So, if I can venture a radical suggestion: should body rules actually match against a character string, as they have sometimes (apparently unintentionally) been doing? Could this be a configuration setting, as a function of or in addition to normalize_charset? Very little cannot be represented in a character string, which has been Perl's preferred model since version 5.8. Although there may be some obscure encodings that would require work to decode, is it better to decode and normalise what can be decoded reasonably reliably, and represent the rest as Unicode code points with the same value as the bytes? (That should match \xNN for rare encodings.) Is there still a performance issue?

To make such functionality (if enabled) as compatible as possible with existing rulesets, the Conf module might detect valid UTF-8 literals in body regexes and decode those; and where there are \xNN escape sequences (up to 62 subrules in the main rules), if they form valid contiguous UTF-8, they could be decoded too. Where there are more complex sequences, as in __BENEFICIARY or \xef(?:\xbf[\xb9-\xbb]|\xbb\xbf), perhaps those should have been rawbody rules anyway, or could be rewritten to be encoding-independent and to eliminate finesses of Unicode like the Format characters.
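For the literal case, I had in mind something along these lines (a sketch only; pattern_as_chars is a name I've made up, and the \xNN case would need an extra pass that parses the escapes before testing for contiguous UTF-8):

  use Encode ();

  # if a rule's pattern is a valid UTF-8 byte sequence, return it decoded
  # to a character string; otherwise leave the pattern untouched
  sub pattern_as_chars {
      my ($pat) = @_;
      my $chars = eval {
          Encode::decode('UTF-8', $pat, Encode::FB_CROAK | Encode::LEAVE_SRC)
      };
      return defined $chars ? $chars : $pat;
  }

An ASCII-only pattern is trivially valid UTF-8 and passes through unchanged, so something like this ought to be safe to apply across a ruleset.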
I'd be grateful for advice as to whether there's merit in filing these concerns as one or more issues on Bugzilla, or for any relevant background.

CK