Re: Detect Emoticons in Subject
On Thu, 20 May 2021 19:39:06 +0100 RW wrote:

> /\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xBF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/

This includes the block mentioned by Bill Cole and is simplified a bit:

/\xF0\x9F[\x98-\x99\xA4-\xA7\x8C-\x97][\x80-\xBF]|\xE2\x98[\xB9-\xBB]/

However, if you don't expect to get any legitimate mail with Asian languages in the subject, you can probably get away with including all 4-byte UTF-8. Those code points are dominated by CJK, symbols, emojis and dead languages.

/[\xF0-\xF7][\x80-\xBF]{3}|\xE2\x98[\xB9-\xBB]/
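A quick way to see what the catch-all pattern does and does not hit is to run it over a few sample characters. The sketch below uses Python's byte-oriented `re` purely as a stand-in for SpamAssassin's byte matching; the sample characters and the `hits` helper are illustrative choices, not part of the thread.

```python
import re

# The catch-all pattern from the message above: any 4-byte UTF-8 sequence,
# plus the three faces U+2639..U+263B from Miscellaneous Symbols.
FOUR_BYTE_RE = re.compile(rb"[\xF0-\xF7][\x80-\xBF]{3}|\xE2\x98[\xB9-\xBB]")


def hits(text: str) -> bool:
    """Return True if the UTF-8 bytes of `text` match the pattern."""
    return FOUR_BYTE_RE.search(text.encode("utf-8")) is not None


# Illustrative samples (assumptions chosen for this sketch).
samples = {
    "emoji U+1F389": "\U0001F389",            # encodes to 4 bytes
    "frowning face U+2639": "\u2639",         # 3 bytes, listed explicitly
    "CJK Extension B U+20000": "\U00020000",  # 4 bytes
    "common CJK U+4E2D": "\u4E2D",            # 3 bytes
    "Latin e-acute U+00E9": "\u00E9",         # 2 bytes
}
results = {label: hits(ch) for label, ch in samples.items()}

for label, matched in results.items():
    print(f"{label}: {matched}")
```

In this sketch the everyday CJK character encodes to three bytes and is not matched, while the CJK extension character encodes to four bytes and is, which is the trade-off the message describes.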
Re: Detect Emoticons in Subject
On Fri, May 21, 2021 at 09:53:36AM +0200, Tom Hendrikx wrote:

> Can someone explain why SA cannot support this type of syntax, or what would
> be needed to get it supported? IMHO it makes it a lot easier for end-users
> to understand a rule, and for rule developers to write or even contribute
> new UTF-8-related rules, so it might be worth the effort to get it
> supported?

Perl strings would have to be UTF-8 internally. A mandatory prerequisite would be normalize_charset 1 in SA. There could be cases where SA can't decode mails properly to UTF-8, so it's an open question what happens then.

Some changes are already coming in 4.0; for example, normalize_charset 1 will be the default. But more complex internal/rule changes require a lot of thought on how to maintain backwards compatibility. I'm sure some people will still run 3.4 for years to come.

Sorry to say, but there are too few developers right now. It's up to the community to pick up the pace.
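The difference between matching raw bytes and matching decoded characters can be sketched outside SA. This is only an analogy in Python, not a description of SpamAssassin internals: the subject line is made up, and the character-level rule uses a plain code point range because Python's standard `re` module has no \p{...} properties.

```python
import re

# Hypothetical header value as it would arrive on the wire (UTF-8 bytes).
raw_subject = "Limited offer \U0001F600".encode("utf-8")

# Byte-oriented view: the pattern has to spell out the UTF-8 encoding
# of the Emoticons block (U+1F600..U+1F64F) byte by byte.
byte_rule = re.compile(rb"\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])")

# Character-oriented view: once the text is decoded, a single code point
# range expresses the same block.
char_rule = re.compile("[\U0001F600-\U0001F64F]")

byte_hit = byte_rule.search(raw_subject) is not None
char_hit = char_rule.search(raw_subject.decode("utf-8")) is not None

print(byte_hit, char_hit)
```

The character-level form only works if decoding succeeds, which is the open question raised above about mails SA cannot decode properly.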
Re: Detect Emoticons in Subject
On 20-05-2021 18:19, RW wrote:

> On Thu, 20 May 2021 11:42:59 -0400 Clive Jacques wrote:
>
> > Hi,
> >
> > I've been using SA a long time. Lately, I'm getting more and more spam with emoticons in the subject line. I'd say about 90% of my emails with emoticons in the subject are spam. I'd like to create a local rule which scores email with emoticons in the subject.
> >
> > # Local Rule for Emoticons in subject
> > subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
>
> The rule should start with "header", that's what's causing the lint failure.
>
> However, AFAIK, the rule still won't work because \p{Emoticons} isn't supported in spamassassin, which works on byte sequences. You need to rewrite it to match UTF-8 bytes.

I'm not a real fan of very complex regular expressions, as they tend to get hard to read and understand very quickly. This thread is a perfect example: the syntax that the OP proposed (/\p{Emoticons}/) seems perfectly readable, and all the alternatives that actually work are, with all respect to the authors, a nightmare to decipher. Especially for users not really proficient in regular expressions, the OP's syntax is perfectly understandable and the alternatives aren't.

I'm not really into the regex engine of perl/SA, so please correct me if I'm wrong. The /\p{Emoticons}/ syntax seems to me to be a builtin feature of the regex spec/perl (as opposed to pseudo-code, displaying something that doesn't actually exist).

Can someone explain why SA cannot support this type of syntax, or what would be needed to get it supported? IMHO it makes it a lot easier for end-users to understand a rule, and for rule developers to write or even contribute new UTF-8-related rules, so it might be worth the effort to get it supported?

Thanks in advance,
Tom
Re: Detect Emoticons in Subject: CHAOS
On 2021-05-20 22:33, Clive Jacques wrote:

> Here is a good example of such an email (attached, stripped of identifying info).

"This attachment is suspicious because its type doesn't match the type declared in the message. If you do not trust the sender, you shouldn't open it in the browser because it may contain malicious contents. Expected: text/plain (.txt); found: message/rfc822 (.eml)"

Should I ignore Roundcube warnings? :)
Re: Detect Emoticons in Subject: CHAOS
On Thu, 20 May 2021 15:35:21 -0400 Jared Hall wrote:

> Clive Jacques wrote:
> > # Local Rule for Emoticons in subject
> > subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
>
> The following regex will detect a good amount of Emojis:
>
> /[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]/ug

That doesn't work in SA for the same reason that \p{Emoticons} doesn't work.
Re: Detect Emoticons in Subject: CHAOS
Clive Jacques wrote:

> Hi,
>
> I've been using SA a long time. Lately, I'm getting more and more spam with emoticons in the subject line. I'd say about 90% of my emails with emoticons in the subject are spam. I'd like to create a local rule which scores email with emoticons in the subject.
>
> I saw a previous discussion on this in the archive, but it was focused on whether such emails were *always* spam. I think an emoticon rule, in combination with other rules, will help my installation.
>
> I've tried to match as follows, but it won't lint. I'm not really a Perl programmer. I've written several other more conventional local rules, but here I'm a bit out of my depth. I'd appreciate some guidance.
>
> # Local Rule for Emoticons in subject
> subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
> score EMOTICON_IN_SUBJECT 3.0
> describe EMOTICON_IN_SUBJECT Subject Line Has Emoticons
>
> -CJ

The following regex will detect a good amount of Emojis:

/[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]/ug

Ref: https://stackoverflow.com/questions/43242440/javascript-unicode-emoji-regular-expressions/45138005#45138005

But it is not the greatest thing if you want to get a count out of that.

However, I may have a solution for you with the CHAOS plugin: https://github.com/telecom2k3/CHAOS

You can get (but shouldn't) Emojis even in From names, like this actual one: DHL☺com

CHAOS will also help you with Unicode character spoofs, via its UniBabble rulesets:

ᴀмαzσи ᴘ𝔯𝔦𝔪ё 𝘼𝔪𝔞𝘻𝙤𝘯 𝘾𝘶𝘴𝙩𝙤𝘮𝘦𝘳 𝙎𝔢𝘳𝙫𝘪𝘤𝔢 Amαzoɴ Priⅿë 🅰🅼🅰🆉🅾🅽 🆂🅴🆁🆅🅸🅲🅴 𝐀𝐦𝐚𝐳𝐨𝐧 𝐍𝐨𝐭𝐢𝐜𝐞 ... ...

CHAOS will run on Perl 5.18 and later.

-- Jared Hall
Re: Detect Emoticons in Subject
That's fine - I'm not saying all email containing emojis in the subject (or elsewhere) *is* spam - just that it's uncommon and right now, about 90% of the time it is *for me*. I just want to score it as part of the greater constellation of factors (just like DKIM, SPF etc.).

On Thu, May 20, 2021 at 2:48 PM Bill Cole <sausers-20150...@billmail.scconsult.com> wrote:

> People send wanted mail with all sorts of weirdness.
Re: Detect Emoticons in Subject
On 2021-05-20 at 13:44:43 UTC-0400 (Thu, 20 May 2021 18:44:43 +0100) RW is rumored to have said:

> On Thu, 20 May 2021 18:30:03 +0100 RW wrote:
>
> > Try this:
> >
> > header EMOTICON_IN_SUBJECT Subject =~ /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
>
> Actually that's only the original block, but it probably works most of the time

Not so sure about that... I regularly get mail from Patreon with emoji in the encoded header which don't match that pattern:

# grep '^Subject: ' /tmp/ham |cut -d? -f4 |decode-base64 |hexdump -C
0000  f0 9f 8e 89 20 50 61 74 72 69 63 6b 20 57 61 72  | Patrick War|
0010  64 6c 65 20 6a 75 73 74 20 73 68 61 72 65 64 20  |dle just shared |
0020  22 f0 9f 93 9d 20 4e                             |" N|
0027

People send wanted mail with all sorts of weirdness.

Looking at the full set (https://www.unicode.org/emoji/charts/full-emoji-list.html) I can understand why \p{Emoticons} would be so much better than trying to define them all in a regex of hex bytes in UTF-8 form.

--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
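To check why the Patreon subject slips past the first pattern, the bytes from the hexdump can be decoded and compared against the Emoticons block (U+1F600 to U+1F64F). This is a small Python sketch for illustration only; the variable names are made up and the byte values are copied from the dump above.

```python
# Bytes copied from the hexdump in the message above (0x27 = 39 bytes).
raw = bytes.fromhex(
    "f0 9f 8e 89 20 50 61 74 72 69 63 6b 20 57 61 72 "
    "64 6c 65 20 6a 75 73 74 20 73 68 61 72 65 64 20 "
    "22 f0 9f 93 9d 20 4e"
)
text = raw.decode("utf-8")

# Collect the non-ASCII code points and test each against the
# Emoticons block, which is all the first pattern covers.
code_points = [ord(ch) for ch in text if ord(ch) > 0x7F]
in_emoticons_block = [0x1F600 <= cp <= 0x1F64F for cp in code_points]

for cp, inside in zip(code_points, in_emoticons_block):
    print(f"U+{cp:04X} in Emoticons block: {inside}")
```

Both characters decode to code points below U+1F600, so a rule limited to the Emoticons block cannot match them.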
Re: Detect Emoticons in Subject
On Thu, 20 May 2021 19:26:30 +0100 RW wrote:

> On Thu, 20 May 2021 18:44:43 +0100 RW wrote:
>
> > On Thu, 20 May 2021 18:30:03 +0100 RW wrote:
> >
> > > Try this:
> > >
> > > header EMOTICON_IN_SUBJECT Subject =~ /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
> >
> > Actually that's only the original block, but it probably works most of the time
>
> This extends it to Supplemental Symbols and Pictographs and adds the three original faces from Miscellaneous Symbols
>
> /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xFF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/
>
> it also fixes a minor problem with the continuation bytes in the original.

I still didn't get the continuation bytes right; I forgot that bit 6 is always 0. It's a long time since I've done this.

/\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xBF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/
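The point about bit 6 can be verified mechanically: a UTF-8 continuation byte has the bit pattern 10xxxxxx, which limits it to 0x80 through 0xBF. A minimal Python sketch, with a hypothetical helper name, enumerates the allowed values and cross-checks them against real encodings from the Emoticons block.

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have bit 7 set and bit 6 clear (10xxxxxx)."""
    return (byte & 0b1100_0000) == 0b1000_0000


# Every byte value that satisfies the bit pattern.
continuation_values = [b for b in range(256) if is_continuation(b)]
low, high = continuation_values[0], continuation_values[-1]
print(f"{low:#04x}-{high:#04x}")  # 0x80-0xbf

# Cross-check: the trailing bytes of every code point in the Emoticons
# block (U+1F600..U+1F64F) must all be continuation bytes.
trailing = {
    b
    for cp in range(0x1F600, 0x1F650)
    for b in chr(cp).encode("utf-8")[1:]
}
all_ok = all(is_continuation(b) for b in trailing)
print(all_ok)
```

That is why the upper bound in the corrected character classes is \xBF rather than \xFF.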
Re: Detect Emoticons in Subject
On Thu, 20 May 2021 18:44:43 +0100 RW wrote:

> On Thu, 20 May 2021 18:30:03 +0100 RW wrote:
>
> > Try this:
> >
> > header EMOTICON_IN_SUBJECT Subject =~ /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
>
> Actually that's only the original block, but it probably works most of the time

This extends it to Supplemental Symbols and Pictographs and adds the three original faces from Miscellaneous Symbols:

/\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xFF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/

It also fixes a minor problem with the continuation bytes in the original.
Re: Detect Emoticons in Subject
On Thu, 20 May 2021 18:30:03 +0100 RW wrote:

> Try this:
>
> header EMOTICON_IN_SUBJECT Subject =~ /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/

Actually that's only the original block, but it probably works most of the time.
Re: Detect Emoticons in Subject
On Thu, 20 May 2021 18:34:54 +0200 Bert Van de Poel wrote:

> We've started getting lots of spam with emoji in the subject too the
> past few weeks, so I've looked into this as well. As mentioned by RW,
> you would need to create some kind of UTF8 regex header Subject rule.
> As I'm not too excited about writing such a regex, it's way at the
> bottom of my todo list to contemplate whether an SA plugin could be
> written for that and to then reach out to the SA developers to see
> whether that would be something upstream would accept. But honestly,
> I won't be able to any time soon (I don't have the time). Still,
> thought I'd mention it, since it might be relevant to your question.
> If you do end up figuring out a regex that works out and isn't an
> extreme length, I think plenty of people on this list would love to
> know!

Try this:

header EMOTICON_IN_SUBJECT Subject =~ /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
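The pattern can be tried outside SpamAssassin before it goes into a rule. The sketch below uses Python's byte-oriented `re` as a stand-in for SA's byte matching, and it assumes the last range was meant to read \x99[\x00-\x8F], with a backslash before x8F. The `has_emoticon` helper and the sample subjects are made up for illustration.

```python
import re

# The rule's pattern as a Python bytes regex. Assumption: the final range
# is written as [\x00-\x8F], i.e. with the backslash before x8F.
EMOTICON_RE = re.compile(rb"\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])")


def has_emoticon(subject: str) -> bool:
    """Return True if the UTF-8 bytes of `subject` match the pattern."""
    return EMOTICON_RE.search(subject.encode("utf-8")) is not None


print(has_emoticon("Hello \U0001F600"))     # U+1F600, first in the Emoticons block
print(has_emoticon("Hello \U0001F64F"))     # U+1F64F, last in the Emoticons block
print(has_emoticon("Plain ASCII subject"))  # no multi-byte characters at all
```

Testing on sample bytes like this is only a rough check; the rule itself should still be linted and run against real mail in SA.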
Re: Detect Emoticons in Subject
On Thu, 2021-05-20 at 18:34 +0200, Bert Van de Poel wrote:

> We've started getting lots of spam with emoji in the subject too the
> past few weeks, so I've looked into this as well. As mentioned by RW,
> you would need to create some kind of UTF8 regex header Subject rule. As
> I'm not too excited about writing such a regex, it's way at the bottom
> of my todo list

Should be easy enough - IsASCII is just a name for [\x00-\x7f] and IsXDigit is [0-9a-fA-F], so the same logic can be applied to define a regex that triggers on any character within the three Unicode emoji ranges. See Wikipedia for more detail:

https://en.wikipedia.org/wiki/Emoticon#Unicode

I haven't yet seen any emojis in Subject lines, regardless of whether the message was spam or not, or I'd probably have already written such a rule and given it a minimal score so it can be used in a more spam-specific meta rule.

Martin
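Turning a code point range into a byte-level character class starts with the UTF-8 encoding of its first and last code points. The Python sketch below prints those bounds. It assumes the three ranges meant here are the Emoticons block, Supplemental Symbols and Pictographs, and the three faces U+2639 to U+263B targeted elsewhere in this thread; the helper name is hypothetical.

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of a code point as spaced hex bytes."""
    return " ".join(f"{b:02X}" for b in chr(code_point).encode("utf-8"))


# Assumed ranges (first and last code point of each).
RANGES = {
    "Emoticons": (0x1F600, 0x1F64F),
    "Supplemental Symbols and Pictographs": (0x1F900, 0x1F9FF),
    "Miscellaneous Symbols faces": (0x2639, 0x263B),
}

bounds = {name: (utf8_hex(lo), utf8_hex(hi)) for name, (lo, hi) in RANGES.items()}

for name, (first, last) in bounds.items():
    print(f"{name}: {first} .. {last}")
```

Each pair of bounds shows which leading bytes are fixed and which trailing bytes need a range, which is the information needed to write the hex character classes by hand.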
Re: Detect Emoticons in Subject
We've started getting lots of spam with emoji in the subject too the past few weeks, so I've looked into this as well. As mentioned by RW, you would need to create some kind of UTF8 regex header Subject rule. As I'm not too excited about writing such a regex, it's way at the bottom of my todo list to contemplate whether an SA plugin could be written for that and to then reach out to the SA developers to see whether that would be something upstream would accept. But honestly, I won't be able to any time soon (I don't have the time). Still, thought I'd mention it, since it might be relevant to your question. If you do end up figuring out a regex that works out and isn't an extreme length, I think plenty of people on this list would love to know!

Bert

On 20/05/2021 18:19, RW wrote:

> On Thu, 20 May 2021 11:42:59 -0400 Clive Jacques wrote:
>
> > Hi,
> >
> > I've been using SA a long time. Lately, I'm getting more and more spam with emoticons in the subject line. I'd say about 90% of my emails with emoticons in the subject are spam. I'd like to create a local rule which scores email with emoticons in the subject.
> >
> > # Local Rule for Emoticons in subject
> > subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
>
> The rule should start with "header", that's what's causing the lint failure.
>
> However, AFAIK, the rule still won't work because \p{Emoticons} isn't supported in spamassassin, which works on byte sequences. You need to rewrite it to match UTF-8 bytes.
Re: Detect Emoticons in Subject
On Thu, 20 May 2021 11:42:59 -0400 Clive Jacques wrote:

> Hi,
>
> I've been using SA a long time. Lately, I'm getting more and more
> spam with emoticons in the subject line. I'd say about 90% of my
> emails with emoticons in the subject are spam. I'd like to create a
> local rule which scores email with emoticons in the subject.
>
> # Local Rule for Emoticons in subject
> subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/

The rule should start with "header", that's what's causing the lint failure.

However, AFAIK, the rule still won't work because \p{Emoticons} isn't supported in spamassassin, which works on byte sequences. You need to rewrite it to match UTF-8 bytes.
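What "match UTF-8 bytes" means can be seen by encoding a single emoji and looking at the bytes a byte-oriented matcher receives. This is a Python sketch for illustration, not SpamAssassin code; the subject line is invented and the pattern covers just one character.

```python
import re

# Hypothetical subject line containing U+1F600 (GRINNING FACE).
subject = "Big sale today \U0001F600"

# What a byte-oriented matcher sees: the UTF-8 encoding of the header.
raw = subject.encode("utf-8")
emoji_bytes = "\U0001F600".encode("utf-8")
print(" ".join(f"{b:02x}" for b in emoji_bytes))  # f0 9f 98 80

# A byte-level pattern for that one character, written with a hex
# escape for each UTF-8 byte.
pattern = re.compile(rb"\xF0\x9F\x98\x80")
found = pattern.search(raw) is not None
print(found)
```

A rule for a whole block follows the same idea, with ranges in place of the last one or two fixed bytes.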
Detect Emoticons in Subject
Hi,

I've been using SA a long time. Lately, I'm getting more and more spam with emoticons in the subject line. I'd say about 90% of my emails with emoticons in the subject are spam. I'd like to create a local rule which scores email with emoticons in the subject.

I saw a previous discussion on this in the archive, but it was focused on whether such emails were *always* spam. I think an emoticon rule, in combination with other rules, will help my installation.

I've tried to match as follows, but it won't lint. I'm not really a Perl programmer. I've written several other more conventional local rules, but here I'm a bit out of my depth. I'd appreciate some guidance.

# Local Rule for Emoticons in subject
subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
score EMOTICON_IN_SUBJECT 3.0
describe EMOTICON_IN_SUBJECT Subject Line Has Emoticons

-CJ