On Thu, 7 May 2020 11:39:07 -0700 (PDT)
John Hardin wrote:

> 100% 4-byte UTF8? That should be trivially easy to detect.
>
> Comments solicited.
>
>   body    __4BYTE_UTF8_WORD    /(?:\xf0\x9d[\x9a-\x9f][\x80-\xff]){3,10}/
>   tflags  __4BYTE_UTF8_WORD    multiple, maxhits=10
>   meta    SUSP_UTF8_WORD_MANY  __4BYTE_UTF8_WORD > 9
>
> Potential FP for some languages because it's rather broad, it might
> be possible to narrow it to just the 4-byte math glyphs that render
> readable English text.

Actually it's not broad enough to cover even the mathematical letters.
This covers them all without any overlap:

  /(?:\xf0\x9d[\x90-\x9f][\x80-\xbf]){3,10}/

It does include digits and Greek letters (the mathematical versions).

Changing the continuation byte to [\x80-\xbf] may help a bit in avoiding
matches on text that isn't actually UTF-8. It won't do any harm.

I think the risk is mostly in matching actual mathematics. I doubt many
people go to the trouble of entering these characters in emails, but
they could turn up in something pasted into the body or found inside an
attachment (if you have the appropriate plugin).
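
For anyone who wants to check the byte ranges themselves, here is a rough Python sketch. It is not part of the rules; the names GLYPH, ORIGINAL, WORD and matching_code_points are made up for illustration. It encodes every code point that needs four UTF-8 bytes and records which ones each pattern matches, on the assumption that "the mathematical letters" means the Mathematical Alphanumeric Symbols block, U+1D400 to U+1D7FF.

```python
import re

# One glyph of the revised pattern (without the {3,10} repeat), as a bytes regex.
GLYPH = re.compile(rb"\xf0\x9d[\x90-\x9f][\x80-\xbf]")
# One glyph of the originally proposed pattern, for comparison.
ORIGINAL = re.compile(rb"\xf0\x9d[\x9a-\x9f][\x80-\xff]")
# The revised pattern with its repeat count, as it appears in the rule.
WORD = re.compile(rb"(?:\xf0\x9d[\x90-\x9f][\x80-\xbf]){3,10}")


def matching_code_points(pattern):
    """Return every code point whose UTF-8 encoding the pattern fully matches."""
    hits = []
    # Code points U+10000..U+10FFFF are exactly those encoded as four bytes.
    for cp in range(0x10000, 0x110000):
        if pattern.fullmatch(chr(cp).encode("utf-8")):
            hits.append(cp)
    return hits


revised = matching_code_points(GLYPH)
original = matching_code_points(ORIGINAL)

print(f"revised:  U+{revised[0]:X}..U+{revised[-1]:X} ({len(revised)} code points)")
print(f"original: U+{original[0]:X}..U+{original[-1]:X} ({len(original)} code points)")

# "Free" written in mathematical bold letters (U+1D405, U+1D42B, U+1D41E, U+1D41E).
sample = "\U0001D405\U0001D42B\U0001D41E\U0001D41E".encode("utf-8")
print("revised pattern matches sample:", WORD.search(sample) is not None)
print("original glyph matches sample: ", ORIGINAL.search(sample) is not None)
```

The revised pattern should come out as the whole block and nothing else, while the original only starts at U+1D680, so the bold, italic and script alphabets that spammers tend to use slip past it.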