On Thu, 7 May 2020 11:39:07 -0700 (PDT)
John Hardin wrote:

> 100% 4-byte UTF8? That should be trivially easy to detect.
> 
> Comments solicited.
> 
>    body       __4BYTE_UTF8_WORD     /(?:\xf0\x9d[\x9a-\x9f][\x80-\xff]){3,10}/
>    tflags     __4BYTE_UTF8_WORD     multiple, maxhits=10
>    meta       SUSP_UTF8_WORD_MANY   __4BYTE_UTF8_WORD > 9
> 
> Potential FP for some languages because it's rather broad, it might
> be possible to narrow it to just the 4-byte math glyphs that render
> readable English text.

Actually it's not broad enough to cover even the mathematical
letters: [\x9a-\x9f] only starts at U+1D680.

This covers the whole Mathematical Alphanumeric Symbols block
(U+1D400 to U+1D7FF) without any overlap:

  /(?:\xf0\x9d[\x90-\x9f][\x80-\xbf]){3,10}/ 

It does include digits and Greek letters (the mathematical versions). 

Restricting the final byte to [\x80-\xbf], the only values valid for
a UTF-8 continuation byte, may help a bit in avoiding matches on text
that isn't actually UTF-8. It won't do any harm.
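
For reference, here is a rough sketch of the whole rule with the wider
pattern dropped in. The rule names and the tflags/meta logic are taken
from John's draft; the describe text and the score are placeholders of
my own and not tested against any corpus, so treat them as assumptions.

  # Match 3 to 10 consecutive characters from U+1D400 to U+1D7FF
  body      __4BYTE_UTF8_WORD    /(?:\xf0\x9d[\x90-\x9f][\x80-\xbf]){3,10}/
  tflags    __4BYTE_UTF8_WORD    multiple maxhits=10
  # Fire only when the subrule hit its maxhits limit of 10
  meta      SUSP_UTF8_WORD_MANY  __4BYTE_UTF8_WORD > 9
  # Placeholder description and score (assumptions, adjust to taste)
  describe  SUSP_UTF8_WORD_MANY  Many words in mathematical UTF-8 letters
  score     SUSP_UTF8_WORD_MANY  0.001

Starting with an informational score like that would let you see what
it hits before giving it any weight.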

I think the risk is mostly in matching actual mathematics. I doubt many
people go to the trouble of entering these characters in emails, but
they could turn up in something pasted into the body or found inside an
attachment (if you have the appropriate plugin).



 
