On Wed, 2 Jul 2014, Philip Prindeville wrote:

Given that it’s text/plain with an implicit charset=“us-ascii” and an implicit content-transfer-encoding of 7bit, the sequence &#x[0-9A-F]{4} wouldn’t really parse into a 16-bit character, would it? It would be a broken MUA that made such a leap...

Nope. The content-transfer-encoding is only for the *transfer* part of the process. Once the content reaches the MUA, that content can be further parsed by the MUA according to other encoding rules, such as these escape sequences for Unicode characters. That's perfectly valid. How else would you send, for example, a c-cedilla in Spanish text via a 7-bit-clean channel?
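To illustrate the two-stage idea, here is a minimal Python sketch. It is not how any particular MUA does it; html.unescape from the standard library is just standing in for whatever decoder a client might apply after transfer, and decode_ncr is a made-up helper name.

```python
import html

def decode_ncr(text: str) -> str:
    """Decode numeric character references such as &#x00E7; into characters.

    html.unescape is a stand-in for a client-side decoder; note that it
    also decodes named references like &amp;, which may or may not be wanted.
    """
    return html.unescape(text)

# The wire form is pure 7-bit ASCII...
encoded = "Fran&#x00E7;ais"
print(encoded.isascii())

# ...and only becomes a non-ASCII character once decoded for display.
print(decode_ncr(encoded))
```

The point is only that the bytes on the wire stay 7-bit clean; what the receiving end does with them afterwards is a separate step.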

Wouldn’t that normally render as the character ‘&’, ‘#’, ‘x’, etc. rather than the unicode16 or UTF-8 character with that hex value?

I'd only expect that in a very old MUA (i.e. one that does not support Unicode), or when the raw message content is displayed at user request.

I wouldn’t want a message where someone gives a couple of examples of encoding &#x0400 for instance being flagged as SPAM, but if the text is 20% or more of these sequences then I would say that’s SPAM-sign.

That's valid 7-bit encoding for transfer. It's relying on the user's MUA to convert the encoded Unicode values to glyphs for display.

I would say it's less an encoding issue than a case of those characters not belonging in the message at all if the language is en-us. The presence of lots of those is either a sign that the text isn't English, or that it is obfuscated. How do you reliably tell the language of the message?
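For what it's worth, the "20% or more" idea from the quoted text could be sketched roughly like this in Python. The function names, the regex, and the exact threshold are assumptions for illustration only; this is not an existing SpamAssassin rule, and it does nothing about the language-detection question.

```python
import re

# Hex numeric character references, e.g. &#x0400 or &#x0412;
# (the trailing semicolon is treated as optional here).
NCR_RE = re.compile(r"&#[xX][0-9A-Fa-f]{2,6};?")

def ncr_fraction(text: str) -> float:
    """Fraction of the characters in text taken up by hex NCR sequences."""
    if not text:
        return 0.0
    covered = sum(len(m.group(0)) for m in NCR_RE.finditer(text))
    return covered / len(text)

def looks_obfuscated(text: str, threshold: float = 0.20) -> bool:
    """True when NCR sequences make up at least `threshold` of the text."""
    return ncr_fraction(text) >= threshold

benign = ("An example of the encoding is &#x0400 in an otherwise ordinary "
          "English sentence about Unicode escapes.")
spammy = "&#x0412;&#x0438;&#x0430;gra"

print(looks_obfuscated(benign), looks_obfuscated(spammy))
```

A message that merely quotes one or two sequences stays well under the threshold, while text built mostly out of them does not.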

It would probably be a good idea to add those sequences to the replacetags letter REs so that the FUZZY rules will catch them.
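As a rough sketch of what such a per-letter RE might look like, here is the idea in Python rather than in SpamAssassin's rule syntax. The helper names letter_re and word_re are hypothetical, the actual replacetags definitions are not reproduced here, and only plain ASCII letters are handled.

```python
import re

def letter_re(ch: str) -> str:
    """Regex fragment matching an ASCII letter written literally or as a
    hex/decimal numeric character reference (e.g. i, I, &#x69;, &#105;)."""
    lower, upper = ch.lower(), ch.upper()
    hex_codes = "|".join(f"{ord(c):X}" for c in (lower, upper))
    dec_codes = "|".join(str(ord(c)) for c in (lower, upper))
    return (
        "(?:"
        f"[{lower}{upper}]"
        f"|&#[xX]0*(?:{hex_codes});?"
        f"|&#0*(?:{dec_codes});?"
        ")"
    )

def word_re(word: str) -> re.Pattern:
    """Compile a pattern matching word with any letter optionally escaped."""
    if not (word.isascii() and word.isalpha()):
        raise ValueError("this sketch only handles ASCII letters")
    # IGNORECASE so hex digits like 6a / 6A both match.
    return re.compile("".join(letter_re(c) for c in word), re.IGNORECASE)

pattern = word_re("viagra")
print(bool(pattern.search("buy v&#x69;agra now")))
print(bool(pattern.search("buy vitamins")))
```

Each letter simply gains two extra alternatives, so an existing fuzzy-match pattern would keep matching the plain spelling as before.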

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Of the twenty-two civilizations that have appeared in history,
  nineteen of them collapsed when they reached the moral state the
  United States is in now.                          -- Arnold Toynbee
-----------------------------------------------------------------------
 2 days until the 238th anniversary of the Declaration of Independence
