>So I got an e-mail from an Outlook abuser that had some UTF-8 smart
>quote characters in the Subject: line - sans RFC2047 encoding, just
>bare UTF-8 characters, naked as the day they were typed, plonked in the
>middle of the line.
>
>What *should* nmh do here (given that we don't have a way to tell it
>was UTF-8 versus an ISO8859-N or 2022 or what-have-you)?

Technically ... those are legal nowadays.  See RFC 6532.  That's a
message/global message.

What should we do?  We should deal with it.  I think we might not do so
well right now.

Okay, fine, what does 'deal with it' mean?  Well ... technically the
only valid 'raw' 8-bit characters in headers are UTF-8.  But I am aware
that some busticated MUAs still send raw 8-bit data in other character
sets.  I see two possible ways to deal with it better:

1) Assume any unencoded 8-bit characters in email headers are UTF-8.
   Treat them as UTF-8, which means converting to the local character
   set if necessary.  If it turns out those bytes are not UTF-8, then
   either they'll fail character conversion or end up as mojibake on a
   user's terminal (well, they'll probably end up as the Unicode
   replacement character, U+FFFD).

2) Do 1), except check first to see if all of the 8-bit sequences are
   valid UTF-8 encoding (it's possible for an arbitrary sequence of
   8-bit characters to be a valid UTF-8 encoded sequence, but very
   unlikely).  If it's all valid, treat as 1).  Otherwise use
   substitution characters for everything 8-bit.

--Ken

_______________________________________________
Nmh-workers mailing list
https://lists.nongnu.org/mailman/listinfo/nmh-workers
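
For illustration only, here is a rough sketch of how options 1) and 2)
differ on the same bytes.  It is Python rather than nmh's C, and the
function names and the '?' substitution character are made up for the
example; nothing here is existing nmh code.  Conversion to the local
character set is left out.

```python
def decode_option1(raw: bytes) -> str:
    """Option 1: assume any 8-bit bytes are UTF-8.

    Bytes that are not valid UTF-8 come out as U+FFFD.
    """
    return raw.decode("utf-8", errors="replace")


def decode_option2(raw: bytes, substitute: str = "?") -> str:
    """Option 2: validate the whole header value as UTF-8 first.

    If everything is valid, behave like option 1.  Otherwise replace
    every 8-bit byte with a substitution character (assumed to be '?'
    here) and keep the 7-bit bytes as they are.
    """
    try:
        return raw.decode("utf-8")  # strict: raises on any bad sequence
    except UnicodeDecodeError:
        return "".join(chr(b) if b < 0x80 else substitute for b in raw)


if __name__ == "__main__":
    samples = [
        "It\u2019s here".encode("utf-8"),   # bare UTF-8 smart quote
        "caf\u00e9".encode("latin-1"),      # raw ISO-8859-1 byte
        b"\xe2\x80\x99 and \xe9",           # valid UTF-8 mixed with a stray byte
    ]
    for raw in samples:
        print(repr(raw), repr(decode_option1(raw)), repr(decode_option2(raw)))
```

The third sample shows the trade-off: option 1 keeps the smart quote
and only mangles the stray byte, while option 2 gives up on every 8-bit
byte in the header as soon as one sequence fails validation.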
