Re: Bug reported regarding Unicode handling in email address

Ken Hornstein Wed, 02 Jun 2021 14:48:01 -0700

>It's early morning for me, and I'm still at least a liter of Diet Mountain Dew
>away from being sufficiently caffeinated to be positive, but that looks like
>"not totally correct, but a lot closer than what we have now".
>
>In particular, that will accept overlong and illegal utf-8 codepoints, and
>probably misbehaves in strange and unusual non-ascii/non-utf-8 things
>like iso2022-jp.


So, the DETAILS are complicated.

The address parser code is used for a lot of things.  The specific bug
report was about a draft message that contained Cyrillic characters.
We know what the character set was in THAT case, because it's a draft
message and we can derive the locale from the environment or the nmh
locale setting.  But if we are processing an email message then we don't
easily know the character set.  In theory it should either be us-ascii
or utf-8, but reality sometimes intrudes and it could be anything.
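
To make the draft-versus-received distinction concrete, here is a rough sketch in C. It is not nmh's actual code: `msg_source` and `assumed_charset` are hypothetical names made up for illustration, and it only consults the POSIX locale through `nl_langinfo(CODESET)`, ignoring any nmh-specific locale setting.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical: where the text handed to the address parser came from. */
enum msg_source { MSG_DRAFT, MSG_RECEIVED };

/*
 * Return the character set we can reasonably assume for the input,
 * or NULL when we cannot know it.
 */
static const char *assumed_charset(enum msg_source src)
{
    if (src == MSG_DRAFT) {
        /* A draft was written locally, so the user's locale applies. */
        setlocale(LC_CTYPE, "");
        return nl_langinfo(CODESET);
    }
    /* A received message could be in anything; make no claim. */
    return NULL;
}
```

The point of returning NULL for received mail is that the caller is forced to handle the "we don't know" case explicitly rather than silently assuming the local character set.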

I think that really, instead of using the ctype macros, we should be
using a set of macros tailored specifically for email addresses, or a
flex lexer designed to process them.  I kind of think that we should
simply pass the input along as it was given to us rather than trying to
validate it (as valid UTF-8, for example).  iso2022-jp is SO complicated
that I don't think we should even try, and I get the sense everyone is
migrating to UTF-8 for email anyway.
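
As a minimal sketch of what address-specific classification could look like, assuming we treat every byte with the high bit set as an opaque part of an atom and never validate it: the function names below (`addr_is_special`, `addr_is_8bit`, `addr_is_atext`, `addr_atom_len`) are hypothetical and not part of nmh. The ASCII rules follow the RFC 5322 definitions of "specials" and "atext"; nothing here depends on the locale the way the ctype macros do.

```c
#include <stddef.h>
#include <string.h>

/* RFC 5322 "specials": ASCII characters that delimit address tokens. */
static int addr_is_special(unsigned char c)
{
    return c != '\0' && strchr("()<>[]:;@\\,.\"", c) != NULL;
}

/* Any byte with the high bit set; passed along without validation. */
static int addr_is_8bit(unsigned char c)
{
    return c >= 0x80;
}

/*
 * Characters allowed inside an atom: printable ASCII that is not a
 * special, plus (by assumption) any 8-bit byte, whatever its encoding.
 */
static int addr_is_atext(unsigned char c)
{
    if (addr_is_8bit(c))
        return 1;
    if (c <= 0x20 || c == 0x7f)   /* NUL, controls, space, DEL */
        return 0;
    return !addr_is_special(c);
}

/* Length in bytes of the atom starting at s (0 if s does not start one). */
static size_t addr_atom_len(const char *s)
{
    size_t n = 0;

    while (addr_is_atext((unsigned char) s[n]))
        n++;
    return n;
}
```

With this, a Cyrillic local part encoded as UTF-8 scans as a single atom up to the `@`, and so would overlong or otherwise illegal sequences, which is the trade-off of passing input through rather than validating it.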

--Ken
