>It's early morning for me, and I'm still at least a liter of Diet Mountain Dew
>away from being sufficiently caffeinated to be positive, but that looks like
>"not totally correct, but a lot closer than what we have now".
>
>In particular, that will accept overlong and illegal utf-8 codepoints, and
>probably misbehaves in strange and unusual non-ascii/non-utf-8 things
>like iso2022-jp.

So, the DETAILS are complicated.

The address parser code is used for a lot of things. The specific bug report was about a draft message that contained Cyrillic characters. We know what the character set was in THAT case, because it's a draft message and we can derive the locale from the environment or the nmh locale setting. But if we are processing an email message, then we don't easily know the character set. In theory it should be either us-ascii or utf-8, but reality sometimes intrudes and it could be anything.

I think that instead of using ctype macros, we should really be using a specific set of macros tailored for email addresses. Or a flex lexer designed to process those things.

I kind of think that we should simply pass the input along as we are given it, rather than trying to validate that it is (for example) valid UTF-8. iso2022-jp is SO complicated that I don't think we should even try, and I get the sense everyone is migrating to UTF-8 for email anyway.

--Ken
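
To make the "macros tailored for email addresses" idea concrete, here is a rough sketch of what locale-independent classifiers could look like. It is only an illustration, not existing nmh code: the names addr_is8bit, addr_isspecial, addr_isatext and addr_atom_len are all made up, the character sets are the RFC 5322 "specials" and "atext" sets, and the choice to treat every byte with the high bit set as an ordinary atom character (passed through, never validated) is an assumption taken from the pass-it-along suggestion above. It also assumes an ASCII-compatible execution character set.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers for illustration; none of these names exist in nmh. */

/* Bytes with the high bit set are treated as opaque data: no attempt is
 * made to decide whether they form valid UTF-8 (or anything else). */
static int addr_is8bit(unsigned char c)
{
    return c >= 0x80;
}

/* RFC 5322 "specials": delimiters in an address, whatever the locale says. */
static int addr_isspecial(unsigned char c)
{
    return c != '\0' && strchr("()<>[]:;@\\,.\"", c) != NULL;
}

/* RFC 5322 "atext", extended (as an assumption) to accept any 8-bit byte.
 * Unlike isalpha()/isalnum(), the result never depends on the locale. */
static int addr_isatext(unsigned char c)
{
    if (addr_is8bit(c))
        return 1;
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
        (c >= '0' && c <= '9'))
        return 1;
    return c != '\0' && strchr("!#$%&'*+-/=?^_`{|}~", c) != NULL;
}

/* Number of bytes in the atom that starts at s (0 if s does not start
 * with an atom character). */
static size_t addr_atom_len(const char *s)
{
    size_t n = 0;

    while (addr_isatext((unsigned char) s[n]))
        n++;
    return n;
}
```

With these, addr_atom_len("ken@example.com") stops at the '@', and a Cyrillic display name is scanned as a single atom byte by byte regardless of the current locale. A real parser would still need quoted strings, comments and RFC 2047 encoded words, none of which this covers.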