Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Ken Hornstein
>It's early morning for me, and I'm still at least a liter of Diet Mountain Dew
>away from being sufficiently caffeinated to be positive, but that looks like
>"not totally correct, but a lot closer than what we have now".
>
>In particular, that will accept overlong and illegal UTF-8 sequences, and
>probably misbehaves in strange and unusual non-ascii/non-utf-8 things
>like iso2022-jp.

So, the DETAILS are complicated.

The address parser code is used for a lot of things.  The specific bug
report was about a draft message that contained Cyrillic characters.
We know what that character set was in THAT case, because it's a draft
message and we can derive the locale from the environment or the nmh
locale setting.  But if we are processing an email message then we don't
easily know the character set.  In theory it should either be us-ascii
or utf-8, but reality sometimes intrudes and it could be anything.

I really think that instead of using the ctype macros, we should be using
a specific set of macros tailored for email addresses, or a flex lexer
designed to process them.  I also think we should simply pass the
input along as it was given to us rather than trying to check that it
is valid UTF-8 (for example).  iso2022-jp is SO complicated that I don't
think we should even try, and I get the sense everyone is migrating to
UTF-8 for email anyway.
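
As a rough illustration of what "macros tailored for email addresses"
could look like, here is a minimal sketch.  The names addr_isspecial(),
addr_isspace() and addr_isword() are hypothetical and do not come from
the nmh sources.  The character classes follow the RFC 5322 "specials"
and "atext" definitions, and every byte at or above 0x80 is treated as
an ordinary word character and passed along without any validation,
which is an assumption about the desired behavior, not a settled design.

```c
#include <string.h>

/* RFC 5322 "specials": characters with structural meaning in an address.
 * The explicit NUL check is needed because strchr() would otherwise
 * match the terminator of the search string. */
static int addr_isspecial(unsigned char c)
{
    return c != '\0' && strchr("()<>[]:;@\\,.\"", c) != NULL;
}

/* Whitespace as far as address parsing is concerned; no locale lookup. */
static int addr_isspace(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* A "word" byte: printable ASCII that is not a special (this is the
 * RFC 5322 atext set), or any byte >= 0x80.  High bytes are passed
 * through unexamined, so UTF-8, ISO-8859-x and the like all survive,
 * but nothing checks that they form valid sequences. */
static int addr_isword(unsigned char c)
{
    if (c >= 0x80)
        return 1;
    return c > 0x20 && c != 0x7f && !addr_isspecial(c);
}
```

Because none of these consult the locale, the results are the same no
matter what LC_CTYPE is set to.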

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Valdis Klētnieks
On Wed, 02 Jun 2021 00:13:51 -0400, Ken Hornstein said:
> So this bug was reported yesterday:
>
>   https://savannah.nongnu.org/bugs/?60713

> I am wondering if the simplest solution is to put isascii() in front
> of those tests in that function.  We only really care about those tests
> returning "true" for ASCII characters.  Thoughts?

It's early morning for me, and I'm still at least a liter of Diet Mountain Dew
away from being sufficiently caffeinated to be positive, but that looks like
"not totally correct, but a lot closer than what we have now".

In particular, that will accept overlong and illegal UTF-8 sequences, and
probably misbehaves in strange and unusual non-ascii/non-utf-8 things
like iso2022-jp.

Personally, I'd just stick the isascii() in there and wait for a bug report
regarding the previous paragraph. :)
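
For illustration, a minimal sketch of what putting an ASCII check in
front of a ctype test amounts to.  The helper names are hypothetical
and not taken from the nmh sources, and isspace() is only a stand-in
for whichever tests the function actually uses.  The range check is
spelled out by hand because isascii() is a POSIX extension rather than
ISO C; it accepts the same 0x00..0x7f range.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: true only for the 7-bit range 0x00..0x7f,
 * the same range that POSIX isascii() accepts. */
static int is_ascii_byte(int c)
{
    return c >= 0 && c <= 0x7f;
}

/* Guarded test: isspace() is reached only for ASCII values, so a byte
 * belonging to a multibyte sequence is never classified as a space,
 * whatever the current locale says about it. */
static int ascii_isspace(int c)
{
    return is_ascii_byte(c) && isspace(c);
}

/* Example use: count leading whitespace in an address buffer.  The
 * loop stops at the first non-space byte, including the final NUL. */
static size_t skip_leading_space(const char *s)
{
    size_t n = 0;

    while (ascii_isspace((unsigned char)s[n]))
        n++;
    return n;
}
```

Note that this only keeps non-ASCII bytes away from the ctype tests.  It
does nothing to check that those bytes are well-formed, which is the
limitation described above.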




Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Ken Hornstein
>You need to read a bit further down, where POSIX says
>
>The c argument is an int, the value of which the application shall
>ensure is representable as an unsigned char or equal to the value of
>the macro EOF. If the argument has any other value, the behavior is
>undefined.

Oof, fair enough; I stand corrected!

--Ken



Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Tom Lane
Ken Hornstein writes:
>> The <ctype.h> macros are just fundamentally broken in any locale that
>> has multibyte characters: you cannot squeeze a multibyte character
>> into an input that is supposed to be either an "unsigned char" or EOF.
>> Vendors can choose either to violate the spec (say, by interpreting
>> the "int" input as a Unicode codepoint) or to produce useless results.

> It's worth pointing out that the official prototypes for the ctype macros
> all say they take "int" as an argument, and POSIX says they take a
> "character" as an argument.  So interpreting that argument as a Unicode
> codepoint (assuming you're currently in a Unicode locale) is, from my
> reading, within the spec.

You need to read a bit further down, where POSIX says

The c argument is an int, the value of which the application shall
ensure is representable as an unsigned char or equal to the value of
the macro EOF. If the argument has any other value, the behavior is
undefined.

(C99 has identical verbiage.)

The reason to declare the argument as int is so that these can take EOF,
which I suppose is meant to allow them to be applied directly to the
result of getc() ... though why anyone would write code that way is
not clear to me.  Anyway, interpreting the input as a Unicode code point,
for values above U+7F (or, if you stretch it unreasonably, U+FF) is
very clearly outside the spec.
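
A minimal sketch of the two calling patterns the spec text allows; the
function names are made up for the example.  When the input is a plain
char, which may be signed, it is converted through unsigned char first.
When the input comes from getc(), the result is already either an
unsigned char value or EOF, so it can be passed straight in.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Count alphabetic bytes in a string.  Where char is signed, a byte
 * such as 0xD0 becomes a negative int; passing that to isalpha() is
 * undefined, so convert through unsigned char before the call. */
static size_t count_alpha(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))
            n++;
    return n;
}

/* Count the run of alphabetic bytes at the start of a stream.  The
 * value from getc() is in range by construction, and isalpha(EOF)
 * is false, so end of file ends the loop without a separate test. */
static size_t count_leading_alpha(FILE *fp)
{
    size_t n = 0;

    for (;;) {
        int c = getc(fp);
        if (!isalpha(c))
            break;
        n++;
    }
    return n;
}
```

Neither pattern makes the result meaningful for a multibyte character;
each call still sees one byte at a time.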

regards, tom lane



Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread David Levine
Ken wrote:

> But it sounds to me like everyone is on board with sprinkling in
> some isascii() calls where it makes sense.

+1

David



Re: Bug reported regarding Unicode handling in email address

2021-06-02 Thread Ken Hornstein
>The <ctype.h> macros are just fundamentally broken in any locale that
>has multibyte characters: you cannot squeeze a multibyte character
>into an input that is supposed to be either an "unsigned char" or EOF.
>Vendors can choose either to violate the spec (say, by interpreting
>the "int" input as a Unicode codepoint) or to produce useless results.

It's worth pointing out that the official prototypes for the ctype macros
all say they take "int" as an argument, and POSIX says they take a
"character" as an argument.  So interpreting that argument as a Unicode
codepoint (assuming you're currently in a Unicode locale) is, from my
reading, within the spec.

But it sounds to me like everyone is on board with sprinkling in
some isascii() calls where it makes sense.

--Ken