Re: real UTF-8 vs. utf8n_to_uvuni()

Gisle Aas Thu, 09 Dec 2004 07:41:36 -0800

Dan Kogai <[EMAIL PROTECTED]> writes:

> --- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
> +++ perl-5.8.x.dan/utf8.c       Sun Dec  5 11:38:52 2004
> @@ -429,6 +429,13 @@
>          }
>          else
>              uv = UTF8_ACCUMULATE(uv, *s);
> +       /* Checks if ord() > 0x10FFFF -- dankogai */
> +       if (uv > PERL_UNICODE_MAX){
> +           if (!(flags & UTF8_ALLOW_LONG)) {
> +               warning = UTF8_WARN_LONG;
> +               goto malformed;
> +           }
> +       }
>          if (!(uv > ouv)) {
>              /* These cannot be allowed. */
>              if (uv == ouv) {


I think this patch is wrong, since UTF8_ALLOW_LONG is about allowing
overlong sequences.  What we need is a UTF8_ALLOW_SUPER flag (matching
UNICODE_ALLOW_SUPER) that would indicate that code points past 0x10FFFF
should be allowed.  This is the flag that UTF8_ALLOW_ANYUV should
contain instead of UTF8_ALLOW_LONG.
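
To make the distinction concrete, here is a minimal standalone sketch
of the two checks kept apart.  It is not the utf8.c code: the SK_*
names, the flag values and the sk_check() helper are all made up for
illustration, and the caller is assumed to pass in the decoded value
together with the number of bytes it was encoded in.

```c
/* Hypothetical names and values; none of these come from utf8.h. */
#define SK_UNICODE_MAX  0x10FFFFUL
#define SK_ALLOW_LONG   0x01u   /* tolerate overlong encodings        */
#define SK_ALLOW_SUPER  0x02u   /* tolerate code points past 0x10FFFF */

enum sk_status { SK_OK = 0, SK_ERR_LONG, SK_ERR_SUPER };

/* Shortest UTF-8 style encoding length for a given value. */
static unsigned sk_min_len(unsigned long uv)
{
    if (uv < 0x80UL)       return 1;
    if (uv < 0x800UL)      return 2;
    if (uv < 0x10000UL)    return 3;
    if (uv < 0x200000UL)   return 4;
    if (uv < 0x4000000UL)  return 5;
    if (uv < 0x80000000UL) return 6;
    return 7;
}

/* Overlong and too-large are separate conditions with separate flags. */
enum sk_status sk_check(unsigned long uv, unsigned len, unsigned flags)
{
    if (len > sk_min_len(uv) && !(flags & SK_ALLOW_LONG))
        return SK_ERR_LONG;     /* more bytes than the value needs */
    if (uv > SK_UNICODE_MAX && !(flags & SK_ALLOW_SUPER))
        return SK_ERR_SUPER;    /* value itself is out of range */
    return SK_OK;
}
```

With this split, setting only the "long" flag does not let 0x110000
through, which is the behaviour the patch above would change.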

Unfortunately there is no more room for UTF8_ALLOW_* flags in the
UTF8_ALLOW_ANY space, so we would have to add some bits to this mask,
which gives us binary incompatibility with extensions that use the old
UTF8_ALLOW_ANY value.
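
A rough illustration of the compatibility problem, using invented
values rather than the real ones from utf8.h: an extension that was
built with the old "any" mask has that constant compiled in, so it
never sets a bit that is added to the mask later.

```c
/* Hypothetical values for illustration only; not copied from utf8.h. */
#define COMPAT_OLD_ALLOW_ANY   0x00FFu  /* mask baked into old binaries */
#define COMPAT_NEW_ALLOW_SUPER 0x0100u  /* new bit outside the old mask */
#define COMPAT_NEW_ALLOW_ANY   (COMPAT_OLD_ALLOW_ANY | COMPAT_NEW_ALLOW_SUPER)

/* Returns 1 if a caller passing `flags` gets code points past
 * 0x10FFFF accepted under the widened scheme, 0 otherwise. */
int compat_super_allowed(unsigned flags)
{
    return (flags & COMPAT_NEW_ALLOW_SUPER) != 0;
}
```

An old extension passing COMPAT_OLD_ALLOW_ANY gets 0 here even though
it asked for "anything", while a rebuilt one passing
COMPAT_NEW_ALLOW_ANY gets 1.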

The UTF8_ALLOW_FFFF flag should also allow 0x1FFFF, 0x2FFFF and so on,
as well as the 0xFFFE variants.  This would match the
UNICODE_ALLOW_FFFF behaviour.  Currently it only allows 0xFFFF.

The UTF8_ALLOW_FDD0 flag to match UNICODE_ALLOW_FDD0 is also missing,
but instead of introducing UTF8_ALLOW_FDD0 it seems better to collapse
the *_ALLOW_FFFF and *_ALLOW_FDD0 flags into a single *_ALLOW_ILLEGAL
and then make UNICODE_IS_ILLEGAL() match this.
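
As a sketch of what the collapsed test could look like (the ill_*
names and the flag value are hypothetical, not existing perl macros):
one predicate covers the 0xFDD0..0xFDEF range and the last two code
points of every plane, and one flag controls whether they are let
through.

```c
#include <stdint.h>

/* Hypothetical single flag replacing the separate FFFF and FDD0 flags. */
#define ILL_ALLOW_ILLEGAL 0x01u

/* 1 for Unicode noncharacters: 0xFDD0..0xFDEF, plus 0xnFFFE and
 * 0xnFFFF in every plane up to 0x10FFFF.  Values past 0x10FFFF are
 * a different case and are not reported here. */
int ill_is_noncharacter(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return 0;
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return 1;
    return (cp & 0xFFFE) == 0xFFFE;
}

/* Accept a code point unless it is a noncharacter and the flag is off. */
int ill_accept(uint32_t cp, unsigned flags)
{
    return !ill_is_noncharacter(cp) || (flags & ILL_ALLOW_ILLEGAL) != 0;
}
```

The mask test `(cp & 0xFFFE) == 0xFFFE` is what picks up 0x1FFFF,
0x2FFFE and friends without listing each plane.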

--Gisle
