Gisle Aas wrote:
> Perl use the UTF8_ALLOW_ANYUV mask in functions that should not be
> restricted to only the valid Unicode code points.  For some reason
> this mask currently include the UTF8_ALLOW_LONG flag.  This seems
> totally wrong as there can't be a good reason to allow overlong
> sequences just because we don't want to restrict the valid values.
>
> Perl's ord() function is for instance perfectly happy with an overlong NUL:
>
>   $ perl -MEncode -wle '$a = "\xe0\x80\x80";Encode::_utf8_on($a);print ord($a)'
>   0
>
> This patch fixes this problem:

Thanks, applied as change #23632 to bleadperl (although I'm not sure I
fully understand all the implications.)

> --- utf8.h.cur  2004-12-06 11:16:52.176181667 +0100
> +++ utf8.h      2004-12-06 11:17:16.672129909 +0100
> @@ -183,8 +183,7 @@
>  #define UTF8_ALLOW_FFFF                 0x0040  /* Allows also FFFE. */
>  #define UTF8_ALLOW_LONG                 0x0080
>  #define UTF8_ALLOW_ANYUV                (UTF8_ALLOW_EMPTY|UTF8_ALLOW_FE_FF|\
> -                                        UTF8_ALLOW_SURROGATE|\
> -                                        UTF8_ALLOW_FFFF|UTF8_ALLOW_LONG)
> +                                        UTF8_ALLOW_SURROGATE|UTF8_ALLOW_FFFF)
>  #define UTF8_ALLOW_ANY                  0x00FF
>  #define UTF8_CHECK_ONLY                 0x0200
>
>
> With this patch the example above outputs:
>
>   $ perl -MEncode -wle '$a = "\xe0\x80\x80";Encode::_utf8_on($a);print ord($a)'
>   Malformed UTF-8 character (3 bytes, need 1, after start byte 0xe0) in ord at -e line 1.
>   0

Could you turn this into a regression test?

-- 
You probably wouldn't have expected a communist to have a dog named Harpo.
    -- Malcolm Lowry, Under the Volcano
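
For anyone else unsure what "overlong" means here, a minimal standalone C sketch of the check may help. It is not Perl's decoder and does not use anything from utf8.h; the names min_utf8_len() and is_overlong() are hypothetical, and it only handles sequences of up to 4 bytes. The idea is that "\xe0\x80\x80" is a well-formed 3-byte sequence whose payload bits decode to 0, a value that fits in 1 byte, which matches the "3 bytes, need 1" wording of the warning above.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Perl's API: the number of bytes the
 * shortest UTF-8 encoding of cp needs (sequences up to 4 bytes only). */
static size_t min_utf8_len(unsigned long cp)
{
    if (cp < 0x80UL)    return 1;
    if (cp < 0x800UL)   return 2;
    if (cp < 0x10000UL) return 3;
    return 4;
}

/* Hypothetical helper: returns 1 if s[0..len-1] starts with a
 * structurally valid sequence that uses more bytes than its value
 * needs, 0 if the sequence is not overlong, and -1 if it is malformed
 * in some other way (bad lead byte, truncated, bad continuation). */
static int is_overlong(const unsigned char *s, size_t len)
{
    size_t need, i;
    unsigned long cp;

    if (len == 0)
        return -1;

    /* The lead byte determines the sequence length and the first bits. */
    if (s[0] < 0x80)                { need = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else
        return -1;  /* stray continuation byte or unsupported lead byte */

    if (len < need)
        return -1;  /* truncated sequence */

    /* Each continuation byte must look like 10xxxxxx and adds 6 bits. */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }

    /* Overlong: the decoded value would have fit in fewer bytes. */
    return min_utf8_len(cp) < need;
}
```

Calling is_overlong() on the three bytes 0xE0 0x80 0x80 reports an overlong sequence, while the ordinary 3-byte encoding of U+20AC (0xE2 0x82 0xAC) does not. With UTF8_ALLOW_LONG removed from UTF8_ALLOW_ANYUV, this is the class of input that callers passing that mask now get warned about.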