On Wed, Dec 01, 2004 at 01:28:05PM -0800, Gisle Aas wrote:
> As you probably know perl's version of UTF-8 is not the real thing.  I
> thought I would hack up a patch to support the encoding as defined by
> Unicode.  That involves rejecting illegal chars (like surrogates,
> "\x{FFFF}" and "\x{FDD0}"), chars above 0x10FFFF, overlong sequences
> and such.

It's worth remembering that overlong sequences are a potential security risk.
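
To make the risk concrete: a lax decoder will accept the two bytes C0 AF as
"/", so a check that only looks for the byte 0x2F can be bypassed. Below is a
minimal sketch of how such sequences could be spotted at the byte level. The
helper name has_overlong is hypothetical (it is not part of Encode), and the
sketch covers only the 2-, 3- and 4-byte forms, not the longer forms that
perl's extended encoding allows.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Encode: returns 1 if the byte string
# contains a byte pair that can only begin an overlong UTF-8 sequence.
sub has_overlong {
    my ($bytes) = @_;
    return $bytes =~ m{
          [\xC0\xC1]           # 2-byte form of a code point below 0x80
        | \xE0[\x80-\x9F]      # 3-byte form of a code point below 0x800
        | \xF0[\x80-\x8F]      # 4-byte form of a code point below 0x10000
    }x ? 1 : 0;
}

print has_overlong("\xC0\xAF") ? "overlong\n" : "ok\n";   # overlong "/"
print has_overlong("\xC3\xA9") ? "overlong\n" : "ok\n";   # U+00E9, shortest form
```

In well-formed UTF-8 the bytes C0 and C1 never appear, E0 is always followed
by A0..BF, and F0 by 90..BF, so this pattern should not fire on valid input.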

> Before I do this I would like to get some feedback on the interface.
> My preferred interface would be to make:
> 
>    encode("UTF-8", $string)
> 
> imply the official restricted form

I think that would be best.

> and then have
> 
>    encode("UTF-8-Perl", $string)
> 
> be used as the name for Perl's relaxed and extended version of the
> encoding.  The encode_utf8($string) function would continue to be the
> same as encode("UTF-8-Perl", $string).

Isn't there a standard name for the 'unrestricted' encoding?
(Might be an IETF RFC rather than a Unicode standard.)

> This implies that encode("UTF-8", $string) can start failing while
> previously it could not.

Anyone working with valid UTF-8 would not get failures.
Anyone who thinks they're using valid UTF-8 but aren't should be grateful!
Anyone not using valid UTF-8 (e.g. using it as a way to encode integers)
needs to be told in advance - but I doubt there are many and they're
likely to be clueful users who read release notes :)

I'd say "UTF-8" should mean the official restricted form for perl 5.10.
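
As a rough illustration of what "restricted" would mean per code point, here
is a sketch that follows the categories in Gisle's list above (surrogates,
noncharacters such as U+FFFF and U+FDD0, and anything above 0x10FFFF). The
function name is_strict_codepoint is hypothetical, and whether noncharacters
end up rejected is an assumption taken from that list, not a statement about
what Encode does.

```perl
use strict;
use warnings;

# Hypothetical predicate: 1 if $cp would be allowed in the restricted form
# as described in the quoted proposal, 0 otherwise.
sub is_strict_codepoint {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;                    # beyond the Unicode range
    return 0 if $cp >= 0xD800 && $cp <= 0xDFFF;    # UTF-16 surrogates
    return 0 if $cp >= 0xFDD0 && $cp <= 0xFDEF;    # noncharacter block
    return 0 if ($cp & 0xFFFE) == 0xFFFE;          # U+nFFFE and U+nFFFF
    return 1;
}

for my $cp (0x41, 0xD800, 0xFDD0, 0xFFFF, 0x110000) {
    printf "U+%04X %s\n", $cp, is_strict_codepoint($cp) ? "ok" : "rejected";
}
```

Code that uses the encoding to carry arbitrary integers would hit the first
test as soon as a value exceeds 0x10FFFF.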

The only remaining issues are then what to do for 5.8.7
and what to call the unrestricted encoding.

Tim.
