On Wed, Dec 01, 2004 at 01:28:05PM -0800, Gisle Aas wrote:
> As you probably know perl's version of UTF-8 is not the real thing. I
> thought I would hack up a patch to support the encoding as defined by
> Unicode. That involves rejecting illegal chars (like surrogates,
> "\x{FFFF}" and "\x{FDD0}), chars above 0x10FFFF, overlong sequences
> and such.

It's worth remembering that overlong sequences are a potential
security risk.

> Before I do this I would like to get some feedback on the interface.
> My prefered interface would be to make:
>
>    encode("UTF-8", $string)
>
> imply the official restricted form

I think that would be best.

> and then have
>
>    encode("UTF-8-Perl", $string)
>
> be used as the name for Perl's relaxed and extended version of the
> encoding. The encode_utf8($string) function would continue to be the
> same as encode("UTF-8-Perl", $string).

Isn't there a standard name for the 'unrestricted' encoding?
(Might be an IETF RFC rather than a Unicode standard.)

> This implies that encode("UTF-8", $string) can start failing while
> previously it could not.

Anyone working with valid UTF-8 would not get failures.
Anyone who thinks they're using valid UTF-8 but aren't should be grateful!
Anyone not using valid UTF-8 (e.g. using it as a way to encode integers)
needs to be told in advance - but I doubt there are many and they're
likely to be clueful users who read release notes :)

I'd say "UTF-8" should mean the official restricted form for perl 5.10.
The only remaining issues are then what to do for 5.8.7 and what to call
the unrestricted encoding.

Tim.
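
For anyone wondering why the overlong sequences mentioned above matter, here is a minimal sketch. It is pure Perl and deliberately does not use Encode, so it makes no claim about what any Encode release accepts. The helper names naive_decode_2byte and strict_decode_2byte are hypothetical, and both assume the input is exactly one lead byte followed by one continuation byte.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a two-byte UTF-8 sequence naively, with no
# check that the shortest possible form was used. Assumes the input is a
# 110xxxxx lead byte followed by a 10xxxxxx continuation byte.
sub naive_decode_2byte {
    my ($bytes) = @_;
    my ($b1, $b2) = unpack("C2", $bytes);
    return (($b1 & 0x1F) << 6) | ($b2 & 0x3F);
}

# Hypothetical helper: same decoding, but reject overlong forms.
# Code points below 0x80 must be encoded as a single byte.
sub strict_decode_2byte {
    my ($bytes) = @_;
    my $cp = naive_decode_2byte($bytes);
    die sprintf("overlong encoding of U+%04X\n", $cp) if $cp < 0x80;
    return $cp;
}

my $overlong_slash = "\xC0\xAF";    # overlong form of "/" (U+002F)

# A byte-level filter looking for "/" does not find one here...
my $filter_sees_slash = index($overlong_slash, "/") >= 0 ? 1 : 0;

# ...yet the naive decoder still produces "/".
my $naive = chr(naive_decode_2byte($overlong_slash));

# The strict decoder refuses the sequence.
my $strict_ok = eval { strict_decode_2byte($overlong_slash); 1 } ? 1 : 0;

print "filter sees slash: $filter_sees_slash\n";
print "naive decoder yields: $naive\n";
print "strict decoder accepts: $strict_ok\n";
```

The point of the sketch is only that a check done on the raw bytes and a lax decoder can disagree about whether a "/" is present, which is the sort of gap a restricted "UTF-8" would close.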