On Sat, Apr 06, 2002 at 01:08:11AM +0900, Dan Kogai wrote:
> On Saturday, April 6, 2002, at 12:18, Jarkko Hietaniemi wrote:
> >> P.S.  Does utf8 support surrogates?  Surrogate pair is definitely the
> >
> > No.  Surrogates are solely for UTF-16.  There's no need for surrogates
> > in UTF-8 -- if we wanted to encode U+D800 using UTF-8, we *could* --
> > BUT we should not.  Encoding U+D800 as UTF-8 should not be attempted,
> > the whole surrogate space is a discontinuity in the Unicode code point
> > space reserved for the evils of UTF-16.
>
> Yes.  I know that.  My question is whether we support CONVERSION.
> Internals have nothing to do with that.  When we say UCS-2,
> \x{10000}-\x{10ffff} must be discarded or croak for error.  When we say

I suggest croak.

> UTF-16, however, we have to convert them into surrogate pairs when we
> convert and decode back to \x{10000}-\x{10ffff} when we decode.

Well, there seems to be

    Perl_utf16_to_utf8(pTHX_ U8* p, U8* d, I32 bytelen, I32 *newlen)

in utf8.c that seems to be doing surrogate arithmetic, but I think
that's not much used (if at all), and I cannot see a utf8_to_utf16.
(There's also

    Perl_utf16_to_utf8_reversed(pTHX_ U8* p, U8* d, I32 bytelen, I32 *newlen)

which first does a byteswap and then calls the non-reversed version.)
I can also see that Perl_utf16_to_utf8 is not EBCDIC-aware...

> FYI I have already cleaned up the UCS-2 part.  Now their canonical names
> are UCS-2BE and UCS-2LE (modules are renamed as well to be more canonical,
> ucs_2(be|le).pm.  Yes, underscore first).  UTF-32 is trivial because we
> only have to pack the ord value to 32-bit.  It's UTF-16 in question.
>
> If we want perl to be surrogates-free, then ironically we have to
> support UTF-16 because ucs_2*.pm simply let \x{D800}-\x{DFFF} in so far.
>
> Dan the Man with Too Many UnicodeS to tackle

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
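
For reference, the surrogate arithmetic behind the UTF-16 conversion
discussed above could look roughly like the following.  This is only a
sketch in plain C, not the code in utf8.c: the names utf16_encode,
utf16_decode and ucs2_encode are made up for illustration, the functions
work on single code points rather than buffers, and having the UCS-2
path reject surrogates as well as anything above U+FFFF is an assumption
about the stricter behaviour being asked for (the caller would croak on
a zero return).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16 code units.
 * Returns the number of units written (1 or 2), or 0 if cp cannot be
 * encoded (a surrogate value, or anything beyond U+10FFFF). */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {                 /* BMP: one unit, no arithmetic */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                      /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}

/* Hypothetical helper: decode one code point from UTF-16 code units.
 * Returns the number of units consumed (1 or 2), or 0 on malformed
 * input (empty input, unpaired or misordered surrogate). */
static size_t utf16_decode(const uint16_t *in, size_t len, uint32_t *cp)
{
    assert(in != NULL && cp != NULL);
    if (len == 0)
        return 0;
    uint16_t hi = in[0];
    if (hi < 0xD800 || hi > 0xDFFF) {   /* ordinary BMP unit */
        *cp = hi;
        return 1;
    }
    if (hi > 0xDBFF || len < 2)         /* low surrogate first, or truncated pair */
        return 0;
    uint16_t lo = in[1];
    if (lo < 0xDC00 || lo > 0xDFFF)     /* high surrogate not followed by a low one */
        return 0;
    *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    return 2;
}

/* Hypothetical helper: strict UCS-2 encoding of one code point.
 * Returns 1 on success, 0 if cp is outside the BMP or is a surrogate
 * (assumed policy: the caller reports an error instead of discarding). */
static int ucs2_encode(uint32_t cp, uint16_t *out)
{
    assert(out != NULL);
    if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    *out = (uint16_t)cp;
    return 1;
}
```

With these, utf16_encode(0x10000, u) fills u with the pair D800 DC00,
utf16_decode on that pair gives back 0x10000, and ucs2_encode(0x10000, &x)
returns 0, which is the point where a UCS-2 encoder would croak.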