Re: [Encode] UCS/UTF mess and Surrogate Handlings

Jarkko Hietaniemi Fri, 05 Apr 2002 07:57:10 -0800

On Sat, Apr 06, 2002 at 01:08:11AM +0900, Dan Kogai wrote:
> On Saturday, April 6, 2002, at 12:18 , Jarkko Hietaniemi wrote:
> >> P.S.  Does utf8 support surrogates?  Surrogate pair is definitely the
> >
> > No.  Surrogates are solely for UTF-16.  There's no need for surrogates
> > in UTF-8 -- if we wanted to encode U+D800 using UTF-8, we *could* --
> > BUT we should not.  Encoding U+D800 as UTF-8 should not be attempted,
> > the whole surrogate space is a discontinuity in the Unicode code point
> > space reserved for the evils of UTF-16.
> 
> Yes.  I know that.  My question is whether we support CONVERSION.  
> Internals have nothing to do with that.  When we say UCS-2, 
> \x{10000}-\x{10ffff} must be discarded or we must croak.  When we say


I suggest croak.
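
To make the "croak" option concrete, here is a minimal sketch in C of a UCS-2BE encoder that refuses anything above U+FFFF. The name ucs2be_encode is hypothetical and is not part of Encode or utf8.c; the -1 return stands in for wherever the caller would croak.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Encode or utf8.c.
 * Writes one code point as big-endian UCS-2 into out[0..1].
 * Returns 0 on success, or -1 if the code point does not fit in
 * 16 bits; the caller is expected to turn -1 into a croak. */
static int ucs2be_encode(uint32_t cp, uint8_t out[2])
{
    assert(out != NULL);
    if (cp > 0xFFFFu)
        return -1;                  /* U+10000 and above: not UCS-2 */
    out[0] = (uint8_t)(cp >> 8);    /* high byte first (big-endian) */
    out[1] = (uint8_t)(cp & 0xFFu);
    return 0;
}
```

This sketch deliberately does not filter U+D800..U+DFFF; whether UCS-2 output should also reject those is a separate decision.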

> UTF-16, however, we have to convert them into surrogate pairs when we 
> encode and convert back to \x{10000}-\x{10ffff} when we decode.
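
For reference, the surrogate-pair conversion described in the quoted text comes down to the standard UTF-16 arithmetic. The sketch below is illustrative only: utf16_encode and utf16_decode_pair are hypothetical names, and this is not the code in utf8.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers illustrating the standard UTF-16 arithmetic. */

/* Encode cp as one or two UTF-16 code units.
 * Returns the number of units written, or 0 if cp is itself a
 * surrogate code point or lies above U+10FFFF. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        return 0;
    if (cp < 0x10000u) {
        out[0] = (uint16_t)cp;                      /* BMP: one unit */
        return 1;
    }
    cp -= 0x10000u;                                 /* 20 bits remain */
    out[0] = (uint16_t)(0xD800u + (cp >> 10));      /* high surrogate */
    out[1] = (uint16_t)(0xDC00u + (cp & 0x3FFu));   /* low surrogate  */
    return 2;
}

/* Combine a high/low surrogate pair into a code point.
 * Returns UINT32_MAX if the two units do not form a valid pair. */
static uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800u || hi > 0xDBFFu || lo < 0xDC00u || lo > 0xDFFFu)
        return UINT32_MAX;
    return 0x10000u + (((uint32_t)hi - 0xD800u) << 10)
                    + ((uint32_t)lo - 0xDC00u);
}
```

For example, U+10000 encodes as the pair D800 DC00 and U+10FFFF as DBFF DFFF, and decoding either pair gives back the original code point.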

Well, there seems to be

  Perl_utf16_to_utf8(pTHX_ U8* p, U8* d, I32 bytelen, I32 *newlen)

in utf8.c that appears to do the surrogate arithmetic, but I don't think
it's used much (if at all), and I cannot see a utf8_to_utf16 counterpart.
(There's also

  Perl_utf16_to_utf8_reversed(pTHX_ U8* p, U8* d, I32 bytelen, I32 *newlen)

which first does a byteswap and then calls the non-reversed version).
I can also see that Perl_utf16_to_utf8 is not EBCDIC-aware...
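
The "byteswap first, then reuse the other routine" pattern mentioned above can be sketched as follows. This is not the code from utf8.c; utf16_byteswap is a hypothetical name, and the sketch assumes the buffer may be modified in place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: swap each pair of bytes in place, turning
 * UTF-16LE into UTF-16BE or the other way round.
 * Returns 0 on success, or -1 if bytelen is odd (a truncated unit). */
static int utf16_byteswap(uint8_t *p, size_t bytelen)
{
    assert(p != NULL);
    if (bytelen % 2 != 0)
        return -1;
    for (size_t i = 0; i < bytelen; i += 2) {
        uint8_t tmp = p[i];
        p[i]     = p[i + 1];
        p[i + 1] = tmp;
    }
    return 0;
}
```

After the swap, a converter written for one byte order can be used unchanged, so only one copy of the surrogate logic is needed.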

> FYI I have already cleaned up UCS-2 part.  Now their canonical names are 
> UCS-2BE and UCS-2LE (modules are renamed as well to be more canonical, 
> ucs_2(be|le).pm.  Yes, underscore first).  UTF-32 is trivial because we 
> only have to pack the ord value to 32-bit.  It's UTF-16 in question.
> 
> If we want perl to be surrogates-free, then ironically we have to 
> support UTF-16 because ucs_2*.pm simply lets \x{D800}-\x{DFFF} through so far.
> 
> Dan the Man with Too Many UnicodeS to tackle
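
The UTF-32 case in the quoted text ("pack the ord value to 32-bit") might look like the sketch below in C. The name utf32be_encode is hypothetical. Rejecting U+D800..U+DFFF here is an assumption made to match the "surrogates-free" goal; nothing in this thread says the actual module does so.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write one code point as big-endian UTF-32.
 * Returns 0 on success, or -1 for values above U+10FFFF and
 * (by assumption) for surrogate code points U+D800..U+DFFF. */
static int utf32be_encode(uint32_t cp, uint8_t out[4])
{
    assert(out != NULL);
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        return -1;
    out[0] = (uint8_t)(cp >> 24);
    out[1] = (uint8_t)((cp >> 16) & 0xFFu);
    out[2] = (uint8_t)((cp >> 8) & 0xFFu);
    out[3] = (uint8_t)(cp & 0xFFu);
    return 0;
}
```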

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
