Jarkko Hietaniemi wrote:
> 
> Tim Bunce wrote:
> 
> > Am I right in thinking that perl's internal utf8 representation
> > represents surrogates as a single (4 byte) code point and not as
> > two separate code points?
> 
> Mmmh.  Right and wrong... as a single code point, yes, since 
> the real UTF-8 doesn't do surrogates which are only a UTF-16 
> thing.  4 bytes, no, 3 bytes.

Surrogates are the way UTF-16 encodes non-BMP (>16 bit) 
codepoints.

BMP code points are the Unicode codepoints 0 to 0xFFFF (16 bit).
The non-BMP codepoints are 0x10000-0x10FFFF (up to 21 bit; a
surrogate pair carries the 20-bit offset from 0x10000).
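
To make the surrogate arithmetic concrete, here is a minimal sketch in
Python (my choice of language for illustration; the function names are
made up, not from perl or any library). It assumes the standard UTF-16
rule: subtract 0x10000, then put the top 10 bits in a high surrogate
(0xD800-0xDBFF) and the bottom 10 bits in a low surrogate
(0xDC00-0xDFFF).

```python
def to_surrogate_pair(cp):
    """Split a non-BMP code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % cp)
    offset = cp - 0x10000             # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    hi, lo = to_surrogate_pair(0x10400)
    print("U+10400 -> %04X %04X" % (hi, lo))
```

For U+10400 this should give the pair D801 DC00.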

The "shortest form" security requirement requires the BMP and
non-BMP codepoints (encoded as surrogates in UTF-16) be encoded 
in the minimal number of bytes. For UTF-8 this means:

1-3 UTF-8 bytes encode the BMP
------------------------------
1 UTF-8 byte  = 7 bits
2 UTF-8 bytes = 5 bits + 6 bits = 11 bits
3 UTF-8 bytes = 4 bits + 6 bits + 6 bits = 16 bits

4 UTF-8 bytes encode the non-BMP
--------------------------------
4 UTF-8 bytes = 3 bits + 6 bits + 6 bits + 6 bits = 21 bits
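
The bit layout above can be sketched as a small hand-rolled encoder.
This is only an illustration in Python (utf8_encode is a hypothetical
helper, not perl's implementation), and for brevity it does not reject
the surrogate range 0xD800-0xDFFF the way a strict encoder would.

```python
def utf8_encode(cp):
    """Encode one code point in the shortest UTF-8 form (1 to 4 bytes)."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:                       # 7 bits
        return bytes([cp])
    if cp < 0x800:                      # 5 + 6 = 11 bits
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 4 + 6 + 6 = 16 bits
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                  # 3 + 6 + 6 + 6 = 21 bits
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("beyond U+10FFFF")


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x10400):
        print("U+%04X -> %s" % (cp, utf8_encode(cp).hex()))
```

Each branch picks the first length that fits, which is what "shortest
form" amounts to: a code point below 0x80 never gets a 2-byte form, and
so on.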

I suspect there is confusion in the original posting about 
what is meant by surrogates. Perhaps the question actually 
was intended to be: "when converting from UTF-16 to UTF-8 do 
the surrogate pairs become 4 or 6 UTF-8 bytes?".
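
To show the difference between the two answers, here is a hedged
Python sketch. The 4-byte result is ordinary UTF-8. The 6-byte result
encodes each UTF-16 code unit separately as 3 bytes, which is the
scheme described in TR26 (CESU-8); cesu8_encode is a name I made up,
and it leans on Python's "surrogatepass" error handler to let lone
surrogates through.

```python
def cesu8_encode(s):
    """Encode each UTF-16 code unit of s as its own UTF-8 sequence."""
    utf16 = s.encode("utf-16-be")
    # Split the UTF-16 byte stream into 16-bit code units.
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    # "surrogatepass" allows surrogate code units to be encoded
    # as 3-byte sequences instead of raising an error.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


if __name__ == "__main__":
    ch = "\U00010400"                  # one non-BMP code point
    print(ch.encode("utf-8").hex())    # standard UTF-8: 4 bytes
    print(cesu8_encode(ch).hex())      # per-surrogate form: 6 bytes
```

For BMP-only text the two forms are identical; they differ only for
non-BMP code points.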

> > This is the form that Oracle call AL32UTF8.
> 
> Does this
> 
> http://www.unicode.org/reports/tr26/
> 
> look like Oracle's older (?) UTF8?
> 
> > What would be the effect of setting SvUTF8_on(sv) on a valid utf8
> > byte string that used surrogates? Would there be problems?
> 
> You would get out the surrogate code points from the sv, not the
> supplementary plane code point the surrogate pairs are encoding.
> Depends what you do with the data: this might be okay, might not.
> Since it's valid UTF-8, nothing should croak perl-side.
> 
> > (For example, a string returned from Oracle when using the UTF8
> > character set instead of the newer AL32UTF8 one.)
> >
> > Tim.
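
As a rough illustration of "you would get out the surrogate code
points", here is a Python stand-in (this is not perl's internals, and
combine_surrogates is a hypothetical helper). Decoding the 6-byte form
leniently yields two surrogate code points; a separate step is needed
to turn them back into the one non-BMP code point.

```python
def combine_surrogates(s):
    """Merge surrogate pairs in s into single non-BMP code points."""
    # Round-trip through UTF-16: "surrogatepass" writes the surrogates
    # out as raw code units, and the decoder pairs them up again.
    return s.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


if __name__ == "__main__":
    raw = b"\xed\xa0\x81\xed\xb0\x80"            # U+10400, 6-byte form
    as_is = raw.decode("utf-8", "surrogatepass")  # lenient decode
    print([hex(ord(c)) for c in as_is])           # two surrogates
    fixed = combine_surrogates(as_is)
    print([hex(ord(c)) for c in fixed])           # one code point
```

Note that Python's strict UTF-8 decoder rejects these bytes outright,
so it is stricter here than the perl behaviour described above.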

-- 
Brian Stell
