Jarkko Hietaniemi wrote:
>
> Tim Bunce wrote:
> >
> > Am I right in thinking that perl's internal utf8 representation
> > represents surrogates as a single (4 byte) code point and not as
> > two separate code points?
>
> Mmmh. Right and wrong... as a single code point, yes, since
> the real UTF-8 doesn't do surrogates which are only a UTF-16
> thing. 4 bytes, no, 3 bytes.
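
To make the 3-byte versus 4-byte point just quoted concrete, here is a minimal sketch. It is in Python rather than Perl, purely for illustration, and the helper name `utf8_len` is made up. It relies on Python's `surrogatepass` error handler, because a strict UTF-8 codec refuses to encode a surrogate code point at all.

```python
def utf8_len(cp):
    """Return the number of bytes in the UTF-8 encoding of one code point.

    'surrogatepass' lets surrogate code points (U+D800..U+DFFF) through;
    Python's strict UTF-8 codec would raise UnicodeEncodeError for them.
    """
    return len(chr(cp).encode("utf-8", errors="surrogatepass"))


print(utf8_len(0xD800))   # a lone high-surrogate code point: 3
print(utf8_len(0x10000))  # first supplementary-plane code point: 4
```

A surrogate code point sits inside the 16-bit range, so it takes the 3-byte form; only code points from 0x10000 upward need 4 bytes.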

Surrogates are the way UTF-16 encodes non-BMP (>16 bit) code points.
BMP code points are the Unicode code points 0 to 0xFFFF (16 bit).
The non-BMP code points are 0x10000 to 0x10FFFF (up to 21 bits; a
surrogate pair carries the 20-bit offset from 0x10000).

The "shortest form" security requirement requires that both BMP and
non-BMP code points (the latter encoded as surrogates in UTF-16) be
encoded in the minimal number of bytes. For UTF-8 this means:

  1-3 UTF-8 bytes encode the BMP
  ------------------------------
  1 UTF-8 byte  = 7 bits
  2 UTF-8 bytes = 5 bits + 6 bits = 11 bits
  3 UTF-8 bytes = 4 bits + 6 bits + 6 bits = 16 bits

  4 UTF-8 bytes encode the non-BMP
  --------------------------------
  4 UTF-8 bytes = 3 bits + 6 bits + 6 bits + 6 bits = 21 bits

I suspect there is confusion in the original posting about what is
meant by surrogates. Perhaps the question was actually intended to be:
"when converting from UTF-16 to UTF-8, do the surrogate pairs become
4 or 6 UTF-8 bytes?".

> > This is the form that Oracle call AL32UTF8.
>
> Does this
>
> http://www.unicode.org/reports/tr26/
>
> look like Oracle's older (?) UTF8?
>
> > What would be the effect of setting SvUTF8_on(sv) on a valid utf8
> > byte string that used surrogates? Would there be problems?
>
> You would get out the surrogate code points from the sv, not the
> supplementary plane code point the surrogate pairs are encoding.
> Depends what you do with the data: this might be okay, might not.
> Since it's valid UTF-8, nothing should croak perl-side.
>
> > (For example, a string returned from Oracle when using the UTF8
> > character set instead of the newer AL32UTF8 one.)
> >
> > Tim.

-- 
Brian Stell
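
P.S. The "4 or 6 UTF-8 bytes" question can be sketched as follows. This is Python used only as an illustration, not what perl or Oracle actually do internally, and the function names `to_surrogates` and `from_surrogates` are made up for the example. It splits one supplementary code point into its UTF-16 surrogate pair, then encodes it both ways: once as a single code point (shortest form, 4 bytes) and once surrogate by surrogate (3 + 3 = 6 bytes, the layout described in tr26).

```python
def to_surrogates(cp):
    """Split a supplementary code point into a UTF-16 (high, low) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000  # 20-bit value carried by the pair
    return 0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)


def from_surrogates(hi, lo):
    """Recombine a UTF-16 surrogate pair into one code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


cp = 0x1D11E                      # MUSICAL SYMBOL G CLEF
hi, lo = to_surrogates(cp)        # 0xD834, 0xDD1E

# Shortest form: one code point, 4 bytes.
four = chr(cp).encode("utf-8")

# Each surrogate encoded on its own: 3 bytes apiece, 6 in total.
# 'surrogatepass' is needed because strict UTF-8 rejects surrogates.
six = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in (hi, lo))

# Decoding the 6-byte form yields two surrogate code points,
# not the single supplementary code point they stand for.
decoded = six.decode("utf-8", "surrogatepass")

print(four.hex(), six.hex(), [hex(ord(c)) for c in decoded])
```

The last step matches the behaviour Jarkko describes: reading the 6-byte form as UTF-8 gives back the two surrogates, and recombining them (here via `from_surrogates`) is a separate step the consumer has to perform.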
