Re: Unicode 4.0 BETA available for review

Kenneth Whistler Thu, 27 Feb 2003 23:36:09 -0800

Stefan Persson suggested:

> >Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value
> >sequences. There were two types:
> >
> >   a. 0xC0 0x80 for U+0000 (instead of 0x00)
> >   b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000
> >      (instead of 0xF0 0x90 0x80 0x80)
> >  
> >
> Ah, but encoding NULL as a surrogate character and then encoding those
> two surrogates as three bytes each, for a total of 6 bytes per
> character, would also be technically possible (though not legal), right?


I'm not sure what you are talking about, here.

First of all, there is no such thing as a "surrogate character",
under the terminology currently adopted by the standard.

There are surrogate code points: U+D800..U+DFFF. Those can
*never* be assigned to any abstract character.

Then there are surrogate code units: 0xD800..0xDFFF. Those are
used in pairs in the UTF-16 encoding form to represent a single
supplementary character (one encoded outside the BMP, in the range
U+10000..U+10FFFF).
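
As a rough illustration of how a supplementary code point maps onto a
pair of surrogate code units in UTF-16, here is a minimal Python sketch.
The function name to_surrogate_pair is hypothetical, chosen only for
this example; the arithmetic is the usual UTF-16 mapping.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate code unit
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate code unit
    return high, low


# U+10000 is the first supplementary code point.
high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))
```

BMP code points, U+0000 included, are never split this way: the sketch
raises ValueError for them, and they take a single 16-bit code unit in
UTF-16.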

NULL is U+0000. 
  Its representation in UTF-32 is <0x00000000>.
  Its representation in UTF-16 is <0x0000>.
  Its representation in UTF-8  is <0x00>.
  
Period. End of story. Anything else is nonconformant to the standard.
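
The three representations above can be checked with a short Python
sketch. It assumes Python 3's built-in codecs, which follow the current
(stricter) definition of UTF-8 rather than the Unicode 3.0 wording about
irregular sequences; the helper name is_valid_utf8 is hypothetical.

```python
nul = "\u0000"

utf32 = nul.encode("utf-32-be")   # one 32-bit code unit
utf16 = nul.encode("utf-16-be")   # one 16-bit code unit
utf8 = nul.encode("utf-8")        # one 8-bit code unit


def is_valid_utf8(data):
    """Return True if the strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The two sequences from the quoted message.
overlong_nul = bytes([0xC0, 0x80])
surrogates_as_utf8 = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])

print(utf32.hex(), utf16.hex(), utf8.hex())
print(is_valid_utf8(utf8),
      is_valid_utf8(overlong_nul),
      is_valid_utf8(surrogates_as_utf8))
```

Under those assumptions, the strict decoder accepts only the single byte
0x00 for U+0000 and rejects both sequences from the quoted message.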

--Ken
