Stefan Persson suggested:

> >Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value
> >sequences. There were two types:
> >
> >  a. 0xC0 0x80 for U+0000 (instead of 0x00)
> >  b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90 0x80 0x80)
> >
>
> Ah, but encoding NULL as a surrogate character and then encoding those
> two surrogates as three bytes, making totally 6 bytes a character, would
> also be technically possible (though not legal), right?

I'm not sure what you are talking about, here.

First of all, there is no such thing as a "surrogate character", under
the terminology currently adopted by the standard.

There are surrogate code points: U+D800..U+DFFF. Those can *never* be
assigned to any abstract character.

Then there are surrogate code units: 0xD800..0xDFFF. Those are used in
pairs in the UTF-16 encoding form to represent a single supplementary
character (one encoded off the BMP).

NULL is U+0000.

Its representation in UTF-32 is <0x00000000>.
Its representation in UTF-16 is <0x0000>.
Its representation in UTF-8 is <0x00>.

Period. End of story. Anything else is nonconformant to the standard.

--Ken
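
For anyone who wants to check the above against a running codec, here is a minimal sketch in Python 3. It assumes that Python's built-in strict "utf-8" codec behaves as a conformant decoder for these particular inputs; that is a property of one implementation, not a statement from the standard. The helper name is_valid_utf8 and the variable names are made up for this illustration.

```python
# Illustrative check using Python 3's built-in strict codecs.
# Assumption: the strict "utf-8" codec rejects non-shortest forms and
# surrogate code points. is_valid_utf8 is a hypothetical helper name.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

NUL = "\u0000"

# The single representation of U+0000 in each encoding form
# (big-endian byte order chosen so that no BOM is emitted).
null_forms = {
    "utf-8": NUL.encode("utf-8"),
    "utf-16": NUL.encode("utf-16-be"),
    "utf-32": NUL.encode("utf-32-be"),
}

# The two "irregular" sequences from the quoted message.
overlong_null = bytes([0xC0, 0x80])
surrogate_pair_as_utf8 = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])

# The shortest-form UTF-8 for U+10000.
shortest_u10000 = bytes([0xF0, 0x90, 0x80, 0x80])

if __name__ == "__main__":
    for name, encoded in null_forms.items():
        print(name, encoded.hex())
    print("C0 80 accepted:", is_valid_utf8(overlong_null))
    print("ED A0 80 ED B0 80 accepted:", is_valid_utf8(surrogate_pair_as_utf8))
    print("F0 90 80 80 accepted:", is_valid_utf8(shortest_u10000))
```

Under that assumption, the two irregular sequences are refused by the decoder, while the single byte 0x00 and the four-byte form of U+10000 are accepted.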