RE: Nicest UTF
From: Lars Kristan
I agree, but not for the reasons you mentioned. There is one other important
advantage: UTF-8 is stored in a way that permits storing invalid sequences.
I will need to elaborate on that, of course.

Not true for UTF-8. UTF-8 can only store valid sequences of code points, in the range U+0000 to U+D7FF and U+E000 to U+10FFFF (that is, excluding the surrogate code points).
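This is easy to check with any strict UTF-8 codec. A small sketch using Python's built-in codec (any conformant decoder behaves the same way):

```python
# Strict UTF-8 rejects surrogate code points, whether you try to
# encode one or decode the byte pattern that would represent it.
try:
    "\ud800".encode("utf-8")          # lone surrogate U+D800
except UnicodeEncodeError:
    print("lone surrogate rejected on encode")

try:
    b"\xed\xa0\x80".decode("utf-8")   # bytes that would mean U+D800
except UnicodeDecodeError:
    print("surrogate byte sequence rejected on decode")

# The boundaries of the valid scalar range encode fine:
assert "\ud7ff".encode("utf-8") == b"\xed\x9f\xbf"      # U+D7FF
assert "\ue000".encode("utf-8") == b"\xee\x80\x80"      # U+E000
assert "\U0010ffff".encode("utf-8") == b"\xf4\x8f\xbf\xbf"  # U+10FFFF
```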


But it is true that there are non-standard extensions of UTF-8 (such as Sun's variant for Java) that escape some byte values normally generated by standard UTF-8 (notably the single byte 0x00 representing U+0000), that represent isolated or incorrectly paired surrogate code points which may be present in an otherwise invalid Unicode string, or that represent non-BMP characters with 6 bytes, where each group of 3 bytes encodes a surrogate code unit (not a code point!).
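The NUL escape in Sun's variant ("Modified UTF-8") is the two-byte overlong sequence 0xC0 0x80, which standard UTF-8 forbids. A minimal sketch, showing only the NUL escape (the Java variant also changes non-BMP handling, which is not shown here):

```python
# Standard UTF-8 rejects the overlong encoding 0xC0 0x80 of U+0000:
try:
    b"\xc0\x80".decode("utf-8")
except UnicodeDecodeError:
    print("overlong NUL rejected by strict UTF-8")

# Hypothetical helper illustrating the Modified-UTF-8 NUL escape:
# encode normally, then replace each embedded NUL byte with 0xC0 0x80
# so the output contains no 0x00 byte (handy for C-style strings).
def modified_utf8_nul(s: str) -> bytes:
    return s.encode("utf-8").replace(b"\x00", b"\xc0\x80")

assert modified_utf8_nul("a\x00b") == b"a\xc0\x80b"
```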

Only the CESU-8 variant of UTF-8 is documented and standardized: a non-BMP character is represented as two groups of 3 bytes, encoding the two surrogate code units that UTF-16 would use for the same character. CESU-8 is less efficient than UTF-8, but even so it does not allow representing invalid Unicode strings containing surrogate *code points*, which are not characters (I did not say *code units*), even if they are apparently correctly "paired". The concept of paired surrogates exists only within the UTF-16 encoding scheme, which represents strings not as streams of characters coded with code points, but as streams of 16-bit code units.
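To make the 3+3-byte scheme concrete, here is a small sketch of a CESU-8 encoder for well-formed input (the function name `cesu8` is mine; Python's `surrogatepass` error handler is used only to emit each surrogate code unit as a 3-byte group):

```python
def cesu8(s: str) -> bytes:
    """Encode a valid Unicode string as CESU-8 (sketch)."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split the non-BMP code point into its UTF-16 surrogate
            # pair, then encode each 16-bit code unit as 3 bytes.
            cp -= 0x10000
            hi = 0xD800 + (cp >> 10)
            lo = 0xDC00 + (cp & 0x3FF)
            out += chr(hi).encode("utf-8", "surrogatepass")
            out += chr(lo).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")   # BMP: identical to UTF-8
    return bytes(out)

# U+10400 takes 4 bytes in UTF-8 but 6 in CESU-8:
assert "\U00010400".encode("utf-8") == b"\xf0\x90\x90\x80"
assert cesu8("\U00010400") == b"\xed\xa0\x81\xed\xb0\x80"
```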

If you need extensions like this, it is because you need to represent data which is not valid Unicode text. Such an extended scheme is not a UTF but a serialization format for that type of data (even if the type can represent all instances of valid Unicode text).
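One well-known scheme of exactly this kind is Python's `surrogateescape` error handler (PEP 383): arbitrary non-UTF-8 bytes are smuggled into a string as lone surrogates U+DC80..U+DCFF and round-trip back losslessly. The resulting string is not valid Unicode text, and the handler is a serialization convention, not a UTF:

```python
raw = b"valid \xff\xfe bytes"            # not valid UTF-8
s = raw.decode("utf-8", "surrogateescape")
assert s == "valid \udcff\udcfe bytes"   # lone surrogates stand in
assert s.encode("utf-8", "surrogateescape") == raw  # lossless round-trip

# A strict encode of the same string fails, confirming it is not
# valid Unicode text:
try:
    s.encode("utf-8")
except UnicodeEncodeError:
    print("not encodable as standard UTF-8")
```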

