In a message dated 2001-06-19 6:46:14 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  If you take the original UCS-2 to UTF-8 mechanism
>  (back when UTF-8 was called UTF-FSS) and apply it to surrogates, the
>  sequence D800 DC00 would map to the sequence ED A0 80 ED B0 80.

Very true:
U+D800 U+DC00  ==  ED A0 80 ED B0 80
(assuming those are valid code points, which was true before 1993)
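
For illustration, here is a rough Python sketch (no vendor's actual code, just 
the original 1-to-3-byte patterns) of that per-code-unit conversion, applied 
blindly to the two surrogate code units:

    # Pre-1993-style UCS-2 -> UTF-8: encode each 16-bit code unit on its own,
    # surrogates included, using only the 1-, 2-, and 3-byte forms.
    def ucs2_unit_to_utf8(u):
        if u < 0x80:
            return bytes([u])
        if u < 0x800:
            return bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
        return bytes([0xE0 | (u >> 12),
                      0x80 | ((u >> 6) & 0x3F),
                      0x80 | (u & 0x3F)])

    encoded = b"".join(ucs2_unit_to_utf8(u) for u in [0xD800, 0xDC00])
    print(" ".join(f"{b:02X}" for b in encoded))   # ED A0 80 ED B0 80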

>  The sequence D800 DC00 was changed in UTF-16 to represent U+10000. If one
>  did not correct the UCS-2 software,

EXACTLY.  That is my point.  It is the transformation from UCS-2 to UTF-16 
that needs to be corrected, NOT the conversion to and from UTF-8.

>  and simply interpreted it according to UTF-16 semantics,
>  then one would end up with a (flawed) UTF-8 sequence representing U+10000.

U+10000  ==>  (UTF-16) D800 DC00  ==>  (UTF-8) F0 90 80 80
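
A corrected converter pairs the surrogates into a single scalar value first 
and only then emits UTF-8 (a minimal Python sketch; the function names are 
merely illustrative):

    def surrogate_pair_to_scalar(high, low):
        # Combine a high/low surrogate pair into one supplementary code point.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    def utf8_encode(cp):
        # Standard 1- to 4-byte UTF-8 encoding of a code point.
        if cp < 0x80:
            return bytes([cp])
        if cp < 0x800:
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp < 0x10000:
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

    cp = surrogate_pair_to_scalar(0xD800, 0xDC00)            # 0x10000
    print(" ".join(f"{b:02X}" for b in utf8_encode(cp)))     # F0 90 80 80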

ED A0 80 ED B0 80 represents the two unpaired (but coincidentally 
consecutive) code points 0xD800 and 0xDC00, which is why it fulfills 
definition D29, the one stating that non-characters and unpaired surrogates 
have to be round-tripped.
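
Under that round-tripping reading, a lenient decoder simply turns each 
three-byte group back into the 16-bit value it came from, with no pairing 
attempted.  A quick Python sketch (it handles only the 1- to 3-byte forms 
needed here):

    def decode_lenient(data):
        cps, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:                       # 1-byte form
                cps.append(b); i += 1
            elif b < 0xE0:                     # 2-byte form
                cps.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)); i += 2
            else:                              # 3-byte form
                cps.append(((b & 0x0F) << 12) |
                           ((data[i + 1] & 0x3F) << 6) |
                           (data[i + 2] & 0x3F)); i += 3
        return cps

    print([hex(c) for c in decode_lenient(bytes.fromhex("EDA080EDB080"))])
    # ['0xd800', '0xdc00']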

>  This doesn't mean it was the correct thing to do. The ideal case would have
>  been to correct the software when there were no supplementary characters
>  (those requiring representation with surrogate pairs) that would cause a
>  difference in interpretation between UTF-16 and UCS-2. People like database
>  vendors often have a huge requirement for stability, and must provide their
>  customers with solutions that are bug-for-bug compatible with older versions
>  for quite some time into the future. Yet there was a long period of time in
>  which to deprecate the older UCS-2 solution.

Absolutely.  There was a time when every line of code that I wrote having to 
do with Unicode assumed that all code points were 16 bits long and could fit 
in an unsigned short, and everything was nice and neat and orderly.  Like 
many others, I was somewhat disappointed when surrogates came along and I had 
to start playing the variable-length game.  Some of my code (for internal use 
only) was not corrected until well after 1993.

But none of that is the fault of the Unicode Consortium or ISO/IEC 
JTC1/SC2/WG2 for failing to warn me that supplementary characters were 
coming, some day.

I would not knowingly write code that failed to handle the Unicode code point 
U+0220, even though no character is currently assigned to that position.  The 
same is true of U+10000 through U+10FFFF.  Even the non-characters have to be 
handled, in their own way.
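
To make "handled, in their own way" concrete, here is a small Python 
classification sketch (the surrogate and noncharacter ranges are the standard 
ones; everything else up to U+10FFFF is an ordinary code point, assigned or 
not):

    def classify(cp):
        if not 0 <= cp <= 0x10FFFF:
            return "out of range"
        if 0xD800 <= cp <= 0xDFFF:
            return "surrogate code point"
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return "noncharacter"
        return "ordinary code point (assigned or not)"

    for cp in (0x0220, 0x10000, 0xFFFE, 0x10FFFF):
        print(f"U+{cp:04X}: {classify(cp)}")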

What I am trying to do is refute claims like this one:

>  As a matter of fact, Oracle supported UTF-8 far earlier than surrogate or
>  4-byte encoding was introduced.

when in fact there was NEVER a time in the history of UTF-FSS, UTF-2, or 
UTF-8 that 4-byte encodings were not part of the specification.

And I am trying to show that, while actual assigned supplementary characters 
may not have appeared until Unicode 3.1, the *mechanism* to support them has 
been in place for years and years.  Waiting until characters were assigned 
outside the BMP to start working on the UCS-2 problem is like waiting until 
2000-01-01 to start working on the Y2K problem.

I think I am basically in agreement with Mark Davis here, which is good, 
because he is the expert and authority and I should try to ensure that my 
understanding matches his knowledge.

-Doug Ewell
 Fullerton, California
