RE: PDUTR #26 posted

Carl W. Brown Mon, 17 Sep 2001 12:35:55 -0700
Mark,

> - Just because it is in IANA does *not* mean that everyone will 
> support it.
> There are many encodings in IANA supported by very few people. Nor does it
> mean that it is intended for widespread public use. The IANA registry is
> also used as a general purpose registry, even for encodings that have
> limited or restricted use.

True, but even if it does not have widespread use, it is a PUBLIC character set and is 
intended for some public communications.

> 
> - A significant reason for CESU-8 garnering enough support was that its
> introduction allows the definition of UTF-8 itself to be tightened, to
> formally exclude the 3-byte surrogates both in reading and writing.

I do not understand your point.

From TR27:

"The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of 
five- and six-byte sequences to encode characters that are outside the range of the 
Unicode character set; those five- and six-byte sequences are illegal for the use of 
UTF-8 as a transformation of Unicode characters. ISO/IEC 10646 does not allow mapping 
of unpaired surrogates, nor U+FFFE and U+FFFF (but it does allow other noncharacters)."

CESU-8 is currently a non-compliant UTF-8 variant that is illegal to use in 3.1 
compliant software.  If a user does not upgrade their UCS-2 software they can still be 
Unicode compliant with older versions of Unicode that do not support the non-BMP 
characters.
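
As an illustration of why the two forms are incompatible, here is a minimal Python 
sketch. The function name cesu8_encode is hypothetical and not taken from any library. 
It assumes CESU-8 writes each UTF-16 surrogate of a non-BMP character as its own 
three-byte sequence, whereas UTF-8 writes the character as one four-byte sequence.

```python
def cesu8_encode(text):
    """Illustrative CESU-8 encoder (hypothetical helper, not a library API)."""
    out = bytearray()
    for ch in text:
        if ord(ch) > 0xFFFF:
            # Split the character into its two UTF-16 surrogates, then
            # write each surrogate as a separate 3-byte sequence.
            units = ch.encode("utf-16-be")
            for i in (0, 2):
                surrogate = chr(int.from_bytes(units[i:i + 2], "big"))
                out += surrogate.encode("utf-8", "surrogatepass")
        else:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)


sample = "\U00010400"                      # a character outside the BMP
print(cesu8_encode(sample).hex())          # eda081edb080 (six bytes)
print(sample.encode("utf-8").hex())        # f0909080 (four bytes)
```

A strict UTF-8 decoder, such as Python's own, rejects the six-byte form, so the same 
text cannot be exchanged between the two encodings without conversion.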

If you accept CESU-8 then you are providing two divergent and incompatible standards.

If a company uses a private protocol only within its own software, then it can do 
anything that it wants.  There is no need for Unicode to do anything.  The only 
reason that you might get involved is that different companies will use this 
standard and all have to implement the protocol in the same way.  This by definition 
is a public standard.

I suspect that the only reason that the committee has not rejected the proposal out of 
hand is that they acknowledge that there is a problem.  I suspect that PeopleSoft is 
not the only company with this problem.  

I feel that we need to do two things: help people migrate, and end up with a single 
compatible standard.

First, I think that we need to promote code point ordering support in applications 
that may do UTF transforms.  We need to disseminate code like Markus's code point 
order routines.  Because I support dynamic Unicode transforms in xIUA, I use code 
point ordering as the default, either as supplied by ICU or using my own 
implementations derived from ICU code.
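
A rough Python sketch of the general idea follows. It is not Markus's routine or the 
xIUA code, and the helper names are hypothetical. It assumes the usual fix-up 
approach: before comparing UTF-16 code units, move the surrogate range above 
U+E000..U+FFFF so that a unit-by-unit comparison gives the same order as comparing 
code points (and therefore the same order as binary UTF-8).

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fixup(unit):
    """Remap one UTF-16 code unit so surrogates sort after all BMP units."""
    if unit >= 0xE000:
        return unit - 0x0800   # E000..FFFF moves down to D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000   # D800..DFFF moves up to F800..FFFF
    return unit                # units below D800 are unchanged


def code_point_order_key(text):
    """Sort key: UTF-16 code units adjusted to compare in code point order."""
    return [fixup(u) for u in utf16_units(text)]


bmp, supplementary = "\uff5e", "\U00010000"
# Raw UTF-16 order puts the supplementary character first (D800 < FF5E) ...
print(utf16_units(bmp) > utf16_units(supplementary))                     # True
# ... while the adjusted key puts it last, as code point order requires.
print(code_point_order_key(bmp) < code_point_order_key(supplementary))   # True
```

With a key like this, a UCS-2 or UTF-16 application sorts strings the same way a 
UTF-8 or UTF-32 application would.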

Second, because the problem is that many systems still do not fully support the 
supplementary planes, we could amend the UCS-2 character set to treat the surrogate 
range as noncharacters.  We could then amend CESU-8 to exclude surrogates as well.  It 
would become a subset of UTF-8 (1 to 3 byte sequences only) that would work for BMP 
characters only.  By using a CESU-8 or UCS-2 character set you would warn any process 
that communicates with your application that you only support BMP characters.  This 
would be a very useful public standard.
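
A minimal Python sketch of what such a BMP-only subset would mean for an encoder 
follows. The function name encode_bmp_only is hypothetical and this is only one 
possible reading of the proposal: text is accepted only if every character is a 
non-surrogate BMP character, so the output is always valid UTF-8 made of 1 to 3 byte 
sequences.

```python
def encode_bmp_only(text):
    """Encode text as UTF-8, refusing anything outside the proposed BMP-only subset."""
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Supplementary-plane characters would need a 4-byte sequence.
            raise ValueError(f"U+{cp:04X} is outside the BMP")
        if 0xD800 <= cp <= 0xDFFF:
            # Surrogates are excluded, so they can never appear in the output.
            raise ValueError(f"U+{cp:04X} is a surrogate code point")
    return text.encode("utf-8")


print(encode_bmp_only("A\u00e9\u20ac").hex())   # 41c3a9e282ac
```

A receiver that sees this label knows that no four-byte sequences and no encoded 
surrogates will arrive, and any ordinary UTF-8 decoder can still read the data.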

Carl