Mark,
> - Just because it is in IANA does *not* mean that everyone will
> support it.
> There are many encodings in IANA supported by very few people. Nor does it
> mean that it is intended for widespread public use. The IANA registry is
> also used as a general purpose registry, even for encodings that have
> limited or restricted use.
True, but even if it does not have widespread use, it is a PUBLIC character set and is
intended for some public communications.
>
> - A significant reason for CESU-8 garnering enough support was that its
> introduction allows the definition of UTF-8 itself to be tightened, to
> formally exclude the 3-byte surrogates both in reading and writing.
I do not understand your point.
From TR27:
"The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of
five- and six-byte sequences to encode characters that are outside the range of the
Unicode character set; those five- and six-byte sequences are illegal for the use of
UTF-8 as a transformation of Unicode characters. ISO/IEC 10646 does not allow mapping
of unpaired surrogates, nor U+FFFE and U+FFFF (but it does allow other noncharacters)."
CESU-8 is currently a non-compliant UTF-8 variant that is illegal to use in Unicode 3.1
compliant software. If a user does not upgrade their UCS-2 software they can still be
Unicode compliant with older versions of Unicode that do not support the non-BMP
characters.
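To make the divergence concrete, here is a minimal Python sketch of how the two forms differ for one supplementary character. The helper names `cesu8_encode` and `strict_utf8_accepts` are hypothetical, and the sketch assumes well-formed input text. It only illustrates the byte layouts, it is not a reference implementation.

```python
def cesu8_encode(text):
    """Encode text as CESU-8: every UTF-16 code unit gets its own 1-3 byte sequence."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form for a single surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def strict_utf8_accepts(data):
    """Return True if a strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ch = "\U00010400"          # a character outside the BMP
utf8 = ch.encode("utf-8")  # one 4-byte sequence
cesu8 = cesu8_encode(ch)   # two 3-byte sequences, one per surrogate

print(utf8.hex(), strict_utf8_accepts(utf8))
print(cesu8.hex(), strict_utf8_accepts(cesu8))
```

For BMP-only text the two encoders produce identical bytes. They only part ways once a surrogate pair appears, which is exactly where a strict UTF-8 reader rejects the CESU-8 form.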
If you accept CESU-8 then you are providing two divergent and incompatible standards.
If a company does not use a private protocol outside of their own software then they
can do anything that they want. There is no need for Unicode to do anything. The
only reason that you might get involved is that different companies will use this
standard and all have to implement the protocol in the same way. This by definition
is a public standard.
I suspect that the only reason that the committee has not rejected the proposal out of
hand is that they acknowledge that there is a problem. I suspect that PeopleSoft is
not the only company with this problem.
I feel that we need to do two things: help people migrate, and end up with a single
compatible standard.
First, I think that we need to promote code point ordering support in applications that
may do UTF transforms. We need to disseminate code like Markus's code point order
routines. Because I support dynamic Unicode transforms in xIUA, I use code point
ordering as the default, either as supplied by ICU or using my own implementations
derived from ICU code.
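As a rough illustration of the idea (this is not Markus's code, nor ICU's, nor what xIUA ships), the Python sketch below compares UTF-16 code units in code point order by remapping each unit before comparison. Surrogates are moved above U+E000..U+FFFF so that supplementary characters sort last, as they do in UTF-8 and UTF-32. The function names are hypothetical and well-formed UTF-16 is assumed.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fixup(unit):
    """Remap a code unit so that unit order matches code point order."""
    if unit >= 0xE000:
        return unit - 0x800    # U+E000..U+FFFF move down, below the surrogates
    if unit >= 0xD800:
        return unit + 0x2000   # surrogates move up to the top of the range
    return unit


def code_point_order_key(text):
    """Sort key: UTF-16 code units adjusted to compare in code point order."""
    return [fixup(u) for u in utf16_units(text)]


words = ["\U00010400", "\uFF5E", "abc", "\uE000"]
binary_utf16 = sorted(words, key=utf16_units)           # raw UTF-16 order
code_point = sorted(words, key=code_point_order_key)    # code point order

print([hex(ord(w[0])) for w in binary_utf16])
print([hex(ord(w[0])) for w in code_point])
```

In raw UTF-16 order the supplementary character U+10400 sorts before U+E000 and U+FF5E, because its lead surrogate is 0xD801. With the remapping it sorts after them, which matches what a UTF-8 or UTF-32 binary comparison gives.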
Second, because the problem is that many systems still do not fully support the
supplementary planes, we could amend the UCS-2 character set to exclude the surrogate
range as noncharacters.
We could then amend CESU-8 to exclude surrogates as well. It would become a subset of
UTF-8 (1 to 3 byte sequences only) that would work for BMP characters only. By using
a CESU-8 or UCS-2 character set you would warn any process that communicates with your
application that you only support BMP characters. This would be a very useful public
standard.
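A minimal sketch of what such a restricted encoder might look like, in Python. The names `is_bmp_only` and `encode_bmp_subset` are made up for illustration, and the behavior on rejection (raising an error) is my assumption, not part of any existing specification.

```python
def is_bmp_only(text):
    """True if every character is in the BMP and none is a surrogate."""
    return all(
        ord(c) <= 0xFFFF and not (0xD800 <= ord(c) <= 0xDFFF)
        for c in text
    )


def encode_bmp_subset(text):
    """Encode BMP-only text; the result is valid UTF-8 using 1 to 3 bytes per character."""
    if not is_bmp_only(text):
        raise ValueError("text contains surrogates or non-BMP characters")
    return text.encode("utf-8")


print(encode_bmp_subset("A\u00e9\u20ac").hex())
print(is_bmp_only("\U00010400"))
```

Because the surrogate range is excluded outright, anything this encoder emits is also legal UTF-8, so a receiver needs no special handling. The label only tells it that the sender never produces 4-byte sequences.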
Carl