Steven T. Hatton wrote:
On Thursday 09 March 2006 01:22, David Bertoni wrote:

That would require that C++ define some integral character type that is
encoded in UTF-16.  It's unlikely that every compiler vendor would agree
to do that, although it would certainly make implementing software that
supports Unicode much easier.

After looking at things more closely, the Standard does - in its typical lawyerly language - require that an implementation behave 'as if' it supported UTF-16 for all the locales it supports.


I don't see how you can get this from the standard. There is only one mention of Unicode, and UTF-16 does not appear anywhere. The only thing I see is a statement about ISO/IEC 10646 and the universal-character-name construct.

XMLCh is defined to hold UTF-16 code units, which is a much stricter
requirement than anything the C++ standard says about character sets.

The C++ Standard only specifies character sets. It does not specify encodings.


That's exactly my point. And that's why you can't assume that char is encoded in ASCII and wchar_t is encoded in UTF-16. However, Xerces-C guarantees that XMLCh will contain UTF-16 code units.

In order to implement the C++ extended character set, members
of the C++ basic character set (ASCII character set) should be defined as
wchar_t using their wide character literals.  That is, for example:

typedef wchar_t XMLCh;

const XMLCh chLatin_A               = L'A';
const XMLCh chLatin_B               = L'B';
const XMLCh chLatin_C               = L'C';
const XMLCh chLatin_D               = L'D';

Rather than:

typedef unsigned short XMLCh;

const XMLCh chLatin_A               = 0x41;
const XMLCh chLatin_B               = 0x42;
const XMLCh chLatin_C               = 0x43;
const XMLCh chLatin_D               = 0x44;

You are making the assumption that the basic character set must be
encoded in ASCII, but the C++ standard makes no such requirement.

No. That is exactly what I am not assuming. The example I show above will use whatever encoding my implementation uses for the characters assigned to the XMLCh constants. As long as my implementation supports the character set specified in UTF-16 (actually UCS-2), Xerces should work using those assignments.


Yes, but that's not very portable. Perhaps you don't support platforms that do not meet this requirement, but Xerces-C does. By the way, UCS-2 support is not good enough for Xerces-C, because XML documents can contain Unicode characters outside the BMP, which are represented as surrogate pairs.
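
To make the surrogate-pair point concrete, here is a minimal sketch of how a code point outside the BMP becomes two 16-bit code units. It is an illustration only, not Xerces-C code: Utf16Unit and appendUtf16 are hypothetical names, and the function assumes its input is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a 16-bit code unit type (not the Xerces-C definition).
typedef std::uint16_t Utf16Unit;

// Append the UTF-16 encoding of one Unicode scalar value to `out`.
// Assumption: cp is at most 0x10FFFF and is not itself a surrogate.
inline void appendUtf16(std::uint32_t cp, std::vector<Utf16Unit>& out)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        // BMP character: one code unit, identical to its UCS-2 value.
        out.push_back(static_cast<Utf16Unit>(cp));
    } else {
        // Supplementary character: split the 20-bit offset into two halves.
        const std::uint32_t v = cp - 0x10000;
        out.push_back(static_cast<Utf16Unit>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<Utf16Unit>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
}
```

A UCS-2-only implementation has no way to represent the second case, which is why UCS-2 support alone falls short for XML.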

There may be reasons the Xerces developers chose to implement UTF-16
without conforming to the requirements for implementing the C++ extended
character set.  I guess, technically speaking, the encoding of UTF-16 and
the extended character set will not, in general, coincide.

I'm not sure I understand what you're saying.  Xerces-C encodes
character data in UTF-16, and to do that, it uses a 16-bit integral type. It
cannot use wchar_t to hold UTF-16 code units, because there is no
guarantee that a particular C++ implementation will encode wchar_t in
UTF-16.  In fact, there is no requirement that wchar_t even be a 16-bit
integral type.

It must be wide enough to encode all the UTF-16 characters of the extended character sets required by the implementation's supported locales. wchar_t shall have the same size, signedness and alignment requirements as one of the other integral data types. Can you give an example of a C++ implementation that doesn't use a 16-bit (or larger) data type for wchar_t?

Why would Xerces-C choose an integral type that's larger than 16 bits for its UTF-16 character integral? If wchar_t is a 32-bit integral, then half of all storage allocated for a UTF-16 string would be wasted. Also, Unicode conformance requires that UTF-16 strings use 16-bit code units.
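
The storage argument can be checked on any given compiler with a small sketch like the one below. The helper names storageBytes and isExact16BitUnsigned are made up for this illustration; the result for wchar_t depends entirely on the implementation, so no particular output is claimed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Bytes needed to store `units` code units when each one is held in type Unit.
template <typename Unit>
std::size_t storageBytes(std::size_t units)
{
    // Guard against overflow in the multiplication below.
    assert(units <= std::numeric_limits<std::size_t>::max() / sizeof(Unit));
    return units * sizeof(Unit);
}

// True when Unit is an unsigned integral type with exactly 16 value bits,
// i.e. a natural fit for one UTF-16 code unit.
template <typename Unit>
bool isExact16BitUnsigned()
{
    return std::numeric_limits<Unit>::is_integer
        && !std::numeric_limits<Unit>::is_signed
        && std::numeric_limits<Unit>::digits == 16;
}
```

Comparing storageBytes<wchar_t>(n) with storageBytes<std::uint16_t>(n) shows how much larger a buffer of the same n code units would be if wchar_t were used; on an implementation with a 32-bit wchar_t, it is twice the size.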

In addition, users would assume they could call the wide character string system functions and expect reasonable results. That wouldn't happen if the system and/or current locale didn't support UTF-16.

Dave
