On Wednesday 08 March 2006 02:18, Scott Cantor wrote:
> > IIRC, there /are/ different UTF encodings, even within UTF-16.
> > There is something called UCS-4, and also something called UCS-2 (I
> > believe). I do not know the difference between these and their related
> > UTF-32 and UTF-16.
>
> Nor I, but that's what I had in mind when I expressed caution.
To my mind, the failure to specify a UTF-16 string class is one of the worst aspects of C++. After reading the applicable sections of ISO/IEC 14882:2003, I have come to the conclusion that the Xerces XMLCh type is not defined in such a way as to conform to the definition of a C++ implementation's extended character set.

In order to implement the C++ extended character set, members of the C++ basic character set (the ASCII characters) should be defined as wchar_t using their wide-character literals. That is, for example:

    typedef wchar_t XMLCh;

    const XMLCh chLatin_A = L'A';
    const XMLCh chLatin_B = L'B';
    const XMLCh chLatin_C = L'C';
    const XMLCh chLatin_D = L'D';

Rather than:

    typedef unsigned short XMLCh;

    const XMLCh chLatin_A = 0x41;
    const XMLCh chLatin_B = 0x42;
    const XMLCh chLatin_C = 0x43;
    const XMLCh chLatin_D = 0x44;

There may be reasons the Xerces developers chose to implement UTF-16 without conforming to the requirements for implementing the C++ extended character set. Technically speaking, the UTF-16 encoding and the extended character set will not, in general, coincide: there is no requirement that the basic character set be encoded using ASCII values, in which case the numerical value of chLatin_A would not be the same in all implementations. Nonetheless (IMO), properly written code should not rely on such implementation details.

Steven
