On Thursday 09 March 2006 01:22, David Bertoni wrote:

> That would require that C++ define some integral character type that is
> encoded in UTF-16. It's unlikely that every compiler vendor would agree
> to do that, although it would certainly make implementing software that
> supports Unicode much easier.
After looking at things more closely, the Standard does - in its typical
lawyerly language - require that an implementation behave 'as if' it
supported UTF-16 for all the locales it supports.

> XMLCh is defined to hold UTF-16 code units, which is a much stricter
> requirement than anything the C++ standard says about character sets.

The C++ Standard only specifies character sets. It does not specify
encodings.

> > In order to implement the C++ extended character set, members of the
> > C++ basic character set (the ASCII character set) should be defined as
> > wchar_t using their wide character literals. That is, for example:
> >
> > typedef wchar_t XMLCh;
> >
> > const XMLCh chLatin_A = L'A';
> > const XMLCh chLatin_B = L'B';
> > const XMLCh chLatin_C = L'C';
> > const XMLCh chLatin_D = L'D';
> >
> > Rather than:
> >
> > typedef unsigned short XMLCh;
> >
> > const XMLCh chLatin_A = 0x41;
> > const XMLCh chLatin_B = 0x42;
> > const XMLCh chLatin_C = 0x43;
> > const XMLCh chLatin_D = 0x44;
>
> You are making the assumption that the basic character set must be
> encoded in ASCII, but the C++ standard makes no such requirement.

No. That is exactly what I am not assuming. The example I show above will
use whatever encoding my implementation uses for the characters assigned
to the XMLCh constants. As long as my implementation supports the
character set specified by UTF-16 (actually UCS-2), Xerces should work
using those assignments.

> > There may be reasons the Xerces developers chose to implement UTF-16
> > without conforming to the requirements for implementing the C++
> > extended character set. I guess, technically speaking, the encoding
> > of UTF-16 and the extended character set will not, in general,
> > coincide.
>
> I'm not sure I understand what you're saying. Xerces-C encodes
> character data in UTF-16, and to do that, it uses a 16-bit integral.
> It cannot use wchar_t to hold UTF-16 code units, because there is no
> guarantee that a particular C++ implementation will encode wchar_t in
> UTF-16. In fact, there is no requirement that wchar_t even be a 16-bit
> integral.

It must be wide enough to encode all the UTF-16 characters of the
extended character sets required by the implementation's supported
locales. wchar_t shall have the same size, signedness, and alignment
requirements as one of the other integral types. Can you give an example
of a C++ implementation that doesn't use a 16-bit (or larger) data type
for wchar_t?

> Well, I would hope an ASCII character would be encoded in ASCII. ;-)
> Perhaps what you really meant was that there is no requirement that the
> basic character set be encoded in ASCII.

The ASCII character set is the collection of alphabetical and punctuation
symbols encoded by ASCII.

Steven
