Re: How do I use Xerces strings?

David Bertoni Thu, 09 Mar 2006 09:09:19 -0800

Steven T. Hatton wrote:

On Thursday 09 March 2006 01:22, David Bertoni wrote:
That would require that C++ define some integral character type that is
encoded in UTF-16.  It's unlikely that every compiler vendor would agree
to do that, although it would certainly make implementing software that
supports Unicode much easier.
After looking at things more closely, the Standard does - in its typical
lawyerly language - require that an implementation behave 'as if' it
supported UTF-16 for all the locales it supports.

I don't see how you can get this from the standard. There is only one
mention of Unicode, and UTF-16 does not appear anywhere. The only thing
I see is a statement about ISO/IEC 10646 and the
universal-character-name construct.

XMLCh is defined to hold UTF-16 code units, which is a much stricter
requirement than anything the C++ standard says about character sets.
The C++ Standard only specifies character sets. It does not specify
encodings.

That's exactly my point. And that's why you can't assume that char is
encoded in ASCII and wchar_t is encoded in UTF-16. However, Xerces-C
guarantees that XMLCh will contain UTF-16 code units.
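
As a rough illustration of the distinction drawn above, the sketch below contrasts a numeric initializer with a wide character literal. The names CodeUnit16, kLatinA_numeric, kLatinA_literal and wideLiteralsLookLikeUnicode are hypothetical and chosen only for this example; they are not part of Xerces-C.

```cpp
#include <cstdint>

// Hypothetical 16-bit code unit type for illustration only; this is not
// the actual Xerces-C definition of XMLCh.
typedef std::uint16_t CodeUnit16;

// Numeric initializer: always the value of U+0041, whatever execution
// character set the compiler uses.
const CodeUnit16 kLatinA_numeric = 0x41;

// Wide literal: the stored value is implementation-defined.
const wchar_t kLatinA_literal = L'A';

// Returns true only on implementations whose wide literals happen to use
// Unicode code point values for these three letters.
bool wideLiteralsLookLikeUnicode()
{
    return kLatinA_literal == 0x41 && L'Z' == 0x5A && L'a' == 0x61;
}
```

On many common platforms the function returns true, but the point of the exchange is that nothing in the C++ standard requires it to, whereas the numeric constant has the same value everywhere.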

In order to implement the C++ extended character set, members
of the C++ basic character set (ASCII character set) should be defined as
wchar_t using their wide character literals.  That is, for example:

typedef wchar_t XMLCh;

const XMLCh chLatin_A               = L'A';
const XMLCh chLatin_B               = L'B';
const XMLCh chLatin_C               = L'C';
const XMLCh chLatin_D               = L'D';

Rather than:

typedef unsigned short XMLCh;

const XMLCh chLatin_A               = 0x41;
const XMLCh chLatin_B               = 0x42;
const XMLCh chLatin_C               = 0x43;
const XMLCh chLatin_D               = 0x44;

You are making the assumption that the basic character set must be
encoded in ASCII, but the C++ standard makes no such requirement.

No. That is exactly what I am not assuming. The example I show above will
use whatever encoding my implementation uses for the characters assigned to
the XMLCh constants. As long as my implementation supports the character set
specified in UTF-16 (actually UCS-2) Xerces should work using those
assignments.

Yes, but that's not very portable. Perhaps you don't support platforms
that do not meet this requirement, but Xerces-C does. By the way, UCS-2
support is not good enough for Xerces-C, because XML documents can
contain Unicode characters outside the BMP, which are represented as
surrogate pairs.
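
To make the surrogate pair point concrete, here is a minimal sketch of how a code point outside the BMP becomes two 16-bit code units. CodeUnit16 and appendUtf16 are hypothetical names used only for this illustration, not Xerces-C APIs.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical 16-bit code unit type; not the Xerces-C definition of XMLCh.
typedef std::uint16_t CodeUnit16;

// Append the UTF-16 encoding of one Unicode code point to `out`.
// Returns false (and appends nothing) if `cp` is not a Unicode scalar
// value, i.e. it is above U+10FFFF or lies in the surrogate range.
bool appendUtf16(std::uint32_t cp, std::vector<CodeUnit16>& out)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;

    if (cp < 0x10000) {
        // BMP character: a single code unit. This is all UCS-2 can hold.
        out.push_back(static_cast<CodeUnit16>(cp));
        return true;
    }

    // Supplementary character: split the 20-bit offset into two halves.
    cp -= 0x10000;
    out.push_back(static_cast<CodeUnit16>(0xD800 + (cp >> 10)));   // high surrogate
    out.push_back(static_cast<CodeUnit16>(0xDC00 + (cp & 0x3FF))); // low surrogate
    return true;
}
```

For example, U+1D11E (MUSICAL SYMBOL G CLEF) is encoded as the two code units 0xD834 0xDD1E, which a UCS-2-only implementation cannot represent as a single character.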

There may be reasons the Xerces developers chose to implement UTF-16
without conforming to the requirements for implementing the C++ extended
character set.  I guess, technically speaking, the encoding of UTF-16 and
the extended character set will not, in general, coincide.
I'm not sure I understand what you're saying.  Xerces-C encodes
character data in UTF-16, and to do that, it uses a 16-bit integral. It
cannot use wchar_t to hold UTF-16 code units, because there is no
guarantee that a particular C++ implementation will encode wchar_t in
UTF-16.  In fact, there is no requirement that wchar_t even be a 16-bit
integral.
It must be wide enough to encode all the UTF-16 characters of the extended
character sets required by the implementation's supported locales. wchar_t
shall have the same size, signedness and alignment requirements as one of the
other integral data types. Can you give an example of a C++ implementation
that doesn't use a 16 bit (or larger) data type for wchar_t?

Why would Xerces-C choose an integral type that's larger than 16 bits
for its UTF-16 character integral? If wchar_t is a 32-bit integral,
then half of all storage allocated for a UTF-16 string would be wasted.
Also, Unicode conformance requires that UTF-16 strings use 16-bit
code units.

In addition, users would assume they could call the wide character
string system functions and expect reasonable results. That wouldn't
happen if the system and/or current locale didn't support UTF-16.
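
The storage and library-function points can be sketched as follows. This assumes a hypothetical 16-bit code unit type, CodeUnit16, with hypothetical helpers length16 and bytesFor; none of these are Xerces-C names.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 16-bit code unit type; not the Xerces-C definition of XMLCh.
typedef std::uint16_t CodeUnit16;

// Count the code units before the terminating zero. A string of 16-bit
// units needs its own routine like this one, because wcslen() operates
// on wchar_t, which may be a different size.
std::size_t length16(const CodeUnit16* s)
{
    std::size_t n = 0;
    while (s[n] != 0)
        ++n;
    return n;
}

// Bytes needed to hold `units` code units plus a terminator when each
// code unit is stored in an element of type T.
template <typename T>
std::size_t bytesFor(std::size_t units)
{
    return (units + 1) * sizeof(T);
}
```

Comparing bytesFor<CodeUnit16>(n) with bytesFor<wchar_t>(n) on a given platform shows the cost: where wchar_t is 32 bits wide, the second figure is twice the first for the same UTF-16 data.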


Dave
