Re: How do I use Xerces strings?

David Bertoni Thu, 09 Mar 2006 11:43:20 -0800

Steven T. Hatton wrote:

On Thursday 09 March 2006 12:08, David Bertoni wrote:
I don't see how you can get this from the standard.  There is only one
mention of Unicode, and UTF-16 does not appear anywhere.  The only thing
I see is a statement about ISO/IEC 10646 and the
universal-character-name construct.
<quote url="http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-5DFED1F0">
The UTF-16 encoding was chosen because of its widespread industry practice. Note that for both HTML and XML, the document character set (and therefore the notation of numeric character references) is based on UCS [ISO/IEC 10646]. A single numeric character reference in a source document may therefore in some cases correspond to two 16-bit units in a DOMString (a high surrogate and a low surrogate).
</quote>
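
To make the "two 16-bit units" point concrete, here is a minimal sketch of how one code point maps to UTF-16 code units. The function name encodeUtf16 is hypothetical; it is not part of Xerces-C or the DOM API, and std::uint16_t merely stands in for a 16-bit code unit type such as XMLCh.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a Xerces-C function): encode one Unicode
// code point as one or two UTF-16 code units.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp)
{
    // Only Unicode scalar values are accepted: at most U+10FFFF and
    // not inside the surrogate range itself.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Inside the BMP: a single code unit holds the value directly.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Outside the BMP: split the 20-bit offset into two 10-bit halves.
        const std::uint32_t v = cp - 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));   // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return units;
}
```

With this sketch, a numeric character reference such as &#x1D11E; would end up as the two code units 0xD834 and 0xDD1E, while &#x41; stays a single unit.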

<quote url="http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets">
[Definition: A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors MUST accept any character in the range specified for Char.]
</quote>

It's not my fault! ;)
XMLCh is defined to hold UTF-16 code units, which is a much stricter
requirement than anything the C++ standard says about character sets.
The C++ Standard only specifies character sets.  It does not specify
encodings.
That's exactly my point.  And that's why you can't assume that char is
encoded in ASCII and wchar_t is encoded in UTF-16.  However, Xerces-C
guarantees that XMLCh will contain UTF-16 code units.
After further investigation and reflection I have come to the conclusion that you're damned if you do, and damned if you don't. You could convert all your data to the implementation's character encoding when it's read in, and do the reverse when it is stored or transmitted. Under some circumstances, UTF-32 might provide some performance advantages. It would certainly make your data compatible with the facilities provided by the Standard Library.

I'm not sure what the performance advantages of UTF-32 would be over UTF-16, unless you are referring to the handling of surrogate pairs. I would imagine that the disadvantage of up to 16 bits of wasted storage overhead would overwhelm the advantage gained from avoiding surrogates.
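
As a rough illustration of that storage trade-off, the following sketch counts the bytes needed for the same code points in each encoding. The helper names utf16Bytes and utf32Bytes are hypothetical, and the sketch assumes the input holds valid code points only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: bytes needed to store the code points as UTF-16.
// BMP code points take one 16-bit unit, all others take a surrogate pair.
std::size_t utf16Bytes(const std::vector<std::uint32_t>& codePoints)
{
    std::size_t units = 0;
    for (std::uint32_t cp : codePoints)
        units += (cp < 0x10000) ? 1 : 2;
    return units * 2;
}

// Hypothetical helper: bytes needed to store the code points as UTF-32.
// Every code point takes one 32-bit unit, regardless of its value.
std::size_t utf32Bytes(const std::vector<std::uint32_t>& codePoints)
{
    return codePoints.size() * 4;
}
```

For text that stays inside the BMP, UTF-32 doubles the storage; the two encodings only break even on characters that need surrogate pairs.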

Also, can you explain why you believe UTF-32 would provide better compatibility with the facilities provided by the standard library? On Windows, this is certainly not the case. It might provide better compatibility on some platforms operating with a locale that encodes wchar_t in UTF-32, but that's not very portable.

I suspect most Xerces derived applications will need to do some kind of codec for I/O. I know I don't want UTF-16 data stored in files I am likely to want to edit, or otherwise manipulate outside of Xerces. If everybody played nicely with UTF-16 that would be a different story.

Representing the full range of Unicode characters is difficult, no matter how you encode them. I know lots of applications that play nicely with UTF-16. In other cases, UTF-8 is a better choice, since it maintains better compatibility with applications that expect ASCII data.

Some applications that use Xerces-C to parse XML files eventually re-serialize the data to some encoding they prefer, perhaps even to the original encoding.

No.  That is exactly what I am not assuming.  The example I show above
will use whatever encoding my implementation uses for the characters
assigned to the XMLCh constants.  As long as my implementation supports
the character set specified in UTF-16 (actually UCS-2) Xerces should work
using those assignments.

Yes, but that's not very portable.  Perhaps you don't support platforms
that do not meet this requirement, but Xerces-C does.  By the way, UCS-2
support is not good enough for Xerces-C, because XML documents can
contain Unicode characters outside the BMP, which are represented as
surrogate pairs.
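
To show why a UCS-2 view falls short, here is a minimal decoding sketch that pairs up surrogates. The name decodeUtf16 is hypothetical (it is not a Xerces-C function), and mapping an unpaired surrogate to U+FFFD is just one possible policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

inline bool isHighSurrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool isLowSurrogate(std::uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: turn UTF-16 code units into code points.
std::vector<std::uint32_t> decodeUtf16(const std::vector<std::uint16_t>& units)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        const std::uint16_t u = units[i];
        if (isHighSurrogate(u) && i + 1 < units.size() && isLowSurrogate(units[i + 1])) {
            // A valid pair: recombine the two 10-bit halves.
            const std::uint32_t hi = u - 0xD800u;
            const std::uint32_t lo = units[i + 1] - 0xDC00u;
            out.push_back(0x10000u + (hi << 10) + lo);
            ++i;  // the low surrogate has been consumed as well
        } else if (isHighSurrogate(u) || isLowSurrogate(u)) {
            // Unpaired surrogate: substitute U+FFFD (a policy choice).
            out.push_back(0xFFFD);
        } else {
            // Ordinary BMP code unit.
            out.push_back(u);
        }
    }
    return out;
}
```

Code that treats each 16-bit unit as a whole character would see three "characters" in the sequence 0x0041, 0xD834, 0xDD1E, where this sketch yields two code points.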

Yes, I see that now. I believe a conforming C++ implementation is required to do the same (for the locales it supports).

Why would Xerces-C choose an integral type that's larger than 16 bits
for its UTF-16 character integral?  If wchar_t is a 32-bit integral,
then half of all storage allocated for a UTF-16 string would be wasted.


Agreed.

    Also, Unicode conformance requires that UTF-16 strings use 16-bit
code units.

Well, all that UTF-16 support actually requires is that it's UTF-16 going in, and UTF-16 coming out.

I'm not sure what you mean by this, but my reading of the Unicode standard says that UTF-16 sequences are composed of UTF-16 code units, and a UTF-16 code unit is defined as a 16-bit unit of storage. So it would not be conformant to use a 32-bit unit of storage for a UTF-16 code unit in the APIs.

In addition, users would assume they could call the wide character
string system functions and expect reasonable results.  That wouldn't
happen if the system and/or current locale didn't support UTF-16.
Well, you could use UTF-32 internally, but that puts us back to the subject of 50% unused primary storage. One option might be to have my C++ implementation (GCC) explicitly support UTF-16, and then have Xerces compile with a flag to use it.

You can certainly do that, as long as you can limit the platforms you support to those where you can rely on the compiler and run-time library to support UTF-16.
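
A compile-time switch of that kind might look like the sketch below. The macro USE_WCHAR_AS_UTF16 and the typedef name Utf16Unit are hypothetical; they are not actual Xerces-C or GCC options. The check only confirms that wchar_t is narrow enough, not that the run-time library actually treats wchar_t data as UTF-16, so it assumes more than it can verify.

```cpp
#include <climits>
#include <cstdint>
#include <cwchar>

// Hypothetical build flag: use wchar_t as the UTF-16 code unit type, but
// only where wchar_t is no wider than 16 bits (WCHAR_MAX is usable in #if).
#if defined(USE_WCHAR_AS_UTF16) && (WCHAR_MAX <= 0xFFFF)
typedef wchar_t Utf16Unit;
#else
// Fallback: a plain 16-bit unsigned integer, independent of wchar_t.
typedef std::uint16_t Utf16Unit;
#endif

// Either way, a code unit must be exactly 16 bits wide.
static_assert(sizeof(Utf16Unit) * CHAR_BIT == 16,
              "UTF-16 code units must be exactly 16 bits wide");
```

On a platform with a 32-bit wchar_t the flag is silently ignored and the fallback type is used, which is the portability limit described above.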

C++ is, in many ways, a better language than Java. UTF support is not one of them. Yes! I'm frustrated!


I agree.  It would be very helpful if the next C++ standard defined:

1. A unique 16-bit integral for UTF-16 code units.
2. Support in the library for std::basic_string instantiated with that type.
3. Some lexical construct at the source code level for character literals and character string literals that produce characters and strings encoded in UTF-16.
4. Run-time library support for arrays of this type, providing full support for Unicode.


But I suspect that's just a dream.

Dave

