Steven T. Hatton wrote:
On Thursday 09 March 2006 12:08, David Bertoni wrote:

I don't see how you can get this from the standard.  There is only one
mention of Unicode, and UTF-16 does not appear anywhere.  The only thing
I see is a statement about ISO/IEC 10646 and the
universal-character-name construct.
<quote url="http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-5DFED1F0">
The UTF-16 encoding was chosen because of its widespread industry practice. Note that for both HTML and XML, the document character set (and therefore the notation of numeric character references) is based on UCS [ISO/IEC 10646]. A single numeric character reference in a source document may therefore in some cases correspond to two 16-bit units in a DOMString (a high surrogate and a low surrogate).
</quote>

<quote url="http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets">
[Definition: A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors MUST accept any character in the range specified for Char.]
</quote>

It's not my fault! ;)

XMLCh is defined to hold UTF-16 code units, which is a much stricter
requirement than anything the C++ standard says about character sets.
The C++ Standard only specifies character sets.  It does not specify
encodings.
That's exactly my point.  And that's why you can't assume that char is
encoded in ASCII and wchar_t is encoded in UTF-16.  However, Xerces-C
guarantees that XMLCh will contain UTF-16 code units.

After further investigation and reflection, I have come to the conclusion that you're damned if you do and damned if you don't. You could convert all your data to the implementation's character encoding when it's read in, and do the reverse when it is stored or transmitted. Under some circumstances, UTF-32 might provide some performance advantages. It would certainly make your data compatible with the facilities provided by the Standard Library.


I'm not sure what the performance advantages of UTF-32 would be over UTF-16, unless you are referring to the handling of surrogate pairs. I would imagine that the disadvantage of up to 16 bits of wasted storage overhead would overwhelm the advantage gained from avoiding surrogates.
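To put rough numbers on that trade-off, here is a minimal sketch that counts the bytes needed to store the same sequence of code points in UTF-16 and in UTF-32. The function names are made up for illustration and are not part of Xerces-C; the input is assumed to contain only valid Unicode scalar values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of 16-bit code units UTF-16 needs for one code point.
// Assumes cp is a valid Unicode scalar value; no validation is done.
inline std::size_t utf16_units(std::uint32_t cp) {
    return cp >= 0x10000 ? 2 : 1;  // outside the BMP: surrogate pair
}

// Bytes needed to store the text as UTF-16 (2 bytes per code unit).
inline std::size_t utf16_bytes(const std::vector<std::uint32_t>& text) {
    std::size_t units = 0;
    for (std::uint32_t cp : text) {
        units += utf16_units(cp);
    }
    return units * 2;
}

// Bytes needed to store the text as UTF-32 (always 4 bytes per code point).
inline std::size_t utf32_bytes(const std::vector<std::uint32_t>& text) {
    return text.size() * 4;
}
```

For text that stays inside the BMP, UTF-32 takes exactly twice the space of UTF-16; the two only break even when every character needs a surrogate pair.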

Also, can you explain why you believe UTF-32 would provide better compatibility with the facilities provided by the standard library? On Windows, this is certainly not the case. It might provide better compatibility on some platforms operating with a locale that encodes wchar_t in UTF-32, but that's not very portable.

I suspect most Xerces-derived applications will need to do some kind of transcoding on I/O. I know I don't want UTF-16 data stored in files I am likely to want to edit, or otherwise manipulate outside of Xerces. If everybody played nicely with UTF-16, that would be a different story.


Representing the full range of Unicode characters is difficult, no matter how you encode them. I know lots of applications that play nicely with UTF-16. In other cases, UTF-8 is a better choice, since it maintains better compatibility with applications that expect ASCII data.

Some applications that use Xerces-C to parse XML files eventually re-serialize the data to some encoding they prefer, perhaps even to the original encoding.

No.  That is exactly what I am not assuming.  The example I show above
will use whatever encoding my implementation uses for the characters
assigned to the XMLCh constants.  As long as my implementation supports
the character set specified in UTF-16 (actually UCS-2) Xerces should work
using those assignments.
Yes, but that's not very portable.  Perhaps you don't support platforms
that do not meet this requirement, but Xerces-C does.  By the way, UCS-2
support is not good enough for Xerces-C, because XML documents can
contain Unicode characters outside the BMP, which are represented as
surrogate pairs.
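As an illustration of why UCS-2 falls short, the following sketch shows the standard UTF-16 arithmetic for a code point outside the BMP. The CodeUnit typedef is a hypothetical stand-in for a 16-bit code unit type, and the helper names are invented for this example; none of this is Xerces-C code.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical stand-in for a 16-bit UTF-16 code unit type.
typedef std::uint16_t CodeUnit;

// True if the code point lies outside the BMP and so cannot be
// represented by a single 16-bit unit.
inline bool needs_surrogates(std::uint32_t cp) {
    return cp > 0xFFFF;
}

// Split a supplementary code point (U+10000..U+10FFFF) into a
// (high surrogate, low surrogate) pair. The caller must check
// needs_surrogates() first; no validation is done here.
inline std::pair<CodeUnit, CodeUnit> to_surrogates(std::uint32_t cp) {
    const std::uint32_t v = cp - 0x10000;  // 20-bit value
    return std::make_pair(
        static_cast<CodeUnit>(0xD800 + (v >> 10)),     // top 10 bits
        static_cast<CodeUnit>(0xDC00 + (v & 0x3FF)));  // low 10 bits
}

// Recombine a surrogate pair into the original code point.
inline std::uint32_t from_surrogates(CodeUnit high, CodeUnit low) {
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

For example, U+1D11E (MUSICAL SYMBOL G CLEF) becomes the two code units 0xD834 0xDD1E, which a UCS-2-only implementation would treat as two unrelated 16-bit values.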

Yes, I see that now. I believe a conforming C++ implementation is required to do the same (for the locales it supports.)

Why would Xerces-C choose an integral type that's larger than 16 bits
for its UTF-16 character integral?  If wchar_t is a 32-bit integral,
then half of all storage allocated for a UTF-16 string would be wasted.

Agreed.

    Also, Unicode conformance requires that UTF-16 strings use 16-bit
code units.

Well, all that UTF-16 support actually requires is that it's UTF-16 going in, and UTF-16 coming out.

I'm not sure what you mean by this, but my reading of the Unicode standard says that UTF-16 sequences are composed of UTF-16 code units, and a UTF-16 code unit is defined as a 16-bit unit of storage. So it would not be conformant to use a 32-bit unit of storage for a UTF-16 code unit in the APIs.


In addition, users would assume they could call the wide character
string system functions and expect reasonable results.  That wouldn't
happen if the system and/or current locale didn't support UTF-16.

Well, you could use UTF-32 internally, but that puts us back to the subject of 50% unused primary storage. One option might be to have my C++ implementation (GCC) explicitly support UTF-16, and then have Xerces compile with a flag to use it.


You can certainly do that, as long as you can limit the platforms you support to those where you can rely on the compiler and run-time library to support UTF-16.

C++ is, in many ways, a better language than Java. UTF support is not one of them. Yes! I'm frustrated!


I agree.  It would be very helpful if the next C++ standard defined:

1. A unique 16-bit integral for UTF-16 code units.
2. Support in the library for std::basic_string instantiated with that type.
3. Some lexical construct at the source code level for character literals and character string literals that produce characters and strings encoded in UTF-16.
4. Run-time library support for arrays of this type, providing full support for Unicode.

But I suspect that's just a dream.
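
For readers coming to this thread later: the first three items did arrive with C++11, as char16_t, std::u16string, and the u'...' and u"..." literal prefixes. The sketch below assumes a C++11 or later compiler and only illustrates those three pieces; it does not claim to cover item 4, and sample_text is a made-up name.

```cpp
#include <climits>
#include <string>

// Item 1: char16_t is a distinct built-in type, at least 16 bits wide.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16,
              "char16_t must hold a UTF-16 code unit");

// Items 2 and 3: std::u16string is std::basic_string<char16_t>, and the
// u"" prefix produces a UTF-16 encoded string literal. A character outside
// the BMP, written as a universal-character-name, becomes a surrogate pair.
inline std::u16string sample_text() {
    return u"A\U0001D11E";
}
```

Here sample_text() holds three code units, not two characters' worth of 32-bit storage: 'A' followed by the surrogate pair 0xD834 0xDD1E.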

Dave

