Steven T. Hatton wrote:
On Thursday 09 March 2006 12:08, David Bertoni wrote:
I don't see how you can get this from the standard. There is only one
mention of Unicode, and UTF-16 does not appear anywhere. The only thing
I see is a statement about ISO/IEC 10646 and the
universal-character-name construct.
<quote
url="http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-5DFED1F0">
The UTF-16 encoding was chosen because of its widespread industry practice.
Note that for both HTML and XML, the document character set (and therefore
the notation of numeric character references) is based on UCS [ISO/IEC
10646]. A single numeric character reference in a source document may
therefore in some cases correspond to two 16-bit units in a DOMString (a high
surrogate and a low surrogate).
</quote>
<quote url="http://www.w3.org/TR/2004/REC-xml11-20040204/#charsets">
[Definition: A character is an atomic unit of text as specified by ISO/IEC
10646 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed,
and the legal characters of Unicode and ISO/IEC 10646. The versions of these
standards cited in A.1 Normative References were current at the time this
document was prepared. New characters may be added to these standards by
amendments or new editions. Consequently, XML processors MUST accept any
character in the range specified for Char.]
</quote>
It's not my fault! ;)
XMLCh is defined to hold UTF-16 code units, which is a much stricter
requirement than anything the C++ standard says about character sets.
The C++ Standard only specifies character sets. It does not specify
encodings.
That's exactly my point. And that's why you can't assume that char is
encoded in ASCII and wchar_t is encoded in UTF-16. However, Xerces-C
guarantees that XMLCh will contain UTF-16 code units.
After further investigation and reflection I have come to the conclusion that
you're damned if you do, and damned if you don't. You could convert all your
data to the implementation's character encoding when it's read in, and do the
reverse when it is stored or transmitted. Under some circumstances, UTF-32
might provide some performance advantages. It would certainly make your
data compatible with the facilities provided by the Standard Library.
I'm not sure what the performance advantages of UTF-32 would be over
UTF-16, unless you are referring to the handling of surrogate pairs. I
would imagine that the disadvantage of up to 16 bits of wasted storage
overhead would overwhelm the advantage gained from avoiding surrogates.
Also, can you explain why you believe UTF-32 would provide better
compatibility with the facilities provided by the standard library? On
Windows, this is certainly not the case. It might provide better
compatibility on some platforms operating with a locale that encodes
wchar_t in UTF-32, but that's not very portable.
I suspect most Xerces-derived applications will need to do some kind of
transcoding on I/O. I know I don't want UTF-16 data stored in files I am
likely to want to edit, or otherwise manipulate outside of Xerces. If
everybody played nicely with UTF-16, that would be a different story.
Representing the full range of Unicode characters is difficult, no
matter how you encode them. I know lots of applications that play
nicely with UTF-16. In other cases, UTF-8 is a better choice, since it
maintains better compatibility with applications that expect ASCII data.
Some applications that use Xerces-C to parse XML files eventually
re-serialize the data to some encoding they prefer, perhaps even to the
original encoding.
No. That is exactly what I am not assuming. The example I show above
will use whatever encoding my implementation uses for the characters
assigned to the XMLCh constants. As long as my implementation supports
the character set specified by UTF-16 (actually UCS-2), Xerces should work
using those assignments.
Yes, but that's not very portable. Perhaps you don't support platforms
that do not meet this requirement, but Xerces-C does. By the way, UCS-2
support is not good enough for Xerces-C, because XML documents can
contain Unicode characters outside the BMP, which are represented as
surrogate pairs.
Yes, I see that now. I believe a conforming C++ implementation is required to
do the same (for the locales it supports.)
Why would Xerces-C choose an integral type that's larger than 16 bits
for its UTF-16 character integral? If wchar_t is a 32-bit integral,
then half of all storage allocated for a UTF-16 string would be wasted.
Agreed.
Also, Unicode conformance requires that UTF-16 strings use 16-bit
code units.
Well, all that UTF-16 support actually requires is that it's UTF-16 going in,
and UTF-16 coming out.
I'm not sure what you mean by this, but my reading of the Unicode
standard says that UTF-16 sequences are composed of UTF-16 code units,
and a UTF-16 code unit is defined as a 16-bit unit of storage. So it
would not be conformant to use a 32-bit unit of storage for a UTF-16
code unit in the APIs.
In addition, users would assume they could call the wide character
string system functions and expect reasonable results. That wouldn't
happen if the system and/or current locale didn't support UTF-16.
Well, you could use UTF-32 internally, but that puts us back to the
subject of 50% unused primary storage. One option might be to have my C++
implementation (GCC) explicitly support UTF-16, and then have Xerces compile
with a flag to use it.
You can certainly do that, as long as you can limit the platforms you
support to those where you can rely on the compiler and run-time library
to support UTF-16.
C++ is, in many ways, a better language than Java. UTF support is not one of
them. Yes! I'm frustrated!
I agree. It would be very helpful if the next C++ standard defined:
1. A unique 16-bit integral for UTF-16 code units.
2. Support in the library for std::basic_string instantiated with that type.
3. Some lexical construct at the source-code level for character
literals and character string literals that produce characters and
strings encoded in UTF-16.
4. Run-time library support for arrays of this type, providing full
support for Unicode.
But I suspect that's just a dream.
Dave