Sven Bauhan wrote:
You are confusing code points and code units.  The size of a code unit in
UTF-8 is an octet (8 bits, or one byte on most architectures).  The number
of octets required to encode a particular Unicode code point in UTF-8 is 1,
2, 3, or 4.  If you ignore architectures where a byte stores more than 8
bits, you can then assume that an octet and a byte are interchangeable.

Then this means that std::string can also be used as a container for UTF-8, but its length() will not necessarily be the correct number of UTF-8 characters.

I'm not sure what you mean by "UTF-8 characters," so I'm going to assume you mean the number of Unicode code points. But the system code page might also be a multi-byte encoding, or an encoding with shift states, so length() behaves the same way in that case: it still tells you the number of code units.
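
Here is a minimal sketch of what I mean (the string literal is just an illustration):

    #include <iostream>
    #include <string>

    int main() {
        // "héllo" as UTF-8: 'é' takes two code units (0xC3 0xA9).
        std::string s = "h\xC3\xA9llo";

        // size()/length() count code units (bytes), not code points.
        std::cout << s.size() << "\n";   // prints 6

        // Counting code points means skipping the UTF-8 continuation
        // bytes (10xxxxxx).
        std::size_t points = 0;
        for (unsigned char c : s)
            if ((c & 0xC0) != 0x80)
                ++points;
        std::cout << points << "\n";     // prints 5
    }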


UTF-8 was designed to be compatible with the char data type, and
null-terminated arrays of UTF-8 code units are compatible with many C/C++
runtime functions that accept C-style strings.  The problems start when you
rely on locale-specific behavior, or you make assumptions about the
relationship of code points and code units.  For example, a substring
operation could be problematic if it splits a multi-byte UTF-8 sequence.
Another example is code that relies on functions like isdigit, which are
sensitive to the locale and/or the system default encoding for char.  In
that case, UTF-8 bytes might be mistakenly interpreted as code points in
the system encoding.
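
To make the substring problem concrete, here is a sketch (the helper name is mine, not anything standard) of backing a split position up to a code point boundary before cutting:

    #include <string>

    // Move 'pos' back until it lands on a UTF-8 sequence boundary,
    // i.e. a byte that is not a continuation byte (10xxxxxx).
    std::size_t adjust_to_boundary(const std::string& s, std::size_t pos) {
        while (pos > 0 &&
               (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80)
            --pos;
        return pos;
    }

    // Naive: s.substr(0, 2) on s = "h\xC3\xA9llo" ends with the lead
    // byte 0xC3 and drops its continuation byte, leaving an invalid
    // sequence.  Safer: s.substr(0, adjust_to_boundary(s, 2)) cuts on
    // a code point boundary.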

AFAIK the std::string methods rely on the local code page.

Which member functions are you talking about? Since "the local code page" varies widely depending on the system and the current locale, you need to be more specific.

This means that these methods do not work correctly with UTF-8. But
the correctness of the STL classes and methods is a major advantage
of the encapsulation. This is the reason why I think it would be
better to convert from XMLChar* to something like std::basic_string<UTF8Char>.
Then the string methods would work correctly. When using std::string
in software, you normally use the local code page anyhow. So this
might also be OK when converting from XMLChar*.

There is no way to make std::string work the way you want it to -- it's very code unit-oriented. For example, you might want operator[] to work on code points, but there's no way to do that. You also might want length() or size() to tell you the number of code points, but that won't happen either. I've had some experience with these issues when trying to implement char_traits<> for UTF-16 code units that works well when there are surrogate pairs in the string.
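
To illustrate why: locating the n-th code point takes a linear scan, which doesn't fit operator[]'s constant-time, code unit-oriented interface. A sketch (the function name is mine):

    #include <string>

    // Byte offset of the n-th code point in a UTF-8 string.  O(n):
    // it has to walk the string, because code points vary in width.
    std::size_t offset_of_code_point(const std::string& s, std::size_t n) {
        std::size_t i = 0;
        while (n > 0 && i < s.size()) {
            ++i;  // step past the lead byte,
            // then past any continuation bytes (10xxxxxx)
            while (i < s.size() &&
                   (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
                ++i;
            --n;
        }
        return i;  // s.size() if the string has fewer than n+1 points
    }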

std::string works fine with UTF-8 as long as you restrict yourself to the subset of functionality that can be guaranteed to work across all possible local code pages and locales.
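
For example, searching for an ASCII delimiter is safe, because UTF-8 guarantees that bytes below 0x80 never occur inside a multi-byte sequence. A sketch (dirname_utf8 is just an illustrative name):

    #include <string>

    // find()/rfind() on '/' cannot land in the middle of a code
    // point, so splitting a UTF-8 path this way is well-defined.
    std::string dirname_utf8(const std::string& path) {
        std::size_t slash = path.rfind('/');
        return slash == std::string::npos ? std::string(".")
                                          : path.substr(0, slash);
    }

    // Also safe: concatenation, byte-wise equality comparison, and
    // copying -- none of these depend on the code page or the locale.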

Dave
