> You are confusing code points and code units. The size of a code unit in
> UTF-8 is an octet (8 bits, or one byte on most architectures). The number
> of octets required to encode a particular Unicode code point in UTF-8 is 1,
> 2, 3, or 4. If you ignore architectures where a byte stores more than 8
> bits, you can then assume that an octet and a byte are interchangeable.

This means that std::string can also be used as a container for UTF-8, but its length() then does not report the number of UTF-8 characters (code points), only the number of code units (bytes).
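
For example (just a minimal sketch of my own, not anything from Xerces): length() reports code units, and if you want the number of code points you have to skip the UTF-8 continuation bytes yourself.

#include <cstddef>
#include <iostream>
#include <string>

// Count code points by skipping UTF-8 continuation bytes
// (continuation bytes have the bit pattern 10xxxxxx).
std::size_t utf8CodePoints(const std::string& s)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

int main()
{
    // "Gruesse" with u-umlaut and sharp s, encoded as UTF-8:
    // the two non-ASCII characters take two bytes each.
    const std::string s = "Gr\xC3\xBC\xC3\x9F" "e";

    std::cout << s.length()        << '\n';  // 7 code units (bytes)
    std::cout << utf8CodePoints(s) << '\n';  // 5 code points
    return 0;
}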
> UTF-8 was designed to be compatible with the char data type, and
> null-terminated arrays of UTF-8 code units are compatible with many C/C++
> runtime functions that accept C-style strings. The problems start when you
> rely on locale-specific behavior, or you make assumptions about the
> relationship of code points and code units. For example, a substring
> operation could be problematic if I split a multi-byte UTF-8 sequence.
> Another example is code that relies on functions like isdigit, which are
> sensitive to the locale and/or the system default encoding for char. In
> that case, UTF-8 bytes might be mistakenly interpreted as code points in
> the system encoding.

AFAIK the std::string methods rely on the local code page, which means that these methods do not work correctly with UTF-8. But the correctness of the STL classes and methods is a major advantage of the encapsulation. This is why I think it would be better to convert from XMLChar* to something like std::basic_string<UTF8Char>; then the string methods should still work. When you use std::string in a piece of software, you normally work with the local code page anyway, so converting from XMLChar* to the local code page might also be acceptable.

Sven
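
P.S. To illustrate the substring problem from the quoted text, here is a minimal sketch (the names utf8Prefix and toSequenceStart are just mine for illustration, not part of Xerces or the STL) of how one could cut a UTF-8 std::string only at sequence boundaries:

#include <cstddef>
#include <iostream>
#include <string>

// Move a byte index backwards until it no longer points at a UTF-8
// continuation byte (10xxxxxx), i.e. until it is a sequence start.
std::size_t toSequenceStart(const std::string& s, std::size_t pos)
{
    while (pos > 0 &&
           (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80) {
        --pos;
    }
    return pos;
}

// Return at most maxBytes leading bytes of s, truncated at a code
// point boundary so the result is still valid UTF-8.
std::string utf8Prefix(const std::string& s, std::size_t maxBytes)
{
    if (maxBytes >= s.size()) {
        return s;
    }
    return s.substr(0, toSequenceStart(s, maxBytes));
}

int main()
{
    // "Gruesse" with u-umlaut and sharp s in UTF-8 (7 bytes, 5 code points).
    const std::string s = "Gr\xC3\xBC\xC3\x9F" "e";

    std::cout << s.substr(0, 3) << '\n';   // "Gr" plus half of the u-umlaut
    std::cout << utf8Prefix(s, 3) << '\n'; // "Gr" -- no sequence is split
    return 0;
}

Of course you still have to decide what to do when the cut falls inside a multi-byte sequence; in this sketch I simply drop the partial sequence.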
