> You are confusing code points and code units. The size of a code unit in
> UTF-8 is an octet (8 bits, or one byte on most architectures). The number
> of octets required to encode a particular Unicode code point in UTF-8 is 1,
> 2, 3, or 4. If you ignore architectures where a byte stores more than 8
> bits, you can then assume that an octet and a byte are interchangeable.

This means that std::string can also be used as a container for UTF-8, but its length() then does not report the number of UTF-8 characters (code points), only the number of code units (bytes).
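
For example (just a minimal sketch of my own, not anything from Xerces): length() reports code units, and if you want the number of code points you have to skip the UTF-8 continuation bytes yourself.

#include <cstddef>
#include <iostream>
#include <string>

// Count code points by skipping UTF-8 continuation bytes
// (continuation bytes have the bit pattern 10xxxxxx).
std::size_t utf8CodePoints(const std::string& s)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

int main()
{
    // "Gruesse" with u-umlaut and sharp s, encoded as UTF-8:
    // the two non-ASCII characters take two bytes each.
    const std::string s = "Gr\xC3\xBC\xC3\x9F" "e";

    std::cout << s.length()        << '\n';  // 7 code units (bytes)
    std::cout << utf8CodePoints(s) << '\n';  // 5 code points
    return 0;
}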
> UTF-8 was designed to be compatible with the char data type, and
> null-terminated arrays of UTF-8 code units are compatible with many C/C++
> runtime functions that accept C-style strings. The problems start when you
> rely on locale-specific behavior, or you make assumptions about the
> relationship of code points and code units. For example, a substring
> operation could be problematic if I split a multi-byte UTF-8 sequence.
> Another example is code that relies on functions like isdigit, which are
> sensitive to the locale and/or the system default encoding for char. In
> that case, UTF-8 bytes might be mistakenly interpreted as code points in
> the system encoding.

AFAIK the std::string methods rely on the local code page, which means that these methods do not work correctly with UTF-8. But the correctness of the STL classes and methods is a major advantage of the encapsulation. This is why I think it would be better to convert from XMLChar* to something like std::basic_string<UTF8Char>; then the string methods should still work. When you use std::string in a piece of software, you normally work with the local code page anyway, so converting from XMLChar* to the local code page might also be acceptable.

Sven
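
P.S. To illustrate the substring problem from the quoted text, here is a minimal sketch (the names utf8Prefix and toSequenceStart are just mine for illustration, not part of Xerces or the STL) of how one could cut a UTF-8 std::string only at sequence boundaries:

#include <cstddef>
#include <iostream>
#include <string>

// Move a byte index backwards until it no longer points at a UTF-8
// continuation byte (10xxxxxx), i.e. until it is a sequence start.
std::size_t toSequenceStart(const std::string& s, std::size_t pos)
{
    while (pos > 0 &&
           (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80) {
        --pos;
    }
    return pos;
}

// Return at most maxBytes leading bytes of s, truncated at a code
// point boundary so the result is still valid UTF-8.
std::string utf8Prefix(const std::string& s, std::size_t maxBytes)
{
    if (maxBytes >= s.size()) {
        return s;
    }
    return s.substr(0, toSequenceStart(s, maxBytes));
}

int main()
{
    // "Gruesse" with u-umlaut and sharp s in UTF-8 (7 bytes, 5 code points).
    const std::string s = "Gr\xC3\xBC\xC3\x9F" "e";

    std::cout << s.substr(0, 3) << '\n';   // "Gr" plus half of the u-umlaut
    std::cout << utf8Prefix(s, 3) << '\n'; // "Gr" -- no sequence is split
    return 0;
}

Of course you still have to decide what to do when the cut falls inside a multi-byte sequence; in this sketch I simply drop the partial sequence.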
