Georg Baum wrote:
On Wednesday, 16 August 2006 18:12, Abdelrazak Younes wrote:
Lars Gullik Bjønnes wrote:
string.length() will be lying to you when you store utf-8 in it.
Why is that? Because of some trailing \0?
No. utf8 is a multibyte encoding: some characters use one byte, some two,
and some up to four. The benefit of utf8 is that the ASCII characters
use the same encoding as in the 7bit ASCII code. string.length() therefore
returns the number of bytes, which is not always the number of characters
in the string if it is in utf8.
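To make the point concrete, here is a minimal sketch. The helper names utf8_sample and check_sample are made up for this illustration; the sample text is the name "Bjønnes" written out as UTF-8 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// "Bjønnes" encoded as UTF-8: 'ø' (U+00F8) takes two bytes, 0xC3 0xB8.
// The literal is split after the escape so that no following character
// can be read as part of the hex sequence.
inline std::string utf8_sample() {
    return std::string("Bj\xC3\xB8" "nnes");
}

// std::string::length() counts char elements, i.e. bytes:
// 7 visible characters, but 8 bytes.
inline void check_sample() {
    const std::size_t bytes = utf8_sample().length();
    assert(bytes == 8);
}
```

For pure ASCII text the two counts agree, which is why the mismatch is easy to miss in testing.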
Hum... I am not sure I follow everything, but let me summarize what I
understand from the current code. The std::vectors I am talking about are:
* vector<char>: could be replaced by std::basic_string<char>
* vector<unsigned char>: that is ucs2 right? That could be replaced by
std::basic_string<unsigned char>
* vector<boost::uint32_t>: I guess that is ucs4 and that could be
replaced by std::basic_string<boost::uint32_t>
Internally we should just use one of those three types. The conversion
to this complicated utf8 encoding should happen on input/output only.
Handling a multi-byte encoding internally is just a recipe for a buggy
future IMHO.
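As a rough sketch of what "convert on input only" could look like: the function below turns UTF-8 bytes into one 32-bit value per character. The name decode_utf8 is hypothetical and this is not the project's actual conversion code. It assumes well-formed input and does no error checking, which a real converter (or a library) would have to do.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decodes UTF-8 into one 32-bit value per code point.
// Assumption: the input is well-formed; malformed bytes are not detected.
inline std::vector<std::uint32_t> decode_utf8(const std::string& in) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }  // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
        else                  { cp = lead & 0x07; extra = 3; }  // 11110xxx
        ++i;
        // Each continuation byte (10xxxxxx) contributes six payload bits.
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

After decoding, size() on the result is the number of characters and indexing is a plain array access, which is the property asked for here.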
So what am I not getting right here?
If the different parts all talk the same language why would there be any
confusion? I mean, if it is just a matter of adding plus or minus one,
that's not a big deal. And I guess we could still of course subclass
basic_string and re-implement length(), couldn't we?
That would not be so easy, because we would need to parse the utf8 encoded
string. Better leave that to some library.
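For illustration only, this is roughly the parsing a character-counting length() would have to do. The names utf8_code_points and check_count are made up, and the sketch assumes the string already holds valid UTF-8; it does not validate anything.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte
// starts a new code point, so only those are counted.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Usage: "Bjønnes" is 8 bytes in UTF-8 but 7 characters.
inline void check_count() {
    const std::string name("Bj\xC3\xB8" "nnes");
    assert(name.length() == 8);
    assert(utf8_code_points(name) == 7);
}
```

Note that this walks the whole string on every call, so it is linear in the byte count, whereas std::string::length() is constant time. Handling invalid input and combining characters is where a library earns its keep.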
Georg