Re: [patch] fix plain text output

Abdelrazak Younes Wed, 16 Aug 2006 09:43:21 -0700

Georg Baum wrote:

On Wednesday, 16 August 2006 18:12, Abdelrazak Younes wrote:
Lars Gullik Bjønnes wrote:
string.length() will be lying to you when you store utf-8 in it.
Why is that? Because of some trailing \0?
No. utf8 is a multibyte encoding: some characters use one byte, some two, and some even more AFAIK. The benefit of utf8 is that the ASCII characters use the same encoding as in the 7-bit ASCII code. string.length() therefore does not always give the number of characters in the string if it is in utf8.
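To make the byte-versus-character difference concrete, here is a minimal sketch. The helper name utf8_length is hypothetical (it is not part of the patch or of any library), and it assumes the input is already well-formed utf8; it only skips continuation bytes and does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 encoded std::string.
// Assumption: the input is well-formed UTF-8 (nothing is validated here).
// Every code point has exactly one byte that is NOT a continuation byte;
// continuation bytes have the bit pattern 10xxxxxx.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII byte
        }
    }
    return count;
}
```

For the utf8 bytes of "Bjønnes" (written in C++ as "Bj\xC3\xB8nnes"), std::string::length() returns 8, while this helper returns 7, because the "ø" takes two bytes.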

Hum... I am not sure I follow everything, but let me summarize what I understand from the current code. The std::vectors I am talking about are:


* vector<char>: could be replaced by std::basic_string<char>

* vector<unsigned char>: that is ucs2, right? That could be replaced by std::basic_string<unsigned char>

* vector<boost::uint32_t>: I guess that is ucs4 and that could be replaced by std::basic_string<unsigned char>

Internally we should just use one of those three types. The conversion to this complicated utf8 encoding should happen on input/output only. Handling a multi-byte encoding internally is just a recipe for a buggy future IMHO.
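As an illustration of converting at the input boundary only, here is a rough sketch that decodes utf8 bytes into fixed-width ucs4 values. The function name utf8_to_ucs4 is hypothetical and not taken from the code base, and std::uint32_t stands in for boost::uint32_t. It assumes well-formed input and does not reject overlong forms, stray continuation bytes, or truncated sequences, so real code would rather rely on a tested library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 bytes into one 32-bit value per
// character. Assumption: the input is well-formed UTF-8; malformed
// input is not detected.
std::vector<std::uint32_t> utf8_to_ucs4(const std::string& in) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes to read
        if (lead < 0x80) {
            cp = lead;          // 0xxxxxxx: plain ASCII
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F;   // 110xxxxx: two-byte sequence
            extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F;   // 1110xxxx: three-byte sequence
            extra = 2;
        } else {
            cp = lead & 0x07;   // 11110xxx: four-byte sequence
            extra = 3;
        }
        ++i;
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 0; k < extra && i < in.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Once the text is in this form, size() on the vector is the number of characters, and indexing never lands in the middle of a character.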


So what do I not get right here?

If the different parts all talk the same language, why would there be any confusion? I mean, if it is just a matter of adding plus or minus one, that's not a big deal. And I guess we could still of course subclass basic_string and re-implement length(), couldn't we?
That would not be so easy, because we would need to parse the utf8 encoded string. Better leave that to some library.
Georg
