On Monday, 13.04.2009, at 18:47 +0800, Weimin Xie wrote:
> I'm learning how to use Glib::ustring. My goal is to split an ustring
> of unicode character into a vector container. In a simple case, my
> program have read a Chinese character, for example, "你". When I tried
> to use the Glib::ustring::iterator to go over the ustring, it shows
> there are more than one entry.
>
> If description = "你", then
[...]
> Gives me
> size <5> bytes <8> char <228> char <189> char <160> char <10> char
> <10>

To me, this looks suspiciously like what happens when a string gets
encoded twice. That is, I suspect you already had a UTF-8 encoded
string, which was subsequently interpreted as a string of ISO-8859-1
bytes and then converted to UTF-8 a second time.

With just one code point (你) plus the two trailing newline characters,
the output for size should have been 3 instead of 5, and the number of
bytes should have been 5 rather than 8. Re-encoding turns each of the
three bytes above 127 into two bytes, which gives exactly
3 * 2 + 2 = 8 bytes and 5 characters.

The interpretation of a UTF-8 string as ISO-8859-1 would also explain
why you see exactly the numbers you would see if you were iterating
over the bytes of the correctly encoded original string: up to code
point 255, Unicode is identical to ISO-8859-1.

> Can someone please explain why the iterator doesn't go over the
> unicode characters as expected?

It probably does. It's just that your string doesn't contain what you
think it does.

> Thanks a lot in advance!

You're welcome. If you still think it's a problem in glibmm, please
file a bug and attach a test case, so we can reproduce the problem.

--Daniel
