On 03/06/2010 10:23, Joe Andrieu wrote:
> I understand that Redland uses UTF-8 internally and in its API.
>
> However, it isn't clear to me the right way to convert from those
> unsigned char* strings to std::wstring or std::string classes, which is
> what my program uses internally.
>
> I have the code "working" using std::string, but I would bet that my
> code wouldn't handle non-ascii characters properly.
>
> Can anyone provide some guidance?
>
> -j

AFAIK the C++ standard library does not provide a string type that can
reliably store Unicode. presumably this is because it's difficult to
implement random-access operations on variable-length encodings like
UTF-8.

[but of course in practice nobody really wants to randomly access
individual characters in a string; what people want is to iterate over
the string, yielding a Unicode character at each step, and there's not
much of a problem with that]

basically you can put a UTF-8 encoded string into std::string, but:
- all the methods work on the individual bytes
- there's no support for accessing the individual UTF-8 characters
- it's far too easy to shoot yourself in the foot
so this does not look like an approach that yields reliable programs.

then there is std::wstring, which uses wchar_t, but it's not usable
either, because on lots of C++ implementations wchar_t is just 16 bits
in size, which is not sufficient to represent all Unicode characters.

if you use a 16-bit wchar_t with UTF-16 encoding, then you get the same
disadvantages as when storing UTF-8 in std::string; they're just more
difficult to detect because they only occur with characters that are
rarely used.

we have some experience with that particular problem in OOo; the
"default" one of our ~6 different string classes (::rtl::OUString) uses
UTF-16 encoding and exposes this encoding to client code, and a
non-exhaustive list of problems that result from that choice can be
found here:
http://qa.openoffice.org/issues/show_bug.cgi?id=102943

imho if there is a reliable and portable way to store Unicode in a
standard C++ string, then it is to instantiate the basic_string template
with a 32-bit integer type, and use that. but i've never tried it, so i
don't know what disadvantages that approach may have.

regards,
michael

-- 
"I believe in Spinoza's God who reveals himself in the orderly harmony
of what exists, not in a God who concerns himself with the fates and
actions of human beings."
   -- Albert Einstein
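
a minimal sketch of the 32-bit approach suggested above, assuming a C++11 compiler, where std::u32string is std::basic_string<char32_t>. the function name utf8_to_utf32 is hypothetical (it is not part of Redland or of the standard library), and throwing on malformed input is just one possible policy. it walks the UTF-8 bytes once and yields one 32-bit value per Unicode code point:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into one char32_t per
// Unicode code point. Throws std::runtime_error on malformed input.
std::u32string utf8_to_utf32(const std::string& in)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes;
    // used to reject overlong encodings. Index 0 is unused.
    static const char32_t min_for_len[] = { 0, 0, 0x80, 0x800, 0x10000 };

    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;

        // The lead byte tells us how long the sequence is.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, UTF-16 surrogates and values > U+10FFFF.
        if (cp < min_for_len[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

since Redland hands back unsigned char* strings, a NUL-terminated pointer p could be passed in as utf8_to_utf32(std::string(reinterpret_cast<const char*>(p))). in the resulting std::u32string, size() and operator[] count code points rather than bytes, though a code point is still not always what a user perceives as one character (combining marks, for example).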
