Anna Simbirtsev wrote:
When I print it in hex format, I get
�: 0xffffffd0
�: 0xffffffb1
�: 0xffffffd0
�: 0xffffffb1
�: 0xffffffd0
�: 0xffffffb1

Which I am not even sure what format, but maybe my shell does not
know what it is.
You need to understand the limitations of any library you use. Here is a snippet of the source code from the domtools library you're using:

string domtools::toString(const DOMString s)
{
   char * t = s.transcode();
   if (!t) return "";
   string tmp = t;
   delete [] t;
   return tmp;
}

You can see the call to DOMString::transcode(). This will fail when characters in the DOMString are not representable in the local code page. This is likely what's happening, and I suggest you find another library to use, because this one is broken.

Alternatively, if you always want to transcode data to UTF-8, you can modify the library to use a UTF-8 transcoder instead of the local code page transcoder. There was another thread on this topic late last week and this week.
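
For illustration, here is a minimal sketch of the kind of conversion a UTF-8 transcoder performs. It is not the Xerces-C transcoder and not part of domtools: utf16ToUtf8 is a hypothetical helper, and it assumes you can get at the string's UTF-16 code units as a pointer plus a length. In real code you would more likely use the transcoding services the parser already provides.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Xerces-C or domtools API): encodes a buffer of
// UTF-16 code units as a UTF-8 byte string. Unlike a local code page
// transcoder, this cannot fail on "unrepresentable" characters, because
// UTF-8 can encode every Unicode code point.
std::string utf16ToUtf8(const char16_t* src, std::size_t len)
{
    std::string out;
    for (std::size_t i = 0; i < len; ++i) {
        std::uint32_t cp = src[i];

        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < len &&
            src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            // Valid surrogate pair: combine into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            // Unpaired surrogate: substitute U+FFFD REPLACEMENT CHARACTER.
            cp = 0xFFFD;
        }

        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this sketch, the single code unit 0x0431 (Cyrillic "б") encodes as the two bytes 0xD0 0xB1. That is the same byte pair that repeats in the quoted output; the leading ffffff there is presumably just a signed char being sign-extended when printed as an int.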

Dave
