Re: Problems with xerces-c version 1.7.0 and UTF-8

Alberto Massari Wed, 17 Sep 2008 06:41:47 -0700

Hi Anna,

If I am not mistaken, the code you attached doesn't include the sample data you are trying to parse (e.g. parseString is used to parse the result of a toXML call on an extension object). However, you say "in the dom_wrapper.c I print the string before it is passed to the xerces-c parser [...] and my value in utf-8 looks fine"; in the code you write


  cout << "parseString: " << str << endl;
  return parseMemory(str.c_str(),(int)str.length());

But the fact that your console prints the data as you expect doesn't imply that the std::string contains real UTF-8; your shell could be using a Japanese locale, and be able to correctly print Shift_JIS-encoded strings (while failing to print UTF-8-encoded strings). If you want to see the actual bytes of what you consider to be UTF-8, replace that cout << str with this code


// go through unsigned char so that bytes >= 0x80 are not sign-extended
for(size_t i=0;i<str.length();i++)
 cout << "0x" << hex << (int)(unsigned char)str[i] << " ";
cout << dec << endl;
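
To go one step beyond reading the hex dump by eye, here is a minimal sketch that also checks whether the bytes are structurally well-formed UTF-8. The helper names hexDump and looksLikeUtf8 are hypothetical (they are not part of xerces-c or domtools), and passing the check does not prove the text was meant as UTF-8; it only shows that the byte sequence is legal. Shift_JIS-encoded Japanese text will typically fail it.

```cpp
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the bytes of s as hex, e.g. "0xe6 0x97 0xa5".
std::string hexDump(const std::string& s) {
    std::ostringstream out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0) out << ' ';
        // Cast through unsigned char so bytes >= 0x80 are not sign-extended.
        out << "0x" << std::hex << std::setw(2) << std::setfill('0')
            << static_cast<int>(static_cast<unsigned char>(s[i]));
    }
    return out.str();
}

// Hypothetical helper: true if s is structurally well-formed UTF-8
// (valid lead bytes, valid continuation bytes, no overlong forms,
// no surrogates, nothing above U+10FFFF).
bool looksLikeUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;             // number of continuation bytes
        unsigned char lo = 0x80, hi = 0xBF; // range allowed for the first one
        if (b0 <= 0x7F) {
            extra = 0;
        } else if (b0 >= 0xC2 && b0 <= 0xDF) {
            extra = 1;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            extra = 2;
            if (b0 == 0xE0) lo = 0xA0;     // reject overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;     // reject UTF-16 surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            extra = 3;
            if (b0 == 0xF0) lo = 0x90;     // reject overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;     // reject values above U+10FFFF
        } else {
            return false;                  // stray continuation or invalid lead
        }
        if (n - i - 1 < extra) return false; // sequence truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if (c < lo || c > hi) return false;
            lo = 0x80;                     // later bytes use the normal range
            hi = 0xBF;
        }
        i += extra + 1;
    }
    return true;
}
```

In parseString you could then print hexDump(str) and looksLikeUtf8(str) just before the call to parseMemory. For example, the two characters of "Nihon" are the bytes e6 97 a5 e6 9c ac in UTF-8 and 93 fa 96 7b in Shift_JIS; the first sequence passes the check and the second one does not.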

Alberto

Anna Simbirtsev wrote:

In the epp_eppXMLbase.cc in function createDOMDocument it calls
parseString function from domtools::XercesParser. In the dom_wrapper.c I
print the string before it is passed to the xerces-c parser in
domtools::XercesParser::parseMemory function and my value in utf-8 looks
fine. When it gets back from xerces-c a DOM_document, it uses XercesNode
object (defined in dom_wrapper.h) to store the DOM_document and break it
into nodes. Then in epp_eppXMLbase.cc, in function
eppobject::epp::addExtensionElements(EPP_output & outputobject, const
epp_extension_ref_seq_ref & extensions)

it calls
DomPrint dp(outputobject);
dp.putDOMTree(extensionDoc);

from dom_print.cc where I actually print the value in putDOMTree
function. Here the value looks truncated.
The entire source code of domtools is available on
http://sourceforge.net/project/showfiles.php?group_id=26675

Thank you very much for your help.

On Wed, 2008-09-17 at 08:19 +0200, Alberto Massari wrote:

Anna Simbirtsev wrote:
I pass just plain xml string to the DOMParser, so I don't use the
transcode function.

[...]
I just copy utf-8 strings from wikipedia.org and paste it right
into the code to test. After I compiled the parser with ICU, it returns
the string, but shorter. My xml has UTF-8 encoding set: <?xml
version='1.0' encoding='UTF-8'?>.
If you just used cut & paste from your browser to your C++ code editor, I can bet you are not pasting UTF-8-encoded text, but text in your local code page. Can you attach your source code to this e-mail (attached, not copied)?
Alberto
