Hi Anna,
If I am not mistaken, the code you attached doesn't include the sample data you are trying to parse (for example, parseString is used to parse the result of a toXML call on an extension object). However, you say "in the dom_wrapper.c I print the string before it is passed to the xerces-c parser [...] and my value in utf-8 looks fine"; in the code you write

  cout << "parseString: " << str << endl;
  return parseMemory(str.c_str(),(int)str.length());

But the fact that your console prints the data as you expect doesn't imply that the std::string contains real UTF-8: your shell could be using a Japanese locale, and so be able to print Shift_JIS-encoded strings correctly (while failing to print UTF-8-encoded ones). If you want to see the actual bytes of what you are treating as UTF-8, replace that cout << str with this code:

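// Go through unsigned char so bytes >= 0x80 are not sign-extended (e.g. to 0xffffffe6)
for(size_t i=0;i<str.length();i++)
 cout << "0x" << hex << (int)(unsigned char)str[i] << " ";
cout << dec << endl;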
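
If reading the hex dump by eye is tedious, a rough structural check can tell UTF-8 apart from a legacy encoding such as Shift_JIS. The following is only a sketch: hexDump and looksLikeUtf8 are hypothetical helper names (they are not part of domtools or xerces-c), and the check is simplified (it verifies lead and continuation bytes, not every overlong or surrogate case).

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: render every byte of s as two-digit hex, space separated.
std::string hexDump(const std::string& s) {
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0) out << ' ';
        // unsigned char first, so bytes >= 0x80 are not sign-extended
        out << std::setw(2)
            << static_cast<unsigned>(static_cast<unsigned char>(s[i]));
    }
    return out.str();
}

// Hypothetical helper: simplified check that s is structurally valid UTF-8.
bool looksLikeUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;                       // continuation bytes expected
        if (c < 0x80)                   extra = 0;   // plain ASCII
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;  // 2-byte sequence
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;  // 3-byte sequence
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;  // 4-byte sequence
        else return false;                           // stray continuation or invalid lead
        if (s.size() - i <= extra) return false;     // sequence cut short
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;   // continuation must be 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}
```

For the two characters U+65E5 U+672C, the UTF-8 bytes are e6 97 a5 e6 9c ac and pass the check, while the Shift_JIS bytes 93 fa 96 7b fail it, because 0x93 cannot start a UTF-8 sequence.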

Alberto

Anna Simbirtsev wrote:
In the epp_eppXMLbase.cc in function createDOMDocument it calls
parseString function from domtools::XercesParser. In the dom_wrapper.c I
print the string before it is passed to the xerces-c parser in
domtools::XercesParser::parseMemory function and my value in utf-8 looks
fine. When it gets a DOM_document back from xerces-c, it uses a XercesNode
object (defined in dom_wrapper.h) to store the DOM_document and break it
into nodes. Then in epp_eppXMLbase.cc, in function
eppobject::epp::addExtensionElements(EPP_output & outputobject, const
epp_extension_ref_seq_ref & extensions)

it calls
DomPrint dp(outputobject);
dp.putDOMTree(extensionDoc);

from dom_print.cc where I actually print the value in putDOMTree
function. Here the value looks truncated.
The entire source code of domtools is available on
http://sourceforge.net/project/showfiles.php?group_id=26675

Thank you very much for your help.

On Wed, 2008-09-17 at 08:19 +0200, Alberto Massari wrote:
Anna Simbirtsev wrote:
I pass just plain xml string to the DOMParser, so I don't use the
transcode function.

[...]
I just copy utf-8 strings from wikipedia.org and paste them right
into the code to test. After I compiled the parser with ICU, it returns
the string, but shorter. My xml has UTF-8 encoding set: <?xml
version='1.0' encoding='UTF-8'?>.
If you just used cut & paste from your browser to your C++ code editor, I would bet you are not pasting UTF-8 bytes, but characters encoded in your local code page. Can you attach your source code to this e-mail (attached, not copied)?
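
One way to take the editor's code page out of the picture while testing is to spell the value as explicit byte escapes instead of pasted characters. This is only a sketch: sampleUtf8Value, sampleDocument and the <value> element are hypothetical names chosen for illustration, not taken from your code.

```cpp
#include <string>

// Hypothetical test value: the characters U+65E5 U+672C written as explicit
// UTF-8 bytes, so the result does not depend on how the source file is saved.
std::string sampleUtf8Value() {
    return "\xE6\x97\xA5\xE6\x9C\xAC";
}

// Hypothetical test document: the <value> element is an assumption made for
// this example; the declaration matches the one quoted above.
std::string sampleDocument() {
    return "<?xml version='1.0' encoding='UTF-8'?><value>"
           + sampleUtf8Value() + "</value>";
}
```

If a string built this way survives the round trip through the parser but a pasted one does not, the pasted text was most likely not UTF-8 to begin with.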

Alberto
