Hi Anna,
if I am not mistaken, the code you attached doesn't have the sample data
you are trying to parse (e.g. parseString is used to parse the result of
a toXML call on an extension object).
However, you say "in the dom_wrapper.c I print the string before it is
passed to the xerces-c parser [...] and my value in utf-8 looks fine";
in the code you write
cout << "parseString: " << str << endl;
return parseMemory(str.c_str(),(int)str.length());
But the fact that your console prints the data as you expect doesn't
imply that the std::string contains real UTF-8; your shell could be
using a Japanese locale, and so print Shift_JIS-encoded strings
correctly (while failing to print UTF-8-encoded strings).
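If you want to see the actual bytes of what you consider UTF-8, replace
that cout << str with this code
// cast through unsigned char so bytes >= 0x80 are not sign-extended
for(size_t i=0;i<str.length();i++)
    cout << "0x" << hex << (int)(unsigned char)str[i] << " ";
cout << dec << endl;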
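For reference, the byte dump above can also be written as a small self-contained helper. This is only a sketch: hexDump is a name made up for illustration, it is not part of Xerces-C or domtools.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not a Xerces-C or domtools function):
// returns the bytes of s as space-separated, two-digit hex values.
std::string hexDump(const std::string& s) {
    std::string out;
    char buf[8];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Go through unsigned char first, so that bytes >= 0x80 are not
        // sign-extended on platforms where plain char is signed.
        const unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "0x%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling hexDump(str) just before parseMemory shows what is really in the buffer. As a point of comparison, the character U+65E5 is the three bytes 0xe6 0x97 0xa5 in UTF-8, but the two bytes 0x93 0xfa in Shift_JIS, so the dump tells the two encodings apart even when the console output looks identical.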
Alberto
Anna Simbirtsev wrote:
In epp_eppXMLbase.cc, the function createDOMDocument calls the
parseString function from domtools::XercesParser. In the dom_wrapper.c I
print the string before it is passed to the xerces-c parser in
domtools::XercesParser::parseMemory function and my value in utf-8 looks
fine. When a DOM_document comes back from xerces-c, a XercesNode
object (defined in dom_wrapper.h) is used to store the DOM_document and
break it into nodes. Then in epp_eppXMLbase.cc, in function
eppobject::epp::addExtensionElements(EPP_output & outputobject, const
epp_extension_ref_seq_ref & extensions)
it calls
DomPrint dp(outputobject);
dp.putDOMTree(extensionDoc);
from dom_print.cc, where I actually print the value in the putDOMTree
function. Here the value looks truncated.
The entire source code of domtools is available on
http://sourceforge.net/project/showfiles.php?group_id=26675
Thank you very much for your help.
On Wed, 2008-09-17 at 08:19 +0200, Alberto Massari wrote:
Anna Simbirtsev wrote:
I pass just plain xml string to the DOMParser, so I don't use the
transcode function.
[...]
I just copy UTF-8 strings from wikipedia.org and paste them right
into the code to test. After I compiled the parser with ICU, it returns
the string, but shorter. My XML has the UTF-8 encoding set:
<?xml version='1.0' encoding='UTF-8'?>.
If you just used cut & paste from your browser to your C++ code editor,
I bet you are not pasting UTF-8 bytes, but the same characters encoded
in your local code page. Can you attach your source code to this e-mail
(attached, not copied)?
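A rough way to check a pasted literal yourself is a structural UTF-8 test like the sketch below. The name looksLikeUtf8 is hypothetical (it is not a Xerces-C or domtools function), and the check only looks at lead and continuation byte patterns; it does not reject overlong forms or surrogates, so a string in another code page can occasionally slip through.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if every byte of s fits the
// lead-byte / continuation-byte layout of UTF-8.
bool looksLikeUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;                       // continuation bytes expected
        if (c < 0x80)                    extra = 0;  // plain ASCII
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;  // 2-byte sequence
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;  // 3-byte sequence
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;  // 4-byte sequence
        else return false;                       // not a valid lead byte
        if (extra > s.size() - i - 1) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) return false;    // must be 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}
```

If the test string in the source is written with explicit escapes, for example "\xE6\x97\xA5" for U+65E5, the bytes no longer depend on the encoding the editor used when saving the file.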
Alberto