Here is an example of what I see before and after I pass the string to the Xerces-C parser:

<Returned_XML>
<parseme xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><ipr:create xmlns:ipr="urn:afilias:params:xml:ns:ipr-1.1 " xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:afilias:params:xml:ns:ipr-1.1 ipr-1.1.xsd"><ipr:name>édia</ipr:name><ipr:number>12345566</ipr:number><ipr:ccLocality>CA</ipr:ccLocality><ipr:regDate>2001-01-01</ipr:regDate><ipr:appDate>2002-01-01</ipr:appDate><ipr:class>1</ipr:class><ipr:entitlement>owner</ipr:entitlement><ipr:form>corporation</ipr:form><ipr:preVerified>code</ipr:preVerified><ipr:type>sunrise</ipr:type></ipr:create> </parseme>
</Returned_XML>

The UTF-8 string is <ipr:name>édia</ipr:name>.

When the document comes back from the parser as a DOM_document, I extract each node and print its value. This is what I see:

value: édi
value: 12345566
value: CA
value: 2001-01-01
value: 2002-01-01
value: 1
value: owner
value: corporation
value: code
value: sunrise

So the parser dropped the last character, 'a', from édia. All the other values come back intact.

On Wed, 2008-09-17 at 08:19 +0200, Alberto Massari wrote:
> Anna Simbirtsev wrote:
> > I pass just plain xml string to the DOMParser, so I don't use the
> > transcode function.
> >
> > [...]
> > I just copy utf-8 strings from wikipedia.org and paste it right
> > into the code to test. After I compiled the parser with ICU, it returns
> > the string, but shorter. My xml has UTF-8 encoding set: <?xml
> > version='1.0' encoding='UTF-8'?>.
> >
>
> If you just used cut & paste from your browser to your C++ code editor,
> I can bet you are not pasting UTF-8 codepoints, but something in your
> local code page. Can you attach your source code to this e-mail
> (attached, not copied)?
>
> Alberto
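
One way to check Alberto's suggestion before involving the parser at all is to dump the raw bytes of the string literal as the compiler sees it. The sketch below is only an illustration: `hexDump` and `looksLikeUtf8` are hypothetical helper names, not part of the Xerces-C API, and the validity check is purely structural (it does not reject overlong encodings or surrogates). In UTF-8, 'é' is the two bytes C3 A9; in Latin-1 or Windows-1252 it is the single byte E9.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of s as two uppercase hex
// digits, separated by single spaces.
std::string hexDump(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// Hypothetical helper: structural UTF-8 check. Every lead byte must be
// followed by the right number of continuation bytes (10xxxxxx).
// Overlong forms and surrogates are NOT rejected in this sketch.
bool looksLikeUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80)                extra = 0;  // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;  // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;  // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;  // 4-byte sequence
        else return false;                       // stray continuation or invalid lead
        if (extra > s.size() - i - 1) return false;  // sequence cut off at end
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

Calling `hexDump` on the pasted literal (for example the text inside `<ipr:name>`) shows which case applies. If the dump begins with E9 rather than C3 A9, the source file was most likely saved in a local code page, and the parser is being handed bytes that do not match the declared `encoding='UTF-8'`.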
