Here is an example of what I see before I pass the string to the
xerces-c parser and after:

<Returned_XML>
<parseme
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><ipr:create
xmlns:ipr="urn:afilias:params:xml:ns:ipr-1.1 "
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:afilias:params:xml:ns:ipr-1.1
ipr-1.1.xsd"><ipr:name>édia</ipr:name><ipr:number>12345566</ipr:number><ipr:ccLocality>CA</ipr:ccLocality><ipr:regDate>2001-01-01</ipr:regDate><ipr:appDate>2002-01-01</ipr:appDate><ipr:class>1</ipr:class><ipr:entitlement>owner</ipr:entitlement><ipr:form>corporation</ipr:form><ipr:preVerified>code</ipr:preVerified><ipr:type>sunrise</ipr:type></ipr:create>
</parseme>

</Returned_XML>

The utf-8 string is <ipr:name>édia</ipr:name>.

When it comes back from the parser in the form of a DOM_document and I
extract the nodes and print their values, I see:

value: édi
value: 12345566
value: CA
value: 2001-01-01
value: 2002-01-01
value: 1
value: owner
value: corporation
value: code
value: sunrise

So it dropped the last character, 'a', from édia.

On Wed, 2008-09-17 at 08:19 +0200, Alberto Massari wrote:
> Anna Simbirtsev wrote:
> > I pass just plain xml string to the DOMParser, so I don't use the
> > transcode function.
> >
> > [...]
> > I just copy utf-8 strings from wikipedia.org and paste it right
> > into the code to test. After I compiled the parser with ICU, it returns
> > the string, but shorter. My xml has UTF-8 encoding set: <?xml
> > version='1.0' encoding='UTF-8'?>.
> >   
> 
> If you just used cut & paste from your browser to your C++ code editor, 
> I can bet you are not pasting UTF-8 codepoints, but something in your 
> local code page. Can you attach your source code to this e-mail 
> (attached, not copied)?
> 
> Alberto
