Hi David,
Thanks for the update. I converted the characters from UCS-2 to UTF-8 using
C APIs. I had taken these Chinese characters (您是如) from Google Translate
and used them in the XML file to test Unicode support. After converting them
from UCS-2 to UTF-8 with the C APIs, I got these characters (귦꺡髧„).
I am no longer getting errors from the Xerces parser.
But I have a question: do the characters themselves change when text is
converted from one encoding to another? If I have a string "abcd", will it
change? I understand that the byte representation differs between encodings,
but I do not understand why the characters themselves are changing after the
conversion. Any information on this would be a great help to me.
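
For reference, the conversion step described above can be sketched as follows.
This is a minimal illustration, not the code actually used: the helper names
ucs2ToUtf8 and toHex are hypothetical, and the sketch assumes the input holds
only UCS-2 code units (no surrogates). It is meant to show that a conversion
changes the stored bytes, not the characters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encode UCS-2 code units as UTF-8 bytes.
// Assumption: the input contains no surrogate code units (0xD800..0xDFFF).
std::string ucs2ToUtf8(const std::vector<std::uint16_t>& units) {
    std::string out;
    for (std::uint16_t u : units) {
        assert(u < 0xD800 || u > 0xDFFF);  // UCS-2 only, no surrogates
        if (u < 0x80) {
            // 1 byte: identical to ASCII
            out += static_cast<char>(u);
        } else if (u < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: render bytes as space-separated uppercase hex.
std::string toHex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

Under these assumptions, the code units 0x60A8 0x662F 0x5982 (您是如) become the
UTF-8 bytes E6 82 A8 E6 98 AF E5 A6 82, while "abcd" becomes 61 62 63 64, the
same bytes as in ASCII. If different characters show up after a conversion, a
likely cause is that the resulting bytes are being read or displayed under a
different encoding than the one they were written in.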
Thanks,
Jaya Nageswar.
On Wed, Sep 3, 2008 at 3:18 AM, David Bertoni <[EMAIL PROTECTED]> wrote:
> Jaya Nageswar wrote:
>
>> Hi,
>>
>> I am using Xerces-C 1.7.0 (ICU build) to parse XML files. The XML file
>> contains some Chinese characters, so I am using the ICU build for Unicode
>> support. I declared the encoding as UTF-8:
>>
>> *<?xml version="1.0" encoding="UTF-8"?>*
>>
>> Part of the XML file contains the following Chinese characters:
>> * <Convert>
>> <FromValue>TRUE</FromValue>
>> <ToValue>您是如</ToValue>
>> </Convert>
>> <Convert>
>> <FromValue>FALSE</FromValue>
>> <ToValue>您好</ToValue>
>> </Convert>*
>>
>> I am using DOM to parse the XML file. I have the following code for DOM
>> parsing:
>>
>> * static const XMLCh gLS[] = { chLatin_L, chLatin_S, chNull };
>> DOMImplementation *impl =
>> DOMImplementationRegistry::getDOMImplementation(gLS);
>> DOMBuilder *CtlParser =
>> ((DOMImplementationLS*)impl)->createDOMBuilder(
>> DOMImplementationLS::MODE_SYNCHRONOUS, 0);*
>>
>> * CtlParser->setFeature(XMLUni::fgDOMNamespaces, true);
>> CtlParser->setFeature(XMLUni::fgXercesSchema, true);
>> CtlParser->setFeature(XMLUni::fgXercesSchemaFullChecking, true);
>> CtlParser->setFeature(XMLUni::fgDOMValidateIfSchema, true);*
>>
>> * //create our error handler and install it
>> XMLErrorHandler errorHandler;
>> CtlParser->setErrorHandler(&errorHandler);
>>
>> CtlDoc = CtlParser->parseURI(XMLFilePath);
>> if(errorHandler.getSawErrors())
>> {
>> cout<<errorHandler.ReturnErrorMessage();
>> } *
>>
>>
>> I am getting the following error.
>> *Message: An exception occurred! Type:UTFDataFormatException,
>> Message:invalid byte 2 (�) of a 2-byte sequence.*
>>
> This indicates your file is not really encoded in UTF-8.
>
>
>> I do not understand why I am getting this error even though I am using
>> the Xerces-C ICU build, which is supposed to work with Unicode characters.
>> If I remove the Chinese characters, I do not get any error message while
>> parsing.
>>
> Xerces-C supports UTF-8 even without using the ICU transcoders.
>
>
>> If anybody has worked with Unicode in Xerces-C, please help me. Did I miss
>> any of the parser settings for Unicode?
>>
> Your file is not encoded in UTF-8, so the parser reports an error. You can
> either fix the file so it's properly encoded, or update the encoding in the
> XML declaration to reflect the actual encoding.
>
> Dave
>
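
The point made in the quoted reply, that the file's bytes do not match its
declared UTF-8 encoding, can be checked before parsing. The sketch below is a
simplified illustration and not how Xerces-C itself validates input: the
function name firstInvalidUtf8Byte is hypothetical, and it only checks the
lead/continuation byte pattern (it does not reject overlong forms or
surrogate code points).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the offset of the first byte that breaks the
// UTF-8 lead/continuation pattern, or std::string::npos if none does.
// For a sequence cut off by the end of the data, the offset of its lead byte
// is returned.
std::size_t firstInvalidUtf8Byte(const std::string& data) {
    std::size_t i = 0;
    while (i < data.size()) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        std::size_t length = 0;
        if (lead < 0x80) {
            length = 1;                      // 0xxxxxxx
        } else if ((lead & 0xE0) == 0xC0) {
            length = 2;                      // 110xxxxx
        } else if ((lead & 0xF0) == 0xE0) {
            length = 3;                      // 1110xxxx
        } else if ((lead & 0xF8) == 0xF0) {
            length = 4;                      // 11110xxx
        } else {
            return i;                        // stray continuation or bad lead
        }
        if (i + length > data.size()) {
            return i;                        // truncated sequence
        }
        for (std::size_t k = 1; k < length; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80) {
                return i + k;                // expected 10xxxxxx here
            }
        }
        i += length;
    }
    return std::string::npos;
}
```

As an assumed example (the real bytes of the original file are not known),
the two bytes 0xC4 0xFA, such as a legacy double-byte encoding might store,
fail at offset 1: 0xC4 announces a 2-byte sequence, but 0xFA is not a
continuation byte. That is the same kind of condition as the "invalid byte 2
of a 2-byte sequence" message above. The UTF-8 bytes E6 82 A8 pass the check.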