You should make sure that the XML document states its actual encoding in the
XML declaration:

<?xml version="1.0" encoding="ISO-8859-1"?>

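Without the encoding present, the parser assumes UTF-8 or UTF-16, which is
why it's "eating" those two characters: in UTF-8, the byte 0xE7 is the lead
byte of a three-byte sequence, so the two bytes that follow it are consumed as
part of the same character.  A document which does not have the correct
encoding is not an XML document -- it's just a document that's pretending to
be XML.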
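
To illustrate the point, here is a minimal sketch in plain C++ (no Xerces
calls) of a lenient UTF-8 decoder that trusts the lead byte and skips the
whole announced sequence.  The function names utf8SequenceLength and
asciiSurvivors are made up for this example, and a real parser does more
validation than this, so treat it only as a model of the symptom:

```cpp
#include <cstddef>
#include <string>

// Number of bytes a UTF-8 decoder expects for a sequence starting with `lead`.
// Only the bit pattern is checked; returns 0 for a continuation byte
// (10xxxxxx) or for 0xF8 and above, which cannot start a sequence.
std::size_t utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx (0xE7 falls here)
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Models a lenient decoder: it advances by the length the lead byte
// announces, without checking the continuation bytes, and keeps only
// the single-byte (ASCII) characters it encounters.
std::string asciiSurvivors(const std::string& bytes) {
    std::string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = utf8SequenceLength(lead);
        if (len == 1) out.push_back(bytes[i]);
        i += (len == 0 ? 1 : len);  // skip a stray byte one at a time
    }
    return out;
}
```

Feeding it the Latin1 bytes of the word in question, written as
"Fran\xE7" "ais" so the hex escape does not swallow the following 'a',
leaves only "Frans", which matches what was reported.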

There might be some hack available to force the parser to use a particular
encoding, but you're better off asking your content provider to generate a
proper XML document.
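
If the provider cannot be persuaded, one possible stopgap is to re-encode the
bytes as UTF-8 before the parser ever sees them.  The sketch below is only an
illustration under stated assumptions: latin1ToUtf8 is a hypothetical helper,
not a Xerces API, and it assumes the input really is ISO-8859-1 (not
Windows-1252) and carries no conflicting encoding declaration.  Every Latin1
byte maps directly to the Unicode code point of the same value, so bytes
0x80 and above become two-byte UTF-8 sequences:

```cpp
#include <string>

// Hypothetical helper: re-encode ISO-8859-1 bytes as UTF-8.
// Each Latin1 byte equals its Unicode code point, so bytes below 0x80
// pass through unchanged and the rest become a two-byte sequence.
std::string latin1ToUtf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size());
    for (char ch : latin1) {
        unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            utf8.push_back(static_cast<char>(b));
        } else {
            utf8.push_back(static_cast<char>(0xC0 | (b >> 6)));    // 110000xx
            utf8.push_back(static_cast<char>(0x80 | (b & 0x3F)));  // 10xxxxxx
        }
    }
    return utf8;
}
```

With this, the byte 0xE7 becomes the pair 0xC3 0xA7, which is the UTF-8 form
of the same character.  Fixing the declaration at the source is still the
cleaner answer.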

Dave



"Jeremy Sheeley" <jeremy@sourcegear.com>
11/30/2001 03:56 PM
Please respond to xerces-c-dev

To:      <[EMAIL PROTECTED]>
cc:      (bcc: David N Bertoni/CAM/Lotus)
Subject: Transcoding ISO-8859-1 (Latin1) help needed



I have an XML document (WML, actually) that has this string in it:

Français

Note that the c is the squiggly c (c with a cedilla) that is not in ASCII.
It's in Latin1, where it's represented by the byte 0xE7.  I know this because
I ran hexdump on the file, and that was the byte at the character's position.

When I parse this with UnRep_Throw set, I get this exception.

Fatal Error: An exception occured! Type:TranscodingException,
Message:Unicode char 0x6829 is not representable in encoding

When I parse it with UnRep_RepChar, I get back "Frans".  I know that it's
just eating the two characters after the strange one, because I tried
"Franç ais" and got "Franais".

Since it's WML, I tried specifying the character as "Fran&#xE7;ais", which
worked great.  I can't guarantee that no content provider is going to put
Latin1 characters in their content, so what can I do to make sure that the
transcoder can represent the strange character?

I'm creating my transcoder like this:
  transcoder = XMLPlatformUtils::fgTransService->makeNewTranscoderFor(
      "ISO-8859-1", resCode, 8192);

and calling the transcodeTo method like this:

  transcoder->transcodeTo(toTranscode,
      (unsigned int)XMLString::stringLen(toTranscode),
      bufToFill, 8192, bytesEaten, unRepOpts);

Thanks for any help you can give me.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





