Re: use of utf-8 with SAX

David_N_Bertoni Fri, 22 Jun 2001 07:05:07 -0700

UTF-8 can represent any Unicode character, so there should be no problem as
long as the characters in the document _are_ valid.

The usual reason the parser gives this error is an invalid UTF-8 byte
sequence in the element's content: while decoding it, the parser ends up
consuming the beginning of the markup for the closing tag, so it never sees
the end tag.  You should check the bytes in that element's content to make
sure you actually have correctly encoded characters.
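
One way to run that check outside the parser is a small stand-alone scan
over the raw bytes.  The following is only a minimal sketch, not Xerces-C
code: the function name firstInvalidUtf8 is made up for illustration, and it
assumes the element content (or the whole file) has already been read into a
std::string.  It reports the offset of the first byte that does not start a
well-formed UTF-8 sequence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Xerces-C): returns the offset of the
// first byte that does not begin a well-formed UTF-8 sequence, or
// std::string::npos if the whole buffer is well-formed.
std::size_t firstInvalidUtf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len;      // total length of the sequence in bytes
        unsigned long cp;     // code point decoded so far
        unsigned long minCp;  // smallest code point allowed for this length

        if (b0 < 0x80)                { len = 1; cp = b0;        minCp = 0x0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; minCp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; minCp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; minCp = 0x10000; }
        else return i;  // stray continuation byte, or a byte in 0xF8..0xFF

        if (i + len > n) return i;  // sequence is cut off by the end of input

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return i;  // expected a 10xxxxxx byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < minCp) return i;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return i;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return i;                 // beyond Unicode range

        i += len;
    }
    return std::string::npos;
}
```

For example, given the bytes of `<a>` followed by 0xE4 0xB8 and then `</a>`,
the scan reports offset 3: the lead byte 0xE4 announces a three-byte
character, but the third byte is the `<` of the closing tag.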

Interestingly enough, I was just about to post a bug because the parser
accepts invalid UTF-8 byte sequences without reporting an error.  It's also
possible that the parser's UTF-8 decoder has a bug with some byte
sequences, but it's been around for a while, so it's been through a lot of
testing.

Dave



                                                                                       
                       
From:    tbentley@iris.com
To:      [EMAIL PROTECTED]
cc:      (bcc: David N Bertoni/CAM/Lotus)
Date:    06/22/2001 08:12 AM
Subject: Re: use of utf-8 with SAX
Please respond to xerces-c-dev

Not sure if this is accurate, but I thought some Asian languages could not
be represented in UTF-8.  Can someone confirm this?  Is there a way to escape
the problem character(s)?

Regards,
Thom Bentley
Iris Associates, 5 Technology Park Drive, Westford, MA 01886, 617-693-9210,


                                                                           
From:    "KELLEHER,KEVIN (Non-HP-Roseville,ex1)" <[EMAIL PROTECTED]>
To:      "'[EMAIL PROTECTED]'" <[EMAIL PROTECTED]>
cc:
Date:    06/21/2001 05:58 PM
Subject: use of utf-8 with SAX
Please respond to xerces-c-dev

I am having some trouble with Asian-language data in the SAX parser.

Specifically, some data that is originally in Taiwanese (roc15)
is converted to utf-8 and embedded in an XML message.
All the tags and attributes, etc. are in English, all the data is
in Taiwanese.

The problem occurs when I use the SAX parser to validate the message:
it hits a piece of data that it interprets as end-of-data, and complains
that it can't find the end tag that should follow the data.

I get this error in versions 1.3 and 1.5, in my own code and when I
run my data through the sample programs (i.e., SAXPrint, SAX2Print,
SAXCount, etc.).

Several people familiar with the language have confirmed the fitness of the
data.

My code is modeled after the SAXPrint example - is there anything missing
there for processing Asian language data written in utf-8?


Kevin Kelleher

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


