RE: "Unconvertible UTF-8 character beginning with 0x91"

F. Andy Seidl 14 Mar 2005 15:04:03 -0000

One approach would be to write your own Reader that "fixes" invalid UTF-8
sequences as it encounters them.  The fix could be to ignore the sequence,
or, probably better, to replace it with a best-guess character.  For
example, you could assume that each invalid byte is really a Windows-1252
character (where 0x91 and 0x92 are the curly single quotes) and make the
appropriate substitution.  It is important that this reader also correctly
handle *valid* UTF-8 sequences.
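
A rough sketch of such a reader, in case it helps.  The class name
LenientUtf8 and its methods are made up for illustration, and the choice
of Windows-1252 as the fallback is an assumption about where the stray
bytes came from.  It buffers the whole document in memory, which is
probably fine for an RSS feed but is not a true streaming Reader.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes UTF-8, but treats each invalid byte as a
// single Windows-1252 character instead of failing.
final class LenientUtf8 {
    // Assumption: the corrupt bytes were originally Windows-1252.
    private static final Charset FALLBACK = Charset.forName("windows-1252");

    static String decode(byte[] bytes) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // One byte never yields more than one char, so this is large enough.
        CharBuffer out = CharBuffer.allocate(bytes.length + 1);
        while (true) {
            CoderResult result = utf8.decode(in, out, true);
            if (result.isUnderflow()) {
                break; // all input consumed
            }
            if (result.isError()) {
                // 'in' is positioned at the first offending byte:
                // substitute that one byte and resume decoding after it.
                byte bad = in.get();
                out.put(FALLBACK.decode(ByteBuffer.wrap(new byte[] { bad })));
            } else {
                throw new IllegalStateException("unexpected overflow");
            }
        }
        utf8.flush(out);
        out.flip();
        return out.toString();
    }

    // Reads the whole stream, then exposes the repaired text as a Reader.
    static Reader reader(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new StringReader(decode(buf.toByteArray()));
    }
}
```

Valid multi-byte sequences pass through the UTF-8 decoder untouched; only
bytes the decoder rejects go through the fallback, one byte at a time.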


Armed with such a reader, you could attempt to parse a document normally.
If that fails, and if the document claims to be UTF-8 encoded, then you
would retry the parse using your custom Reader (rather than letting the
parser create its own reader).
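
Roughly, the retry could look like the sketch below.  It uses the standard
JAXP DocumentBuilder; the class RetryingParser, the claimsUtf8 check, and
the lenientReader parameter are all hypothetical names, and the encoding
check is a crude regex over the first bytes rather than a real XML
declaration parser.  The lenientReader argument stands for whatever
custom Reader you end up writing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: parse normally, retry with a lenient Reader on failure.
final class RetryingParser {
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*encoding\\s*=\\s*[\"']([^\"']+)[\"']");

    // Crude check: no declared encoding, or a declared encoding of UTF-8.
    static boolean claimsUtf8(byte[] xml) {
        String head = new String(xml, 0, Math.min(xml.length, 200),
                StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(head);
        return !m.find() || m.group(1).equalsIgnoreCase("UTF-8");
    }

    private static DocumentBuilder newBuilder() throws ParserConfigurationException {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    static Document parse(byte[] xml, Function<byte[], Reader> lenientReader)
            throws ParserConfigurationException, SAXException, IOException {
        try {
            // First attempt: let the parser create its own reader.
            return newBuilder().parse(new InputSource(new ByteArrayInputStream(xml)));
        } catch (SAXException | IOException firstFailure) {
            if (!claimsUtf8(xml)) {
                throw firstFailure; // not a UTF-8 document, so do not guess
            }
            // Second attempt: hand the parser a character stream, so it
            // no longer decodes the bytes itself.
            return newBuilder().parse(new InputSource(lenientReader.apply(xml)));
        }
    }
}
```

Because the second InputSource wraps a character stream, the parser
ignores the encoding named in the XML declaration and takes the
characters as the Reader supplies them.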

  -- fas
 F. Andy Seidl, Co-founder
MyST Technology Partners
http://myst-technology.com | http://blogsite.com 
 
 

-----Original Message-----
From: Alistair Young [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 14, 2005 8:08 AM
To: [EMAIL PROTECTED]
Subject: "Unconvertible UTF-8 character beginning with 0x91"

I wonder if anyone can suggest a way of ignoring or using a "default" 
char for the above error?

I'm using Xalan to prettify an RSS feed, but one of the posts was
corrupted in a database upgrade on the weblog. The result is bytes that
are not valid UTF-8: what used to be single quotes are now 0x91 and
0x92.

It's using Xerces to parse the RSS XML. Is there any way to insert
default chars for the corrupted ones?

thanks,
Alistair


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
