One approach would be to write your own Reader that "fixes" invalid UTF-8 sequences as it encounters them. The fix could be to drop the sequence or, probably better, to replace it with a best-guess character. For example, you could assume that each invalid byte is really a Windows-1252 character (where 0x91 and 0x92 are the curly single quotes) and make the appropriate substitution. It is important that this reader also handle *valid* UTF-8 sequences correctly.
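Here is a rough sketch of what I mean, not tested against Xerces. The class name LenientUtf8 and its two methods are made up for illustration, and the Windows-1252 fallback is my assumption about where the stray bytes came from. To keep it short it buffers the whole document in memory rather than streaming, and it uses InputStream.readAllBytes, which needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decodes UTF-8, treating each invalid byte as a
// single character in a fallback charset (assumed here: windows-1252).
final class LenientUtf8 {
    private static final Charset FALLBACK = Charset.forName("windows-1252");

    private LenientUtf8() {}

    static String decode(byte[] bytes) {
        // REPORT makes the decoder stop at bad input instead of replacing it.
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // Neither UTF-8 nor the fallback yields more chars than bytes consumed.
        CharBuffer out = CharBuffer.allocate(bytes.length);
        while (true) {
            CoderResult result = utf8.decode(in, out, true);
            if (result.isUnderflow()) {
                break; // all input consumed
            }
            if (result.isOverflow()) {
                throw new IllegalStateException("output buffer too small");
            }
            // The decoder stopped with the position at the first bad byte.
            // Re-interpret that one byte in the fallback charset and go on.
            byte bad = in.get();
            out.put(new String(new byte[] {bad}, FALLBACK));
            utf8.reset();
        }
        out.flip();
        return out.toString();
    }

    // Reads the whole stream, then exposes the repaired text as a Reader.
    static Reader reader(InputStream stream) throws IOException {
        return new StringReader(decode(stream.readAllBytes()));
    }
}
```

You would hand the result to the parser wrapped in an org.xml.sax.InputSource, e.g. new InputSource(LenientUtf8.reader(stream)), so that the parser reads characters from your Reader instead of decoding the bytes itself.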
Armed with such a reader, you could attempt to parse a document normally. If that fails, and if the document claims to be UTF-8 encoded, then you would retry the parse using your custom Reader (rather than letting the parser create its own reader).

-- fas

F. Andy Seidl, Co-founder
MyST Technology Partners
http://myst-technology.com | http://blogsite.com

-----Original Message-----
From: Alistair Young [mailto:[EMAIL PROTECTED]
Sent: Monday, March 14, 2005 8:08 AM
To: [EMAIL PROTECTED]
Subject: "Unconvertible UTF-8 character beginning with 0x91"

I wonder if anyone can suggest a way of ignoring or using a "default" char for the above error?

I'm using Xalan to prettify an RSS feed but one of the posts has been corrupted in a database upgrade on the weblog. The result is non UTF-8 characters (they used to be single quotes) but now they're 0x91 and 0x92. It's using xerces to parse the rss xml. Is there any way to insert default chars for corrupted ones?

thanks,

Alistair

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
