One semi-reasonable way to deal with it is this: the first time you see an exception from the transcoder, eat it and set a flag, provided it managed to transcode more than some small number of characters; take what it managed to get done and go on. When you see a second one, report it. (I didn't do this originally because I couldn't see ahead whether it would remain reasonable, given the need to support third-party transcoding packages.) This will let you get closer to the actual error, but it's impossible to guarantee that you can report it exactly. The thing you are trying to parse might be 50 characters long and span 5 lines of the original content, and if you can't get through it, the reported error position will be the place where you started parsing that thing.
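Here is a minimal sketch of that "eat the first one, report the second" idea. It is written in Python rather than the parser's own C++, and it is not taken from any real parser code. `ChunkedReader`, `TranscodingError`, `MIN_PARTIAL_CHARS` and `parse` are all made-up names, and the threshold of 4 characters is just a guess at "some small number". It uses `codecs.utf_8_decode`, the low-level helper behind CPython's own incremental UTF-8 decoder, so that a multi-byte sequence split across a chunk boundary is not mistaken for an error. As an extra, the reported error carries a byte offset into the original buffer.

```python
import codecs

MIN_PARTIAL_CHARS = 4  # hypothetical value for "some small number of characters"


class TranscodingError(Exception):
    """Raised when a transcoding failure is finally reported to the caller."""

    def __init__(self, byte_offset, reason):
        super().__init__(f"{reason} at byte offset {byte_offset}")
        self.byte_offset = byte_offset


class ChunkedReader:
    """Decodes UTF-8 input chunk by chunk, deferring the first decode error."""

    def __init__(self, data, chunk_size=16):
        if chunk_size < 4:
            raise ValueError("chunk_size must fit a whole UTF-8 sequence (4 bytes)")
        self.data = data
        self.chunk_size = chunk_size
        self.pos = 0                 # byte offset of the next undecoded byte
        self.error_deferred = False  # set once the first error has been eaten

    def refill(self):
        """Return the next run of decoded text, or raise TranscodingError."""
        raw = self.data[self.pos:self.pos + self.chunk_size]
        final = self.pos + len(raw) >= len(self.data)
        try:
            # With final=False, a sequence cut off by the chunk boundary is
            # left unconsumed instead of being treated as an error.
            text, consumed = codecs.utf_8_decode(raw, "strict", final)
        except UnicodeDecodeError as err:
            good = raw[:err.start].decode("utf-8")  # text before the bad bytes
            if not self.error_deferred and len(good) > MIN_PARTIAL_CHARS:
                # First failure with useful output: keep the output, stay quiet.
                self.error_deferred = True
                self.pos += err.start
                return good
            # Second failure, or too little was salvaged: report it now.
            raise TranscodingError(self.pos + err.start, err.reason) from err
        self.pos += consumed
        return text


def parse(data):
    """Toy stand-in for the parser: returns (text seen so far, error or None)."""
    reader = ChunkedReader(data)
    seen = []
    try:
        while reader.pos < len(data):
            seen.append(reader.refill())
    except TranscodingError as err:
        return "".join(seen), err
    return "".join(seen), None


# 0xE9 is Latin-1 "e acute", which is not valid UTF-8 in this position.
bad_input = b"<?xml version='1.0'?>\n<doc>caf\xe9</doc>"
text_before_error, error = parse(bad_input)
```

With the deferral, the toy parser gets to consume everything up to the bad byte before the error surfaces, so whatever position it reports is close to the real problem. Without it, the whole second chunk would be lost and the reported position would sit at the end of the first chunk.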
The other thing is to at least try to give an offset into the original buffer, in bytes, of the transcoding error position, but it'll always be impossible to be really accurate on reporting transcoding errors in terms of parsed XML content. And having a byte offset might do little good for most folks if the text is in some complex encoding.

-------------------------------------
Dean Roddey
The Charmed Quark Controller
[EMAIL PROTECTED]
www.charmedquark.com

-----Original Message-----
From: Neil Graham [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 23, 2003 5:37 PM
To: [EMAIL PROTECTED]
Subject: Re: Possible bug: invalid byte 1 (...) of a 1-byte sequence.

Hi Matt,

I can reproduce the behaviour you experience, and the reason is this:

When the parser is reading UTF-8 from some source, it reads it in chunks to maximize performance as much as possible. The routines that look through the markup performing tokenization, well-formedness checking etc. operate on this internal buffer--where everything's already in UTF-16. The error reporting routines work relative to the routines that are concerned with the XML markup, since those are where most problems arise and that's the natural specific domain of an XML parser.

When the markup routines have finished the XML declaration, they'll ask for more text, which will cause the transcoding routines to go merrily along their way to fill the requisite buffer. When the transcoder finds something it can't stomach it complains, but the error reporting logic only knows where the parser left off looking for markup.

So yes, this is a bug. But it wouldn't be all that easy to fix, especially for transcoders that we don't own. So I'm afraid the probability of this being addressed in the near future isn't high. You might want to file a bugzilla report to keep this on the radar scope, in case anyone ever has the cycles to give it a serious run.

Cheers,
Neil

Neil Graham
XML Parser Development
IBM Toronto Lab
Phone: 905-413-3519, T/L 969-3519
E-mail: [EMAIL PROTECTED]
