One semi-reasonable way to deal with it (I didn't do it originally because
I couldn't be sure it would stay reasonable, given the need to support
third-party transcoding packages) is this: the first time you see an
exception from the transcoder, eat it and set a flag. If the transcoder
managed to transcode more than some small number of characters, take what
it got done and go on. When you see a second exception, report it. That
lets you get closer to the actual error, but it's impossible to guarantee
that you can report it exactly. The thing you are trying to parse might be
50 characters long and cover 5 lines of the original content, and if you
can't get through it, the reported error position will be the place where
you started parsing that thing.
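
To make the idea concrete, here is a minimal sketch of the "eat the first
exception" approach. None of these names come from the parser itself:
TranscodeError, DeferredErrorReader and minChars are all made up for
illustration, and the sketch assumes the transcoder can hand back whatever
it produced before it failed, which a third-party transcoder may not do.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical exception type: carries the text the transcoder managed to
// produce before it hit the bad bytes.
struct TranscodeError : std::runtime_error {
    std::u16string partial;
    explicit TranscodeError(std::u16string done)
        : std::runtime_error("transcoding failed"), partial(std::move(done)) {}
};

// Hypothetical wrapper around whatever refills the parser's UTF-16 buffer.
class DeferredErrorReader {
public:
    // The transcoder returns the next chunk or throws TranscodeError.
    using Transcoder = std::function<std::u16string()>;

    DeferredErrorReader(Transcoder transcode, std::size_t minChars)
        : transcode_(std::move(transcode)), minChars_(minChars) {}

    std::u16string refill() {
        try {
            return transcode_();
        } catch (TranscodeError& e) {
            // First failure with enough progress: swallow it, remember that
            // we did, and let the parser work through the partial text.
            if (!sawError_ && e.partial.size() > minChars_) {
                sawError_ = true;
                return std::move(e.partial);
            }
            // Second failure, or too little progress: report it now.
            throw;
        }
    }

private:
    Transcoder transcode_;
    std::size_t minChars_;
    bool sawError_ = false;
};
```

By the time the second exception propagates, the parser has consumed the
partial text, so its notion of the current position is closer to the bad
bytes, though still not guaranteed to be exact.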

The other option is to at least try to give the byte offset of the
transcoding error within the original buffer. It will still never be
possible to report transcoding errors really accurately in terms of parsed
XML content, and a byte offset might do little good for most folks if the
text is in some complex encoding.
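
As a rough sketch of what reporting a byte offset could look like for
UTF-8 input, the function below returns the offset of the first malformed
sequence in a raw buffer. The name firstInvalidUtf8Offset is made up for
this example, and the check is deliberately simplified: it only looks at
lead and continuation byte patterns and does not reject overlong forms or
surrogates.

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Returns the byte offset where the first malformed UTF-8 sequence starts,
// or std::nullopt if the buffer passes this (simplified) structural check.
std::optional<std::size_t> firstInvalidUtf8Offset(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        if (lead < 0x80)                len = 1;  // ASCII
        else if ((lead & 0xE0) == 0xC0) len = 2;  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx
        else return i;                            // stray continuation or bad lead

        if (i + len > bytes.size()) return i;     // sequence cut off by end of buffer

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return i;     // expected 10xxxxxx
        }
        i += len;
    }
    return std::nullopt;
}
```

For a Latin-1 byte dropped into a UTF-8 stream, such as "<a>caf\xE9</a>",
this gives offset 6. For "\xC3\xA9\xC3\xA9\xFF" it gives offset 4 even
though only two characters precede the bad byte, which is the kind of
mismatch that makes a raw byte offset hard to use with multi-byte text.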

-------------------------------------
Dean Roddey
The Charmed Quark Controller
[EMAIL PROTECTED]
www.charmedquark.com
 


-----Original Message-----
From: Neil Graham [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 23, 2003 5:37 PM
To: [EMAIL PROTECTED]
Subject: Re: Possible bug: invalid byte 1 (...) of a 1-byte sequence.






Hi Matt,

I can reproduce the behaviour you experience, and the reason is this:  When
the parser is reading UTF-8 from some source, it reads it in chunks to
maximize performance as much as possible.  The routines that look through
the markup performing tokenization, well-formedness checking etc. operate on
this internal buffer--where everything's already in UTF-16.  The error
reporting routines work relative to the routines that are concerned with the
XML markup, since those are where most problems arise and that's the natural
specific domain of an XML parser.

When the markup routines have finished the XML declaration, they'll ask for
more text, which will cause the transcoding routines to go merrily along
their way to fill the requisite buffer.  When the transcoder finds something
it can't stomach it complains, but the error reporting logic only knows
where the parser left off looking for markup.
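
A toy model of that behaviour, purely for illustration: ModelReader,
Position and the failAt parameter are invented here and are not the
parser's real classes. The position is only advanced as the markup scanner
consumes characters, so when a refill fails, the position still points at
the last character the scanner read.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct Position {
    std::size_t line = 1;
    std::size_t column = 1;
};

// Hypothetical stand-in for the reader: each element of `chunks` is one
// refill of the already-transcoded buffer, and refill number `failAt`
// simulates the transcoder hitting bytes it cannot handle.
class ModelReader {
public:
    ModelReader(std::vector<std::u16string> chunks, std::size_t failAt)
        : chunks_(std::move(chunks)), failAt_(failAt) {}

    // Hand the markup scanner one character, refilling when the buffer runs dry.
    char16_t next() {
        while (index_ >= buffer_.size()) refill();
        const char16_t c = buffer_[index_++];
        if (c == u'\n') { ++pos_.line; pos_.column = 1; }
        else            { ++pos_.column; }
        return c;
    }

    // Where the scanner is; this is all the error reporting can see.
    Position position() const { return pos_; }

private:
    void refill() {
        if (nextChunk_ == failAt_)
            throw std::runtime_error("transcoding failed");
        if (nextChunk_ >= chunks_.size())
            throw std::out_of_range("end of input");
        buffer_ = chunks_[nextChunk_++];
        index_ = 0;
    }

    std::vector<std::u16string> chunks_;
    std::size_t failAt_;
    std::size_t nextChunk_ = 0;
    std::u16string buffer_;
    std::size_t index_ = 0;
    Position pos_;
};
```

If the first chunk holds only the XML declaration and the second refill
fails, position() after the exception is the column just past the
declaration on line 1, wherever in the next chunk the bad bytes really are.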

So yes, this is a bug.  But it wouldn't be all that easy to fix, especially
for transcoders that we don't own.  So I'm afraid the probability of this
being addressed in the near future isn't high.

You might want to file a Bugzilla report to keep this on the radar scope, in
case anyone ever has the cycles to give it a serious run.

Cheers,
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  [EMAIL PROTECTED]



