Re: read byte offset information during xml parsing

Andy Clark 4 Jan 2005 19:36:48 -0000

Xiaoming Liu wrote:

It seems to me that byte offsets are an efficient way to get fast access
to very large XML files. It is probably also doable by character offset,
but I don't know of a Java class that provides character-based random
access.


Reporting byte offsets is just not possible, for a number
of reasons. The primary one is that Xerces does not
control the decoding of the source bytes to characters.

In order to maintain the proper byte location within the
stream, the parser would need to know *exactly* how many
bytes were read by the underlying input stream. Since we
rely on the decoders present in Java, this just isn't
possible.

I once had the idea of putting a special byte-counting
input stream filter between the underlying stream and the
reader that converted the bytes to chars. I thought that
I could read one char at a time and then look at the
current byte offset to see how many bytes were actually
used to encode that single character. But this didn't
work.

It turned out that the underlying decoders were buffering
internally. So even if I asked the reader for a single char,
the reader would buffer 1K or 2K of data. This fact makes
it impossible to do anything using the default readers to
report true byte offsets.
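
For anyone who wants to see the buffering for themselves, here is a
minimal sketch of the experiment described above, using only the JDK.
The names CountingInputStream and BufferingDemo are made up for this
example and are not part of Xerces. It reads a single char through an
InputStreamReader and reports how many bytes the reader pulled from the
counting filter underneath it.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical filter that counts every byte handed to its consumer. */
class CountingInputStream extends FilterInputStream {
    private long count;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            count++;            // one byte delivered
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;         // n bytes delivered in bulk
        }
        return n;
    }

    long getCount() {
        return count;
    }
}

class BufferingDemo {
    /** Reads ONE char via InputStreamReader; returns the bytes the reader consumed. */
    static long bytesPulledForOneChar(byte[] data) {
        CountingInputStream counter =
                new CountingInputStream(new ByteArrayInputStream(data));
        Reader reader = new InputStreamReader(counter, StandardCharsets.UTF_8);
        try {
            reader.read();      // ask for a single character only
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter.getCount();
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        Arrays.fill(data, (byte) 'x');   // 4096 one-byte characters
        System.out.println("bytes pulled for one char: "
                + bytesPulledForOneChar(data));
    }
}
```

The count comes back far larger than the one byte that encodes the
first character, so the filter's position says nothing about where the
char just returned actually ended in the stream.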

By the way, the Java-based XP parser does provide a way to locate the
byte offset of a start element event. Oddly, it doesn't provide ways to
locate endElement and other events [1]. Since XP is not actively


My guess is that the byte offsets reported by XP are only
valid for specific encodings: fixed-width encodings in which
every character takes the same number of bytes (e.g. US-ASCII,
ISO-8859-1, UCS-4, or UTF-16 without surrogate pairs) *and*
assuming no Unicode character normalization. Unless, of
course, he does the character conversions himself and can
keep track of the true byte offsets.
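
To make "does the character conversions himself" concrete, here is a
rough sketch of how a hand-written decoder can keep true byte offsets
for a variable-width encoding. I have not checked how XP does it; the
class Utf8OffsetTracker and its methods are hypothetical, and the code
assumes well-formed UTF-8 (it does not validate continuation bytes).

```java
import java.util.Arrays;

/** Hypothetical helper: walks UTF-8 bytes and records where each code point starts. */
class Utf8OffsetTracker {
    /** Length in bytes of the UTF-8 sequence introduced by the given lead byte. */
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) {
            return 1;                       // 0xxxxxxx: ASCII
        }
        if ((b & 0xE0) == 0xC0) {
            return 2;                       // 110xxxxx
        }
        if ((b & 0xF0) == 0xE0) {
            return 3;                       // 1110xxxx
        }
        if ((b & 0xF8) == 0xF0) {
            return 4;                       // 11110xxx
        }
        throw new IllegalArgumentException("not a UTF-8 lead byte: " + b);
    }

    /** result[i] is the byte offset at which the i-th code point begins. */
    static int[] codePointByteOffsets(byte[] utf8) {
        int[] offsets = new int[utf8.length];   // at most one code point per byte
        int n = 0;
        int pos = 0;
        while (pos < utf8.length) {
            offsets[n++] = pos;                 // this code point starts here
            pos += sequenceLength(utf8[pos]);   // skip over its whole sequence
        }
        return Arrays.copyOf(offsets, n);
    }
}
```

Because the decoder itself advances the byte position, it always knows
exactly where each character began, which is the information the stock
Java readers hide behind their buffers.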

If you still need to report byte offsets with Xerces, I
think the only way to do it properly is to pre-normalize
your docs to a fixed length encoding and then use the
Xerces feature that reports character offsets. Then it's
just a matter of multiplying the character offset by the
char "width" of that encoding. Make sense?
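
Here is a small sketch of that arithmetic, using only the JDK. The class
FixedWidthOffsets and its methods are made up for illustration. It
assumes the parser's character offset is zero-based and counts Java
chars, and it picks UTF-16BE as the fixed-width target because each
Java char then occupies exactly 2 bytes and no byte order mark is
written. The parser would also have to be told the new encoding, since
the XML declaration in the document may still name the old one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for the "pre-normalize, then multiply" approach. */
class FixedWidthOffsets {
    /** byte offset = BOM length + character offset * bytes per character. */
    static long toByteOffset(long charOffset, int bytesPerChar, int bomLength) {
        return bomLength + charOffset * bytesPerChar;
    }

    /** Re-encodes a document as UTF-16BE: 2 bytes per Java char, no BOM. */
    static byte[] normalize(byte[] source, Charset sourceCharset) {
        String text = new String(source, sourceCharset);
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        String doc = "<a>\u00e9\u20ac</a>";           // two non-ASCII characters
        byte[] utf8 = doc.getBytes(StandardCharsets.UTF_8);
        byte[] fixed = normalize(utf8, StandardCharsets.UTF_8);

        // Stand-in for the character offset a parser would report
        // for the end tag (assumed zero-based).
        int charOffset = doc.indexOf("</a>");
        long byteOffset = toByteOffset(charOffset, 2, 0);

        String found = new String(fixed, (int) byteOffset, 8,
                StandardCharsets.UTF_16BE);
        System.out.println("char offset " + charOffset
                + " -> byte offset " + byteOffset + " -> " + found);
    }
}
```

In the original UTF-8 bytes the same end tag sits at a different byte
position, because the two non-ASCII characters take 2 and 3 bytes
there. That is exactly why the multiplication only works after
normalizing to a fixed width.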

--
Andy Clark * [EMAIL PROTECTED]

