Hi all,
This is a follow-up question to my original post below about accessing file position information during a SAX2 parse.
I've been experimenting with using SAX2XMLReader::getSrcOffset() to find the file position that elements were found at. The problem with getSrcOffset() is that it doesn't always correspond exactly to the absolute file position, as (for example) XMLReader will sometimes silently eat the LF portion of a CR/LF pair, only adding 1 to the offset count rather than 2.
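To illustrate the drift, here is a minimal standalone sketch (no Xerces involved). The function name normalizedOffset is made up for this example, and it models only the CR/LF case described above; any other differences between the reported offset and the physical position (encoding conversion, for instance) are not modelled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts characters the way a newline-normalizing reader
// would, i.e. a CR/LF pair advances the count by 1 although it occupies
// 2 bytes in the file.
std::size_t normalizedOffset(const std::string& raw, std::size_t rawPos) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < rawPos && i < raw.size(); ++i) {
        const bool crOfPair =
            raw[i] == '\r' && i + 1 < raw.size() && raw[i + 1] == '\n';
        if (crOfPair) {
            continue;  // the pair is counted once, when the LF is reached
        }
        ++count;
    }
    return count;
}
```

For the input "<a>\r\n<b/>\r\n</a>", the <b/> element starts at byte 5 but at normalized offset 4, and the gap grows by one for every CR/LF line ending that precedes the element.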
I guess it's not really feasible to use Xerces to report the physical location of elements within a file, unless there's another solution that I'm not aware of.
So, my new question is: what would be involved in dynamically switching the ContentHandler attached to a SAX2 parser, so that some portions of a document could be parsed into a DOM document and others handled by a custom implementation of ContentHandler? Essentially, I want to switch to a DOM builder when I encounter a <Header> element, then switch back to my own ContentHandler implementation once the Header element has been parsed.
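One shape this could take is a single handler that stays registered with the parser and forwards each event to whichever delegate is currently active. The sketch below is standalone and does not use Xerces: Handler, Recorder and Switcher are hypothetical stand-ins, with Handler playing the role of ContentHandler and carrying only the two element callbacks. A real version would forward the remaining callbacks (characters, namespaces and so on) the same way.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the element callbacks of a SAX2 ContentHandler.
struct Handler {
    virtual ~Handler() = default;
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
};

// Records the events it receives; stands in for either a DOM-building
// handler or the custom frame handler.
struct Recorder : Handler {
    std::vector<std::string> events;
    void startElement(const std::string& name) override { events.push_back("+" + name); }
    void endElement(const std::string& name) override { events.push_back("-" + name); }
};

// Registered with the parser once. Routes the <Header> subtree to one
// delegate and everything else to the other.
class Switcher : public Handler {
public:
    Switcher(Handler& header, Handler& other) : header_(header), other_(other) {}

    void startElement(const std::string& name) override {
        if (depth_ == 0 && name == "Header") {
            depth_ = 1;   // entering the Header subtree
        } else if (depth_ > 0) {
            ++depth_;     // nested element inside Header
        }
        active().startElement(name);
    }

    void endElement(const std::string& name) override {
        active().endElement(name);  // the closing </Header> still goes to header_
        if (depth_ > 0) {
            --depth_;               // reaches 0 once </Header> has been delivered
        }
    }

private:
    Handler& active() { return depth_ > 0 ? header_ : other_; }

    Handler& header_;
    Handler& other_;
    std::size_t depth_ = 0;  // 0 means "outside Header"
};
```

Tracking depth rather than only watching for the closing name means a nested element that also happens to be called Header would not end the switch early.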
Does anyone have any advice, or references to any relevant documentation?
Thanks in advance,
Pete
Pete Hodgson wrote:
Hi everyone,
I was hoping for some advice regarding a problem my team is facing related to SAX parsing in Xerces-C++. I'm new to Xerces, and SAX in general, so please forgive any stupidity!
The application we're developing is processing /very/ large XML files that contain time-series data looking something like this:
<Root>
  <Header>
    <SomeMetaData>
    <SomeMoreMetaData>
    ...
    ...
  </Header>
  <Frame id="1">
    <LotsOfData>
    <LotsMoreData>
    <YetMoreData>
    ...
  </Frame>
  <Frame id="2"> ... </Frame>
  <Frame id="3"> ... </Frame>
  ...
  ...
  ...
</Root>
We've been using progressive SAX parsing to read the <Frame> data from these XML files, which works great because we can deal with it as a stream without having to read the entire file up front.
We've also been using the MSXML DOM implementation to read <Header> data with the same Schema as the <Header> element in the time-series files, but from other, small files.
The problem now is that we wish to access the <Header> data in these extremely large files. We don't want to use DOM to parse the entire file (for efficiency reasons), but we'd like to re-use the existing DOM-based implementation that we have for reading the <Header> schema (rather than implementing a new SAX parser for the <Header>).
So, I guess my question is, is there a way to discover the exact file location of an Element as it's encountered during a SAX parse? If we could get the location we could manually read the entire <Header> section into a string and DOM-parse the string. We'd also like to be able to access file location information for other reasons, such as to pre-parse the files and build a 'look up table' for the XML file, so that a particular section of the time series can be read in on demand with the help of a custom LocalFileSource.
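For comparison, a lookup table of this kind can also be built without the parser, by scanning the raw bytes for start tags. The sketch below is deliberately naive and rests on stated assumptions: the function name findStartTags is made up, the data is taken to be in an ASCII-compatible encoding such as UTF-8, and no attempt is made to skip comments, CDATA sections or processing instructions that happen to contain the same text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the byte offset of every "<name" start tag in
// the raw file contents. Naive scan only; see the assumptions stated above.
std::vector<std::size_t> findStartTags(const std::string& bytes,
                                       const std::string& name) {
    const std::string needle = "<" + name;
    std::vector<std::size_t> offsets;
    std::size_t pos = 0;
    while ((pos = bytes.find(needle, pos)) != std::string::npos) {
        const std::size_t next = pos + needle.size();
        const char c = next < bytes.size() ? bytes[next] : '\0';
        // Accept only a complete name, so "<Frame" does not match "<Frames".
        if (c == '>' || c == '/' || c == ' ' || c == '\t' || c == '\r' || c == '\n') {
            offsets.push_back(pos);
        }
        pos = next;
    }
    return offsets;
}
```

Because the scan works on the bytes as stored, the offsets it returns are physical file positions, unaffected by any line-ending handling in the parser. For a very large file the same loop would need to run over buffered chunks, with care taken for tags that straddle a chunk boundary.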
The closest thing I've found is Locator, but that doesn't help because it gives you a line and column, rather than an absolute location within the file. I looked into peeking at the BinInputStream that the SAX2XMLReader is using, but that doesn't work because the stream is read in chunks, so calling BinInputStream::curPos() when the Header element is encountered doesn't supply the exact location either. I know that that would have been a kludgy solution anyways, but it would have served our purposes.
Any suggestions on how best to solve this one?
Many Thanks,
Pete Hodgson
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]