Thanks David, that did the trick.
I do have one question, though: is the offset a count of Unicode characters or a count of chars/bytes? I'm assuming it's Unicode characters, and if so I guess that's going to cause problems extracting the correct amount of data from the XML file. Possibly we could hook up a Xerces DOM parser that parses fragments specified by the offsets. Is that feasible?
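To illustrate why the unit matters: in a UTF-8 file, a character offset and a byte offset diverge as soon as a non-ASCII character appears. The sketch below is plain standard C++, not Xerces code, and the helper name byteOffsetOfCodePoint is made up for illustration. It assumes the buffer is well-formed UTF-8 and that the offset counts Unicode code points; if the offset were in UTF-16 code units instead, characters outside the Basic Multilingual Plane would count twice and the arithmetic would need adjusting.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte position at which the code point
// with index `codePointIndex` starts in a well-formed UTF-8 buffer.
// An index equal to the number of code points yields utf8.size().
std::size_t byteOffsetOfCodePoint(const std::string& utf8,
                                  std::size_t codePointIndex)
{
    std::size_t bytePos = 0;
    std::size_t seen = 0;
    while (bytePos < utf8.size() && seen < codePointIndex) {
        ++bytePos;  // step past the lead byte of the current code point
        // Skip continuation bytes, which all have the bit pattern 10xxxxxx.
        while (bytePos < utf8.size() &&
               (static_cast<unsigned char>(utf8[bytePos]) & 0xC0) == 0x80) {
            ++bytePos;
        }
        ++seen;
    }
    assert(seen == codePointIndex && "index lies past the end of the buffer");
    return bytePos;
}
```

For the three-character string "a", "é", "z" encoded as UTF-8, the third character starts at byte 3, not byte 2, because "é" occupies two bytes. For pure ASCII input the two kinds of offset coincide.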
Thanks again for your help,
Pete
David Cargill wrote:
Hi Pete, For SAX2 try using setFeature(XMLUni::fgXercesCalculateSrcOfs, true).
Regards, David A. Cargill XML Parser Development IBM Toronto Lab (905) 413-2371, tie 969 [EMAIL PROTECTED]
Pete Hodgson <[EMAIL PROTECTED]>
11/16/2004 12:52 PM
Please respond to xerces-c-dev
To: [EMAIL PROTECTED]
cc:
Subject: Re: Accessing file position information during a SAX parse
I've tried using SAX2XMLReader::getSrcOffset(), but XMLReader::getSrcOffset() throws a Reader_SrcOfsNotSupported exception.
Do I need to explicitly tell the parser to maintain source offset information? I noticed that SAXParser has a setCalculateSrcOfs() method, but I can't find an equivalent for SAX2XMLReader. Do I need to choose a specific scanner maybe?
Any help would be greatly appreciated!
Cheers,
Pete
Erik Rydgren wrote:
Try this path: SAXParser().getScanner().getSrcOffset()
The problem is that the getScanner method is protected. You might inherit the SAXParser into your own class to get access.
But it should give you the number of characters consumed by the XMLReader, which is the current file position.
/ Erik
-----Original Message-----
From: Pete Hodgson [mailto:[EMAIL PROTECTED]
Sent: 16 November 2004 16:53
To: [EMAIL PROTECTED]
Subject: Accessing file position information during a SAX parse
Hi everyone,
I was hoping for some advice regarding a problem my team is facing related to SAX parsing in Xerces-C++. I'm new to Xerces, and SAX in general, so please forgive any stupidity!
The application we're developing is processing /very/ large XML files that contain time-series data looking something like this:
<Root>
  <Header>
    <SomeMetaData>
    <SomeMoreMetaData>
    ...
    ...
  </Header>
  <Frame id="1">
    <LotsOfData>
    <LotsMoreData>
    <YetMoreData>
    ...
  </Frame>
  <Frame id="2">
    ...
  </Frame>
  <Frame id="3">
    ...
  </Frame>
  ...
  ...
  ...
</Root>
We've been using progressive parsing SAX to read the <Frame> data from these XML files, which works great because we can deal with it as a stream without having to read the entire file up front.
We've also been using the MSXML DOM implementation to read <Header> data with the same Schema as the <Header> element in the time-series files, but from other, small files.
The problem now is that we wish to access the <Header> data in these extremely large files. We don't want to use DOM to parse the entire file (for efficiency reasons), but we'd like to re-use the existing DOM-based implementation that we have for reading the <Header> schema (rather than implementing a new SAX parser for the <Header>).
So, I guess my question is, is there a way to discover the exact file location of an Element as it's encountered during a SAX parse? If we could get the location we could manually read the entire <Header> section into a string and DOM-parse the string. We'd also like to be able to access file location information for other reasons, such as to pre-parse the files and build a 'look up table' for the XML file, so that a particular section of the time series can be read in on demand with the help of a custom LocalFileSource.
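The "read the <Header> section into a string" step could be sketched roughly as follows, assuming the begin and end positions are already known as byte offsets into the file. This is plain standard C++ rather than Xerces code, and the function name readByteRange is invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ios>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns the bytes in [begin, end) of the file at `path`.
// Both offsets are assumed to be byte positions measured from the file start.
std::string readByteRange(const std::string& path,
                          std::streamoff begin,
                          std::streamoff end)
{
    assert(begin >= 0 && begin <= end);

    // Binary mode so no newline translation shifts the offsets.
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }

    in.seekg(begin, std::ios::beg);
    std::string out(static_cast<std::size_t>(end - begin), '\0');
    in.read(&out[0], static_cast<std::streamsize>(out.size()));
    if (in.gcount() != static_cast<std::streamsize>(out.size())) {
        throw std::runtime_error("range extends past the end of the file");
    }
    return out;
}
```

The returned string could then be handed to whatever DOM parser already handles the <Header> schema. A lookup table for the time series would amount to storing one such begin/end pair per <Frame> and calling the same helper on demand.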
The closest thing I've found is Locator, but that doesn't help because it gives you a line and column, rather than an absolute location within the file. I looked into peeking at the BinInputStream that the SAX2XMLReader is using, but that doesn't work because the stream is read in chunks, so calling BinInputStream::curPos() when the Header element is encountered doesn't supply the exact location either. I know that that would have been a kludgy solution anyways, but it would have served our purposes.
Any suggestions on how best to solve this one?
Many Thanks,
Pete Hodgson
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]