Hello Everyone!

I am trying to write an application that has to parse a sequence of XML
documents (thousands of them) from a file or stream. Every document in
the sequence should be well-formed XML, but they are not necessarily
all in the same encoding. The stream will look something like this:

:BEGIN EXAMPLE: 

<?xml version="1.0" encoding="utf-8"?>
<document id="1">
  ... content ...
</document>

<?xml version="1.0" encoding="iso-8859-1"?>
<document id="2">
  ... content ...
</document>

...

:END EXAMPLE:

The problem is that if there is a well-formedness error in any of the
documents, I don't want to discard the whole stream, since there may be
thousands of good, well-formed XML documents in it. I want to discard
just that one document, then recover and continue parsing the next one.

Does anyone have any suggestions on how to do this "the right way"?

I was thinking of deriving my own InputSource class that would be
similar to LocalFileInputSource, but would keep reusing the same
BinFileInputStream object for every makeStream() call. I would then
supply this InputSource to SAX2XMLReader::parse(), reset the
SAX2XMLReader after each document is complete, and call parse() again
and again ...

This should work fine (I haven't tried it yet, though) if all documents
in the stream are well-formed. If not, the parser will die halfway
through the document. At that point I will have to recover by searching
for the closing </document> tag, so that I can start parsing the next
document right after it. But in order to do that I need to know what
encoding the malformed document was in. Is there any way to get access
to that information?
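
One fallback I can think of, if the parser does not expose this, is to
read the encoding straight out of the XML declaration myself. Below is a
minimal sketch using only the standard library (no Xerces calls). The
name sniffDeclaredEncoding is my own invention, and the sketch assumes
an ASCII-compatible encoding such as utf-8 or iso-8859-1, so that the
declaration can be read as plain bytes; UTF-16 or EBCDIC input would
need byte-order-mark based detection first. It also assumes that it is
called with the bytes starting exactly at "<?xml", and that UTF-8 is the
right default when no encoding is declared.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Xerces-C++): returns the encoding name
// declared in the XML declaration at the very start of `bytes`, or "UTF-8"
// if there is no declaration or no encoding pseudo-attribute in it.
std::string sniffDeclaredEncoding(const std::string& bytes)
{
    const std::string fallback = "UTF-8";
    const std::string space = " \t\r\n";

    // The declaration must be the first thing in the document, and "<?xml"
    // must be followed by whitespace (this rules out "<?xml-stylesheet").
    if (bytes.size() < 6 || bytes.compare(0, 5, "<?xml") != 0 ||
        space.find(bytes[5]) == std::string::npos)
        return fallback;

    // Only look inside the declaration itself, i.e. up to the first "?>".
    const std::size_t declEnd = bytes.find("?>");
    if (declEnd == std::string::npos)
        return fallback;
    const std::string decl = bytes.substr(0, declEnd);

    const std::size_t key = decl.find("encoding");
    if (key == std::string::npos)
        return fallback;

    // Find the opening quote (either ' or ") and the matching closing one.
    const std::size_t open = decl.find_first_of("\"'", key);
    if (open == std::string::npos)
        return fallback;
    const std::size_t close = decl.find(decl[open], open + 1);
    if (close == std::string::npos)
        return fallback;

    return decl.substr(open + 1, close - open - 1);
}
```

The returned name could then be used to decide how to scan the raw bytes
for the closing </document> tag. This only helps when the declaration
itself is intact, of course.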

I can see other problems with this approach too (e.g. what if the
well-formedness error occurs even before the opening <document> tag?),
and therefore I am wondering whether I am on the right path at all.
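
An alternative I am considering is to resynchronise on the next XML
declaration rather than on </document>, by cutting the raw stream into
chunks before any parsing happens. Here is a rough sketch of that idea in
plain standard C++. The name splitOnXmlDecl is hypothetical, and the
sketch rests on assumptions that may not hold: every document starts
with "<?xml " as in the example above, that byte sequence never occurs
inside document content (a CDATA section could contain it), the encoding
is ASCII-compatible, and the whole stream fits in memory.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: cuts `stream` into chunks, each starting at a
// "<?xml " marker and running up to the next marker (or the end of input).
// Anything before the first marker is dropped.
std::vector<std::string> splitOnXmlDecl(const std::string& stream)
{
    const std::string marker = "<?xml ";
    std::vector<std::string> docs;

    std::size_t start = stream.find(marker);
    while (start != std::string::npos) {
        // Search for the next declaration strictly after the current one.
        const std::size_t next = stream.find(marker, start + marker.size());

        // When next == npos, substr() simply takes the rest of the input.
        const std::size_t len =
            (next == std::string::npos) ? std::string::npos : next - start;
        docs.push_back(stream.substr(start, len));

        start = next;
    }
    return docs;
}
```

Each chunk could then be parsed on its own (I believe a memory-based
input source such as MemBufInputSource would do for that), so a
well-formedness error anywhere in one chunk would cost only that chunk.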

Any advice on this is really appreciated. Thanks a lot,

        -- Matt

