Polk, John R. wrote:

Aleksander,
I found a post similar to a problem I am having with your name associated with it. I was wondering if you found a clean solution to your issue. I am writing a client application that is trying to parse multiple XML responses over the same socket connection. I have a problem parsing the second response because it starts with "<?xml...". I was hoping to be able to cleanly reset the parser between response messages, but I have not succeeded. Do you have any suggestions?

hi John,

(i CCed [email protected] as the same question was raised - [1] below)

it looks to me that the key thing to notice is that your input has two layers: high-level markers (<?xml version="1.0" encoding="UTF-8"?>) separate the actual XML documents, and that higher-level layer is *not* XML, so it should be processed without an XML parser. (it helps that <?xml is reserved and not used elsewhere except inside <![CDATA[ sections; to be 100% correct you would need to scan for <![CDATA[ too, as CDATA can contain anything: http://www.w3.org/TR/REC-xml/#sec-cdata-sect)

if you have control over the format (it seems you do not?) i would suggest using something similar to HTTP chunked encoding: write the size of every output chunk before writing the chunk itself, and mark the last chunk in the chain (size=-1) so you know when a document is finished. this would lead to much more efficient streaming (and lets you send metadata about files if you put it beside the chunk sizes), as no string pattern matching is needed and there is no worry about CDATA because the layering is completely transparent (using <?xml ...?> as a separator is not very good without XML-independent markers ...)
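to make the chunked idea concrete, here is a minimal sketch of such framing. it is only an illustration under assumptions: the class and method names (ChunkFraming, writeDocument, readDocument) are hypothetical, the size field is assumed to be a 4-byte big-endian int as written by DataOutputStream, and -1 is used as the end-of-document marker as suggested above. the receiver never looks at the XML content, so CDATA is not an issue.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

// Hypothetical framing helper: names and the 4-byte size field are assumptions.
final class ChunkFraming {
    static final int END_OF_DOCUMENT = -1;

    // Sender side: write one document as size-prefixed chunks, then the end marker.
    static void writeDocument(DataOutputStream out, byte[] doc, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        for (int off = 0; off < doc.length; off += chunkSize) {
            int len = Math.min(chunkSize, doc.length - off);
            out.writeInt(len);        // size of the chunk that follows
            out.write(doc, off, len); // the chunk itself
        }
        out.writeInt(END_OF_DOCUMENT);
        out.flush();
    }

    // Receiver side: returns the bytes of the next document,
    // or null when the stream ends before another size field arrives.
    static byte[] readDocument(DataInputStream in) throws IOException {
        int size;
        try {
            size = in.readInt();
        } catch (EOFException e) {
            return null; // no more documents
        }
        ByteArrayOutputStream doc = new ByteArrayOutputStream();
        while (size != END_OF_DOCUMENT) {
            if (size < 0) {
                throw new IOException("bad chunk size " + size);
            }
            byte[] chunk = new byte[size];
            in.readFully(chunk); // blocks until the whole chunk has arrived
            doc.write(chunk, 0, size);
            size = in.readInt();
        }
        return doc.toByteArray();
    }
}
```

each byte[] returned by readDocument can then be wrapped in a ByteArrayInputStream and handed to the parser as one complete document.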

for the particular enveloping scheme you have (using <?xml ... as the marker): a simple solution i would use is to create a composite reader (or input stream) that is buffered. the first thing it does is set a mark (start internal buffering); then it scans its input for the next '<?xml version="1.0" encoding="UTF-8"?>\n<root' (as you know that is the marker for the 2nd doc). it then creates a new reader that only allows reading the input up to the beginning of the 2nd document or EOF (so its content is the 1st doc) and passes that reader to the xml parser. then the process is repeated: a new mark is set and the input is scanned for the next <?xml...?>, which is the beginning of the 3rd document (or EOF).

that is it - it should be fairly efficient, as the key to IO performance is to read data in chunks and avoid copying. here not much is done, but a more advanced version could work as a streaming pipeline and hook into MultiXmlCompositeReader.read(...) to scan for the end-of-document marker (or EOF). then there is no memory overhead, as only chunks need to be buffered and not the whole document (possibly over multiple read()s to discover the marker, as a single read() may return only part of it). one level of buffering is done in MultiXmlCompositeReader and another in the xml parser, but that is hard to avoid.


in pseudo code
   InputStream in = ... // obtained elsewhere, e.g. from the socket
   MultiXmlCompositeReader mr = new MultiXmlCompositeReader(new InputStreamReader(in, "UTF-8"));
   Reader r;
   while ((r = mr.nextDocumentReader()) != null) {
     xmlparser.setInput(r);
     xmlparser.parse() ...
   }

still, you should then add CDATA scanning to make it completely correct.
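a minimal sketch of how such a MultiXmlCompositeReader could look follows. it is an illustration under assumptions, not a finished implementation: it treats the literal string "<?xml " (with a trailing space) as the document marker, assumes the stream starts directly with a declaration, buffers one whole document in memory, blocks until the next marker or EOF has been read, and does not do the CDATA scanning mentioned above. MultiXmlDemo and rootElementNames are hypothetical names used only to show the loop with a JAXP DocumentBuilder.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Sketch only: splits a character stream at each "<?xml " marker (CDATA is NOT handled).
final class MultiXmlCompositeReader {
    private static final String MARKER = "<?xml "; // assumed document separator
    private final Reader in;
    private final StringBuilder pending = new StringBuilder();
    private boolean eof;

    MultiXmlCompositeReader(Reader in) {
        this.in = in;
    }

    // Returns a reader over the next document, or null when the input is exhausted.
    Reader nextDocumentReader() throws IOException {
        char[] chunk = new char[4096];
        int scanFrom = 1; // skip the declaration that opens the current document
        while (true) {
            int next = pending.indexOf(MARKER, scanFrom);
            if (next >= 0) {
                return take(next); // everything before the next marker is one document
            }
            if (eof) {
                return pending.toString().trim().isEmpty() ? null : take(pending.length());
            }
            // The marker may straddle two reads, so re-scan the last few characters.
            scanFrom = Math.max(1, pending.length() - MARKER.length() + 1);
            int n = in.read(chunk);
            if (n < 0) {
                eof = true;
            } else {
                pending.append(chunk, 0, n);
            }
        }
    }

    private Reader take(int end) {
        String doc = pending.substring(0, end);
        pending.delete(0, end);
        return new StringReader(doc);
    }
}

// Hypothetical usage: parse every document and collect the root element names.
final class MultiXmlDemo {
    static List<String> rootElementNames(Reader input) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        MultiXmlCompositeReader mr = new MultiXmlCompositeReader(input);
        List<String> names = new ArrayList<>();
        Reader r;
        while ((r = mr.nextDocumentReader()) != null) {
            Document doc = builder.parse(new InputSource(r));
            names.add(doc.getDocumentElement().getNodeName());
        }
        return names;
    }
}
```

because each document gets its own reader, the parser hits a normal end of input after the first root element and never sees the second declaration.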

however, if you are concerned about correctness and want to handle everything in one xml parser stream, i would instead write MultiXmlCompositeReader to actually transform the stream as follows:

ORIG:
  <?xml version="1.0" encoding="UTF-8"?>
  <root/>
  <?xml version="1.0" encoding="UTF-8"?>
  <root2/>

TRANSFORMED:
  <super-root>
  <root/>
  <root2/>
  </super-root>

i.e. add a wrapper XML element (<super-root>) around everything and remove every <?xml...?> when you see '<?xml...?>\n<root'

this can also be done as a streaming reader/filter with careful coding (especially if the XML content is signed and you want to make sure that CDATA content containing <?xml ...?> is not modified ....)
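the transformation can be sketched as below. this is a simplified, non-streaming illustration that works on text already read into a String; SuperRootWrapper and wrap are hypothetical names, and "<?xml " is again assumed to be the marker. it copies CDATA sections through untouched, so a <?xml ...?> inside CDATA is preserved, but it makes no attempt to handle comments or other constructs that might contain the marker.

```java
// Sketch only: wraps a multi-document stream in <super-root> and drops XML declarations.
final class SuperRootWrapper {
    static String wrap(String stream) {
        StringBuilder out = new StringBuilder("<super-root>");
        int i = 0;
        while (i < stream.length()) {
            if (stream.startsWith("<![CDATA[", i)) {
                // Copy the whole CDATA section verbatim, whatever it contains.
                int end = stream.indexOf("]]>", i);
                int stop = end < 0 ? stream.length() : end + 3;
                out.append(stream, i, stop);
                i = stop;
            } else if (stream.startsWith("<?xml ", i)) {
                // Skip an XML declaration up to and including its closing "?>".
                int end = stream.indexOf("?>", i);
                i = end < 0 ? stream.length() : end + 2;
            } else {
                out.append(stream.charAt(i++));
            }
        }
        return out.append("</super-root>").toString();
    }
}
```

the result is a single well-formed document whose children are the original root elements, so one parser run can process all of them.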

HTH

alek

[1] Massimo Valla wrote:

Hi Michael.
Thank you for your reply. I definitely agree on your point. The protocol is awful. But, unfortunately I cannot change the server side nor the protocol. I could assume that each document ends when the root tag is closed. So your example could be parsed and received as two documents: 1st doc:
   <?xml version="1.0" encoding="UTF-8"?>
   <root/>
2nd doc
   <?xml version="1.0" encoding="UTF-8"?>
   <root2/>
leaving out the comment as not belonging to either of the two docs.
The problem is that with Xerces, as soon as I receive the first end tag SAX notification, the parser has already buffered part of the next XML message, so starting another parse command on the input stream will not work. How can I set up a solution similar to FAQ-11 (of Xerces1) in Xerces2? More generally, how can I write a client with Xerces that is able to parse multiple XML documents coming from the socket? (I have also tried other parsers: they allow char-by-char parsing and they would not close the input stream after a parse error, so I would be fine using them. But I would very much prefer to stay with Xerces as it is the parser used in Java 1.5...) Thanks a lot,
Massimo
On 2/12/06, Michael Glavassevich <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote:
> Hi Massimo,
>
> The KeepSocketOpen sample works because the server socket tells the client
> how many bytes there are in the document. If the server has no protocol
> for communicating the boundaries between XML documents, how can you tell
> where one begins and another ends?
>
> Consider if your client receives this from the socket:
>
> <root/>
> <!-- comment -->
> <root2/>
>
> How would you know whether you've received two documents or one not
> well-formed document containing multiple root elements? And if this is
> processed as two documents does the comment belong to the first or the
> second? Only the sender could know that.
>
> Thanks.
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
>
> Massimo Valla < [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote on 02/04/2006 09:54:25 PM:
>
> > Hi,
> > I am trying to read multiple XML files from a socket using JAXP 1.3
> > / Xerces-J 2.7.1.
> >
> > Unfortunately the KeepSocketOpen example in Xerces2 Socket Sample (
> > http://xerces.apache.org/xerces2-j/samples-socket.html) does not
> > work for me, because I have no control over the other side of the
> > socket.
> >
> > Also FAQ-11 of Xerces1 (http://xerces.apache.org/xerces-j/faq-write.html#faq-11)
> > does not help anymore, because the
> > StreamingCharFactory class used there to prevent buffering cannot be
> > used in Xerces2 (cannot compile the class).
> >
> > I have been trying to find a solution to this for a while now, but I
> > have not been able to find one.
> >
> > Can anybody provide a simple example on how to read multiple XML
> > docs from a socket InputStream?
> >
> > Thanks a lot,
> > Massimo
>



--
The best way to predict the future is to invent it - Alan Kay




