Tomek Sowiński wrote:
I am now intensely accumulating information on how to go about creating a 
high-performance parser, as it quickly became clear that my old one won't 
deliver. And if anything is clear, it's that memory is the key.

One way is the slicing approach mentioned on this NG, notably used by RapidXML. 
I already contacted Marcin (the author) to ensure that using solutions inspired 
by his lib is OK with him; it is. But I don't think I'll go this way. One 
reason is, surprisingly, performance. RapidXML cannot start parsing until the 
entire document is loaded and ready as a random-access string. Then it's 
blazingly fast, but the time for I/O has already elapsed. Besides, as Marcin 
himself said, we need a 100% W3C-compliant implementation, and RapidXML isn't 
one.
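
For reference, the slicing idea is roughly the following (a hypothetical sketch, 
not RapidXML's actual code): the document is fully loaded first, and every name 
or value the parser reports is just a slice into that buffer, so nothing is copied.

// Hypothetical sketch of the slicing approach: the document must already be
// fully in memory, and the parser hands out slices of it instead of copies.
string tagName(string doc, size_t lt)
{
    // 'lt' is the index of a '<'; scan forward to the end of the tag name.
    size_t i = lt + 1;
    while (i < doc.length && doc[i] != ' ' && doc[i] != '>' && doc[i] != '/')
        ++i;
    return doc[lt + 1 .. i];   // a view into 'doc', no allocation
}

unittest
{
    string document = `<book title="XML"/>`;
    assert(tagName(document, 0) == "book");
}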

I think a much more fertile approach is to operate on a forward range, perhaps 
assuming buffered input. That way I can start parsing as soon as the first 
buffer gets filled. Not to mention that the end result will use much less 
memory. Plenty of the XML data stream is indents, spaces, and markup -- there's 
no reason to copy all that into memory.
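
To make that concrete, here is a rough sketch (in D, with made-up names; not an 
actual Phobos API) of a tokenizer working on a forward range of characters: it 
can start as soon as the first buffer is available, and the whitespace it skips 
is never stored anywhere.

import std.array : appender;
import std.range : ElementType, isForwardRange;
import std.uni : isWhite;

// Rough sketch only: a tag reader over any forward range of characters,
// e.g. a buffered stream wrapper or a plain string.
struct TagReader(R) if (isForwardRange!R && is(ElementType!R : dchar))
{
    R input;

    // Advance past indentation and whitespace without keeping any of it.
    void skipWhitespace()
    {
        while (!input.empty && isWhite(input.front))
            input.popFront();
    }

    // Read the next tag name, e.g. "<book>" yields "book".
    // Error handling and attributes are omitted for brevity.
    string nextTagName()
    {
        skipWhitespace();
        if (input.empty || input.front != '<')
            return null;
        input.popFront();                       // consume '<'
        auto name = appender!string();
        while (!input.empty && input.front != '>' && !isWhite(input.front))
        {
            name.put(input.front);
            input.popFront();
        }
        return name.data;
    }
}

unittest
{
    auto reader = TagReader!string("  \n  <book title='XML'>");
    assert(reader.nextTagName() == "book");
}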

To sum up, I believe memory usage and overlapping I/O latency with parsing 
effort are pivotal.

Please comment on this.


A few years ago I needed to write my own parser in Delphi for my XMPP (Jabber) client and server. In XMPP you get a socket-streamed XML document with XML elements as protocol messages ("XMPP stanzas"). The problem I had was that no Delphi parser had hybrid SAX/DOM support, i.e. I wanted to parse XML nodes like SAX, but once I had received a whole message I wanted to get it back as an XML element (like in DOM). This way I could easily process incoming messages -- nowadays this is achievable with pull parsers.
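
For illustration, the hybrid idea might look roughly like this in D; the Event and Element types below are invented for the sketch, not an existing API. Events are consumed one by one as they arrive on the socket, and once the top-level element closes, the complete stanza is handed back as a small tree.

import std.range.primitives : empty, front, popFront;

// Illustrative types only -- not an existing parser API.
enum EventKind { open, text, close }

struct Event
{
    EventKind kind;
    string data;   // tag name for open/close, character data for text
}

class Element
{
    string name;
    string text;
    Element[] children;
    this(string name) { this.name = name; }
}

// Consume SAX-like events until one complete top-level element (an XMPP
// stanza) has been closed, then return it as a DOM-like tree. Returns null
// if the event stream runs dry first, i.e. more input is still needed.
Element buildStanza(R)(ref R events)
{
    Element[] stack;
    Element root;

    while (!events.empty)
    {
        auto ev = events.front;
        events.popFront();

        if (ev.kind == EventKind.open)
        {
            auto node = new Element(ev.data);
            if (stack.length) stack[$ - 1].children ~= node;
            else root = node;
            stack ~= node;
        }
        else if (ev.kind == EventKind.text)
        {
            if (stack.length) stack[$ - 1].text ~= ev.data;
        }
        else // EventKind.close
        {
            if (stack.length) stack = stack[0 .. $ - 1];
            if (stack.length == 0)
                return root;         // the whole stanza has arrived
        }
    }
    return null;                     // incomplete: wait for more socket data
}

unittest
{
    Event[] stream = [
        Event(EventKind.open, "message"),
        Event(EventKind.open, "body"),
        Event(EventKind.text, "hello"),
        Event(EventKind.close, "body"),
        Event(EventKind.close, "message"),
    ];
    auto stanza = buildStanza(stream);
    assert(stanza !is null && stanza.name == "message");
    assert(stanza.children[0].text == "hello");
}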

I think std needs both SAX/pull and DOM parsers. For DOM, if the whole document is in memory, maybe this approach could be advantageous:

http://en.wikipedia.org/wiki/VTD-XML
http://vtd-xml.sourceforge.net/
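
To give a feel for the approach, the core of the VTD idea is roughly the following (an illustrative sketch, not VTD-XML's real record format): the document stays in memory untouched, and the "parse tree" is just a flat array of small fixed-size records pointing back into it.

// Illustrative record layout -- not the actual VTD-XML format.
enum TokenType : ubyte { elementName, attributeName, attributeValue, text }

struct TokenRecord
{
    size_t offset;    // where the token starts in the original document
    size_t length;    // how many code units it spans
    uint depth;       // nesting depth in the document
    TokenType type;
}

// The "DOM" is the untouched document plus a flat index over it.
struct IndexedDocument
{
    string document;        // the whole XML text, loaded once
    TokenRecord[] records;  // one fixed-size entry per token, in order

    // Extracting a token's text is a slice, not a copy or a node object.
    string tokenText(size_t i)
    {
        auto r = records[i];
        return document[r.offset .. r.offset + r.length];
    }
}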
