On 2011-02-04 16:02:39 -0500, Tomek Sowiński <j...@ask.me> said:
I am now intensely accumulating information on how to go about creating
a high-performance parser as it quickly became clear that my old one
won't deliver. And if anything is clear, it's that memory is the key.
One way is the slicing approach mentioned on this NG, notably used by
RapidXML. I already contacted Marcin (the author) to ensure that using
solutions inspired by his lib is OK with him; it is. But I don't think
I'll go this way. One reason is, surprisingly, performance. RapidXML
cannot start parsing until the entire document is loaded and ready as a
random-access string. Then it's blazingly fast but the time for I/O has
already elapsed. Besides, as Marcin himself said, we need a 100%
W3C-compliant implementation and RapidXML isn't one.
I think a much more fertile approach is to operate on a forward range,
perhaps assuming buffered input. That way I can start parsing as soon
as the first buffer gets filled. Not to mention that the end result
will use much less memory. Plenty of the XML data stream is indents,
spaces, and markup -- there's no reason to copy all this into memory.
To sum up, I believe memory use and overlapping I/O latency with
parsing work are pivotal.
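The streaming idea above -- begin parsing as soon as the first buffer
arrives, and never store the indentation and whitespace between tags --
can be sketched roughly as a pull-style scanner. This is hypothetical
illustration code, not any proposed API; `elementNames` and its
simplifications (no entity handling, no error checking) are mine:

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Consumes characters one at a time from any std::istream, so work can
// start as soon as the first buffered chunk is available. Text and
// indentation between tags are skipped, never copied into memory.
std::vector<std::string> elementNames(std::istream& in) {
    std::vector<std::string> names;
    char c;
    while (in.get(c)) {
        if (c != '<') continue;           // skip character data without storing it
        if (in.peek() == '/') {           // closing tag: consume and ignore
            while (in.get(c) && c != '>') {}
            continue;
        }
        std::string name;                 // only the element name is materialized
        while (in.get(c) && c != '>' && c != '/' &&
               !std::isspace(static_cast<unsigned char>(c)))
            name += c;
        if (c != '>' && c != '/')         // skip any attributes
            while (in.get(c) && c != '>') {}
        names.push_back(name);
    }
    return names;
}
```

Because the scanner only ever looks at the current character, the same
loop works whether the source is a socket, a file stream, or an
in-memory buffer -- which is exactly what makes the forward-range model
attractive.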
I agree it's important, especially when receiving XML over the network,
but I also think it's important to be able to support
slicing. Imagine all the memory you could save by just making slices of
a memory-mapped file.
The difficulty is supporting both models: the input range model, which
requires copying the strings, and the slicing model, where you're just
taking slices of a string.
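A rough sketch of the two memory models, using C++ `string_view` for
the zero-copy case (the function names `sliceToken`/`copyToken` are my
own illustration, not anything from RapidXML or a proposed design):

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Slicing model: the whole document stays resident (e.g. a
// memory-mapped file), so a token can be a zero-copy view into it.
std::string_view sliceToken(std::string_view doc, size_t pos, size_t len) {
    return doc.substr(pos, len);   // no allocation, just a pointer + length
}

// Input range model: with one-pass buffered input the characters are
// gone once the buffer is refilled, so the token must own a copy.
std::string copyToken(std::string_view buf, size_t pos, size_t len) {
    return std::string(buf.substr(pos, len));   // allocates and copies
}
```

The hard part is exactly what's described above: one parser interface
that hands out views when the input is sliceable but falls back to
copies when it is not, without the two paths diverging.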
--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/