On 2011-02-04 16:02:39 -0500, Tomek Sowiński <j...@ask.me> said:
I am now intensely accumulating information on how to go about creating
a high-performance parser as it quickly became clear that my old one
won't deliver. And if anything is clear, it's that memory is the key.
One way is the slicing approach mentioned on this NG, notably used by
RapidXML. I already contacted Marcin (the author) to ensure that using
solutions inspired by his lib is OK with him; it is. But I don't think
I'll go this way. One reason is, surprisingly, performance. RapidXML
cannot start parsing until the entire document is loaded and ready as a
random-access string. Then it's blazingly fast but the time for I/O has
already elapsed. Besides, as Marcin himself said, we need a 100%
W3C-compliant implementation and RapidXML isn't one.
I think a much more fertile approach is to operate on a forward range,
perhaps assuming buffered input. That way I can start parsing as soon
as the first buffer gets filled. Not to mention that the end result
will use much less memory. Plenty of the XML data stream is indents,
spaces, and markup -- there's no reason to copy all this into memory.
To sum up, I believe memory use and overlapping I/O latency with
parsing work are pivotal.
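The streaming idea above -- begin parsing as soon as the first buffer
arrives, and never store the indentation and whitespace between tags --
can be sketched roughly as a pull-style scanner. This is hypothetical
illustration code, not any proposed API; `elementNames` and its
simplifications (no entity handling, no error checking) are mine:

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Consumes characters one at a time from any std::istream, so work can
// start as soon as the first buffered chunk is available. Text and
// indentation between tags are skipped, never copied into memory.
std::vector<std::string> elementNames(std::istream& in) {
    std::vector<std::string> names;
    char c;
    while (in.get(c)) {
        if (c != '<') continue;           // skip character data without storing it
        if (in.peek() == '/') {           // closing tag: consume and ignore
            while (in.get(c) && c != '>') {}
            continue;
        }
        std::string name;                 // only the element name is materialized
        while (in.get(c) && c != '>' && c != '/' &&
               !std::isspace(static_cast<unsigned char>(c)))
            name += c;
        if (c != '>' && c != '/')         // skip any attributes
            while (in.get(c) && c != '>') {}
        names.push_back(name);
    }
    return names;
}
```

Because the scanner only ever looks at the current character, the same
loop works whether the source is a socket, a file stream, or an
in-memory buffer -- which is exactly what makes the forward-range model
attractive.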
I agree it's important, especially when receiving XML over the network,
but I also think it's important to be able to support
slicing. Imagine all the memory you could save by just making slices of
a memory-mapped file.
The difficulty is supporting both models: the input range model, which
requires copying the strings, and the slicing model, where you're just
taking slices of a string.
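A rough sketch of the two memory models, using C++ `string_view` for
the zero-copy case (the function names `sliceToken`/`copyToken` are my
own illustration, not anything from RapidXML or a proposed design):

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Slicing model: the whole document stays resident (e.g. a
// memory-mapped file), so a token can be a zero-copy view into it.
std::string_view sliceToken(std::string_view doc, size_t pos, size_t len) {
    return doc.substr(pos, len);   // no allocation, just a pointer + length
}

// Input range model: with one-pass buffered input the characters are
// gone once the buffer is refilled, so the token must own a copy.
std::string copyToken(std::string_view buf, size_t pos, size_t len) {
    return std::string(buf.substr(pos, len));   // allocates and copies
}
```

The hard part is exactly what's described above: one parser interface
that hands out views when the input is sliceable but falls back to
copies when it is not, without the two paths diverging.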
--
Michel Fortin
michel.for...@michelf.com
http://michelf.com/