Re: High performance XML parser

Robert Jacques Mon, 07 Feb 2011 07:42:10 -0800

On Mon, 07 Feb 2011 07:40:30 -0500, Steven Schveighoffer <schvei...@yahoo.com> wrote:

On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <j...@ask.me> wrote:
Steven Schveighoffer wrote:
Here is how I would approach it (without doing any research).

First, we need a buffered I/O system where you can easily access and
manipulate the buffer. I proposed one a few months ago in this NG.
Second, I'd implement the XML lib as a range where "front()" gives you
an XMLNode. If the XMLNode is an element, it will have eager access to the
element tag, and lazy access to the attributes and the sub-nodes.  Each
XMLNode will provide a forward range for the child nodes.
Thus you can "skip" whole elements in the stream by popFront'ing a range,
and dive deeper via accessing the nodes of the range.
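
A rough sketch of the lazy node range described above, written in Python rather than D, so generators stand in for ranges. Everything named here (`tokens`, `Node`, `parse`) is hypothetical and not part of any proposed std.xml API. The tokenizer is deliberately non-compliant: it ignores attributes, comments, CDATA and the XML prolog, and does not check that end tags match. The point is only that the tag is available eagerly, children are produced on demand, and advancing past a node drains whatever part of its subtree was left unread.

```python
import re

# Hypothetical, non-compliant token pattern: start/end/empty tags or text.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*?(/?)>|([^<]+)")

def tokens(text):
    """Lazily yield ('start', name), ('end', name) or ('text', data)."""
    for m in TOKEN.finditer(text):
        closing, name, empty, data = m.groups()
        if data is not None:
            yield ("text", data)
        elif closing:
            yield ("end", name)
        else:
            yield ("start", name)
            if empty:               # <x/> counts as start followed by end
                yield ("end", name)

class Node:
    def __init__(self, tag, stream):
        self.tag = tag              # eager: known when the node is produced
        self._stream = stream       # token iterator shared with all nodes
        self._child = None          # child handed out most recently
        self._done = False          # True once our end tag was consumed

    def children(self):
        """Lazily yield child elements (text is ignored in this sketch)."""
        while True:
            if self._child is not None:
                self._child.skip()  # drain what the caller left unread
                self._child = None
            if self._done:
                return
            tok = next(self._stream, None)
            if tok is None or tok[0] == "end":
                self._done = True
                return
            if tok[0] == "start":
                self._child = Node(tok[1], self._stream)
                yield self._child

    def skip(self):
        """Consume the rest of this element without building anything."""
        for _ in self.children():
            pass

def parse(text):
    """Return the root node; assumes the first token is a start tag."""
    stream = tokens(text)
    _, name = next(stream)
    return Node(name, stream)

doc = "<root><skipme><deep><deeper/></deep></skipme><item><name/><price/></item></root>"
for node in parse(doc).children():
    if node.tag == "item":          # <skipme> is never descended into
        print([child.tag for child in node.children()])
```

Moving to the next child of `root` plays the role of popFront on the outer range, and calling `children()` on a node is the "dive deeper" step.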

I'm unsure how well this will work, or if you can accomplish all of it
without reallocation (in particular, you may need to store the element
information, maybe via a specialized member function?).
Heh, yesterday when I couldn't sleep I was sketching the design. I converged on pretty much the same concept, so your comment is reassuring :).
The design I'm thinking of is that the node iterator will own a buffer. One consequence is that the fields of the current node will point to the buffer, akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). As new nodes will overwrite the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents.
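
To illustrate the byLine-style ownership rule, here is a minimal Python sketch (not D) in which the iterator owns one buffer and hands out views into it. `iter_in_place` is a made-up name, and the pre-split `records` list stands in for a real I/O layer, which is an assumption of the example. A view is only valid until the next step; copying it with `bytes(view)` plays the role of dup.

```python
def iter_in_place(records, buf):
    """Copy each record into the shared buffer and yield a view of it.

    The yielded memoryview aliases `buf`, so its contents change as soon
    as the next record is written over the same bytes.
    """
    for record in records:
        n = len(record)
        if n > len(buf):
            raise ValueError("record does not fit in the buffer")
        buf[:n] = record            # same-length slice write, no reallocation
        yield memoryview(buf)[:n]

buf = bytearray(16)
views, copies = [], []
for view in iter_in_place([b"<a>", b"<bb>", b"<c>"], buf):
    views.append(view)              # keeps aliasing the shared buffer
    copies.append(bytes(view))      # "dup": an independent copy

print(copies)            # the three records, intact
print(bytes(views[0]))   # no longer b"<a>": the buffer was overwritten
```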
That might not scale well. For instance, if you are accessing the 1500th child element of a parent, doesn't that mean that the buffer must contain the full text for the previous 1499 elements in order to also contain the parent?
Maybe I'm misunderstanding what you mean.
I would start out with a non-compliant parser, but one that allocates nothing beyond the I/O buffer, one that simply parses lazily and can be used as well as a SAX parser. Then see how many extra allocations we need to get it to be compliant. Then one can choose the compliance level based on what performance penalties one is willing to incur.
-Steve

I would consider a tokenizer which can be used for SAX style parsing to be a key feature of std.xml. I know it was considered very important when I was gathering requirements for my std.JSON re-write.
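
As a minimal sketch of what "a tokenizer which can be used for SAX style parsing" could look like, here is a Python version (not D) with hypothetical names `tokenize` and `sax_parse`. It is non-compliant by design: attributes are dropped, and comments, CDATA and the prolog are not handled. The same lazy token stream can be pulled directly in a loop or pushed through callbacks.

```python
import re

# Hypothetical, non-compliant token pattern: start/end/empty tags or text.
_TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*?(/?)>|([^<]+)")

def tokenize(text):
    """Lazily yield ('start', name), ('end', name) or ('text', data)."""
    for m in _TOKEN.finditer(text):
        closing, name, empty, data = m.groups()
        if data is not None:
            yield ("text", data)
        elif closing:
            yield ("end", name)
        else:
            yield ("start", name)
            if empty:               # <x/> counts as start followed by end
                yield ("end", name)

def sax_parse(text, on_start=None, on_end=None, on_text=None):
    """SAX-style driver: forward each token to its callback, if one is set."""
    handlers = {"start": on_start, "end": on_end, "text": on_text}
    for kind, value in tokenize(text):
        handler = handlers[kind]
        if handler is not None:
            handler(value)

started = []
sax_parse("<a><b>hi</b><c/></a>", on_start=started.append)
print(started)
```

Because the driver is only a loop over the tokenizer, a pull-style consumer can use `tokenize` directly and stop early without paying for the rest of the document.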
