On Mon, 07 Feb 2011 07:40:30 -0500, Steven Schveighoffer <schvei...@yahoo.com> wrote:

On Fri, 04 Feb 2011 17:36:50 -0500, Tomek Sowiński <j...@ask.me> wrote:

Steven Schveighoffer napisał:

Here is how I would approach it (without doing any research).

First, we need a buffered I/O system where you can easily access and
manipulate the buffer. I have proposed one a few months ago in this NG.

Second, I'd implement the XML lib as a range where "front()" gives you an XMLNode. If the XMLNode is an element, it will have eager access to the
element tag, and lazy access to the attributes and the sub-nodes.  Each
XMLNode will provide a forward range for the child nodes.

Thus you can "skip" whole elements in the stream by popFront'ing a range,
and dive deeper via accessing the nodes of the range.

I'm unsure how well this will work, or if you can accomplish all of it
without reallocation (in particular, you may need to store the element
information, maybe via a specialized member function?).

Heh, yesterday when I couldn't sleep I was sketching the design. I converged to a pretty much same concept, so your comment is reassuring :).

The design I'm thinking is that the node iterator will own a buffer. One consequence is that the fields of the current node will point to the buffer akin to foreach(line; File.byLine), so in order to lift the input the user will have to dup (or process the node in-place). As new nodes will be overwritten on the same piece of memory, an important trait of the design emerges: cache intensity. Because of XML namespaces I think it is necessary for the buffer to contain the current node plus all its parents.

That might not scale well. For instance, if you are accessing the 1500th child element of a parent, doesn't that mean that the buffer must contain the full text for the previous 1499 elements in order to also contain the parent?

Maybe I'm misunderstanding what you mean.

I would start out with a non-compliant parser, but one that allocates nothing beyond the I/O buffer, one that simply parses lazily and can be used as well as a SAX parser. Then see how much extra allocations we need to get it to be compliant. Then, one can choose the compliancy level based on what performance penalties one is willing to incur.

-Steve

I would consider a tokenizer which can be used for SAX style parsing to be a key feature of std.xml. I know it was considered very important when I was gathering requirements for my std.JSON re-write.

Reply via email to