Re: High performance XML parser

Tomek Sowiński Tue, 08 Feb 2011 16:20:58 -0800

Steven Schveighoffer napisał:

> > The design I'm thinking is that the node iterator will own a buffer. One  
> > consequence is that the fields of the current node will point to the  
> > buffer akin to foreach(line; File.byLine), so in order to lift the input  
> > the user will have to dup (or process the node in-place). As new nodes  
> > will be overwritten on the same piece of memory, an important trait of  
> > the design emerges: cache intensity. Because of XML namespaces I think  
> > it is necessary for the buffer to contain the current node plus all its  
> > parents.  
> 
> That might not scale well.  For instance, if you are accessing the 1500th  
> child element of a parent, doesn't that mean that the buffer must contain  
> the full text for the previous 1499 elements in order to also contain the  
> parent?
> 
> Maybe I'm misunderstanding what you mean.


Let's talk on an example:

<a name="value">
        <b>
                Some Text 1
                <c2>      <!-- HERE -->
                Some text 2
                </c2>
                Some Text 3
        </b>
</a>

The buffer of the iterator positioned HERE would be:

[Node a | Node b | Node c2]

Node c2 and all its parents are available for inspection. Node a's attribute is 
stored in the buffer, but not b's "Some text 1" as it is c2's sibling; "Some 
text 1" was available in the previous iteration, now it's overwritten by c2. To 
get to "Some text 2" let's advance the iterator in depth to get:

[Node a | Node b | Node c2 | Text node "Some text 2"]

Advancing it once more we get to:

[Node a | Node b | Text node "Some text 3"]

So "Some text 3" is written where c2 and the text node 2 used to be.

The element type of the range would always be the child, parents available 
through pointers:

foreach (node; xmlRange) {
        doStuff(node);
        if (Node* parent = node.parent)
                doOtherStuff(parent);
}

Having no access to siblings is quite limiting but the iterator can form an 
efficient (zero-allocation) basis on which more convenient schemes are built 
upon. It's still just brain-storming, though. I fear there's something that'll 
make the whole idea crash & burn.

> I would start out with a non-compliant parser, but one that allocates  
> nothing beyond the I/O buffer, one that simply parses lazily and can be  
> used as well as a SAX parser.  Then see how much extra allocations we need  
> to get it to be compliant.  Then, one can choose the compliancy level  
> based on what performance penalties one is willing to incur.

Yeah, 100% compliance is a long way.

-- 
Tomek

Re: High performance XML parser

Reply via email to