Re: High performance XML parser

spir Tue, 08 Feb 2011 17:47:59 -0800

On 02/09/2011 01:16 AM, Tomek Sowiński wrote:

Steven Schveighoffer napisał:

The design I'm thinking is that the node iterator will own a buffer. One
consequence is that the fields of the current node will point to the
buffer akin to foreach(line; File.byLine), so in order to lift the input
the user will have to dup (or process the node in-place). As new nodes
will be overwritten on the same piece of memory, an important trait of
the design emerges: cache intensity. Because of XML namespaces I think
it is necessary for the buffer to contain the current node plus all its
parents.


That might not scale well.  For instance, if you are accessing the 1500th
child element of a parent, doesn't that mean that the buffer must contain
the full text for the previous 1499 elements in order to also contain the
parent?

Maybe I'm misunderstanding what you mean.


Let's talk on an example:

<a name="value">
        <b>
                Some Text 1
                <c2>       <!-- HERE -->
                Some text 2
                </c2>
                Some Text 3
        </b>
</a>

The buffer of the iterator positioned HERE would be:

[Node a | Node b | Node c2]

Node c2 and all its parents are available for inspection. Node a's attribute is stored in the buffer, but not 
b's "Some text 1" as it is c2's sibling; "Some text 1" was available in the previous 
iteration, now it's overwritten by c2. To get to "Some text 2" let's advance the iterator in depth 
to get:

[Node a | Node b | Node c2 | Text node "Some text 2"]

Advancing it once more we get to:

[Node a | Node b | Text node "Some text 3"]

So "Some text 3" is written where c2 and the text node 2 used to be.

That's very similar to what I was thinking at. What I wonder is whether, inyour buffer representations above, 'Node x' represents an instanciated node, orcollected data necessary to later instanciate --once the current part of thesource is validated (proved correct). What i mean is, the whole <a> node willbe validated only when </a> is matched, so that if your parsing process failson the way (and/or it was just following a wrong parsing path), then all nodesinstanciated along tha way are just to throw away, aren't they (unless somememoisation may be useful).Possibly all what say here is just stupid, depending on the parsing algo andnature of the grammar, and also how costly Node creation is. In my case, all ofthis seems relevant. Thus, I'm thinking at just collecting data along the way(rather easy & far cheaper than node construction), and (recursively)instanciate only once a section is validated. (needs to be tried concretely--maybe there are issues I'm not yet aware of)


Denis
--
_________________
vita es estrany
spir.wikidot.com

Re: High performance XML parser

Reply via email to