On Sat, Mar 30, 2013 at 08:02:38AM +0100, Martin B. wrote:
> {Re-sending this. Never got anywhere it seems.}
>
> Hi!
>
> I currently have to fix an existing application to use something
> other than the DOM interface of libxml2 because it turns out it gets
> passed XML files so large that they can't be loaded into memory.
>
> I have rewritten the data loading from iterating over the DOM tree
> to using xmlTextReader for the most part now without too much
> problems.
>
> It turns out however, that the subtree where the large data resides
> has to be read not in-order, but I have to collect some (small
> amount of) data before the other. (And the problem is exactly that
> it is this subtree that contains the large volume of data, so
> loading only this subtree into memory doesn't make much sense
> either.)
>
> The easiest thing would be to just "clone" / "copy" my current
> reader, read ahead and then return to the original instance to
> continue reading there.
>
> There doesn't appear to be any way however to "copy" the state of an
> xmlTextReader.
>
The problem is that XML parsing is inherently a sequential
operation: you can't go backward or start from an arbitrary
'index'. For cloning at a given point and continuing, the obstacle
is the I/O model. The parser can read from a file descriptor or even
from a custom I/O layer built from a set of callback functions. The only
way to support cloning would be to keep all the input data processed from
that point on until it gets consumed by the cloned parser. In most cases
though the size of the data fed to the parser is nearly an order of
magnitude less than the memory used by the equivalent tree (it depends a
lot on what your tree looks like!) so that may still be a gain.
But by definition of parsing, the clone will still have to go
through all the data from the cloning point, and the core of the issue
is that you can't always clone an I/O path.
IMHO if you're processing from a file, just reparse. Parsing
can be extremely fast if you don't need to allocate a tree or data
as you go.
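
To make the "just reparse" suggestion concrete, here is a minimal
two-pass sketch. It is written in Python, with the standard-library
xml.etree.ElementTree.iterparse standing in for xmlTextReader purely so
that it runs without extra dependencies. The element names (big, row,
unit), the helper names and the sample document are all hypothetical.
Pass 1 collects the small data, pass 2 streams the large data, and
neither pass keeps element contents around.

```python
import io
import xml.etree.ElementTree as ET


def collect_small(source, tag):
    """Pass 1: gather the text of every <tag> element, ignoring the rest."""
    found = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            found.append(elem.text)
        # Drop text and children once the element is complete. Empty
        # element shells stay attached to their parent until the parent
        # itself ends, so this reduces memory use rather than removing it.
        elem.clear()
    return found


def process_large(source, tag, handle):
    """Pass 2: hand the text of each <tag> element to handle(), then discard it."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            handle(elem.text)
        elem.clear()


# Hypothetical input: the small <unit> value comes *after* the large data.
DOC = b"<root><big><row>1</row><row>2</row></big><unit>kg</unit></root>"

units = collect_small(io.BytesIO(DOC), "unit")

out = []
process_large(io.BytesIO(DOC), "row", lambda text: out.append(f"{text} {units[0]}"))
```

With a real file you would pass the filename (or reopen the file) for
each pass instead of wrapping bytes in io.BytesIO.
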
> If I can't re-read part of a file, I could also re-read the whole
> file, which, although wasteful, would be OK here, but I still would
> need to remember where I was beforehand?
>
> Is there maybe a simple way to remember for a xmlTextReader where it
> is in the current document, so that I can later find that position
> again when reading the document/file a second time?
Hum, no. On a tree I would have said use xmlGetNodePath(xmlNodePtr),
but that won't work with the reader since most of the tree is discarded.
You will be iterating with Read() though, so assuming you don't use any
other operations that advance the reader, just count the calls. Then on
the second pass run a loop with the same number of Read() calls and you
should be at the same place, provided the input didn't change!
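
A rough sketch of that count-and-replay idea, again in Python with
xml.etree.ElementTree.iterparse as a stand-in for the reader (each
start/end event plays the role of one Read() step). The helper names
find_position and resume_at, and the sample document, are made up for
illustration.

```python
import io
import xml.etree.ElementTree as ET


def events(source):
    """Yield (event, element) pairs in document order."""
    return ET.iterparse(source, events=("start", "end"))


def find_position(source, predicate):
    """First pass: count events up to and including the first match."""
    for count, (event, elem) in enumerate(events(source), start=1):
        if predicate(event, elem):
            return count
        if event == "end":
            elem.clear()  # do not hold on to data we only walk past
    return None  # nothing matched


def resume_at(source, count):
    """Second pass: skip `count` events and return the positioned iterator."""
    it = events(source)
    for _ in range(count):
        event, elem = next(it)
        if event == "end":
            elem.clear()
    return it


# Hypothetical document; it must be byte-for-byte the same on both passes.
DOC = b"<root><meta id='m'/><big><row>1</row><row>2</row></big><tail/></root>"

# Remember where the <big> subtree starts ...
pos = find_position(io.BytesIO(DOC), lambda ev, el: ev == "start" and el.tag == "big")

# ... and come back to it on a fresh pass over the same input.
it = resume_at(io.BytesIO(DOC), pos)
event, elem = next(it)  # first event after the remembered position
```

With libxml2 itself the counter would be incremented once per successful
xmlTextReaderRead() call, and the replay loop would call
xmlTextReaderRead() that many times on a freshly created reader.
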
Daniel
--
Daniel Veillard | Open Source and Standards, Red Hat
[email protected] | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | virtualization library http://libvirt.org/
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
[email protected]
https://mail.gnome.org/mailman/listinfo/xml