On 6/15/07, Ken Foskey <[EMAIL PROTECTED]> wrote:
> On Fri, 2007-06-15 at 14:08 -0700, [EMAIL PROTECTED] wrote:
> > Sweet Chas!
> >
> > My two cents is to check out XML::Twig at http://www.xmltwig.org/ .
> >
> > It's easy to get a grip on and well suited to reading attributes as
> > well as elements in elements.
>
> Looks interesting, would this reasonably handle 70,000 statements
> containing LOTS of details in them? I need to process this statement
> by statement but it would totally blow memory.
>
> --
> Ken Foskey
> FOSS developer
Yes, as long as you only need to deal with one part of it at a time,
you can flush the parts you are done with from memory. I have used
XML::Twig to deal with files in the gigabyte range.
from perldoc XML::Twig
Processing an XML document chunk by chunk
One of the strengths of XML::Twig is that it lets you work with files
that do not fit in memory (BTW storing an XML document in memory as a
tree is quite memory-expensive, the expansion factor being often around
10).
To do this you can define handlers that will be called once a specific
element has been completely parsed. In these handlers you can access
the element and process it as you see fit, using the navigation and the
cut-n-paste methods, plus lots of convenient ones like "prefix". Once
the element is completely processed you can then "flush" it, which
will output it and free the memory. You can also "purge" it if you
don't need to output it (if you are just extracting some data from the
document for example). The handler will be called again once the next
relevant element has been parsed.
  my $t= XML::Twig->new( twig_handlers =>
                          { section => \&section,
                            para    => sub { $_->set_tag( 'p'); },
                          },
                       );
  $t->parsefile( 'doc.xml');
  $t->flush; # don't forget to flush one last time in the end or anything
             # after the last </section> tag will not be output

  # the handler is called once a section is completely parsed, ie when
  # the end tag for section is found, it receives the twig itself and
  # the element (including all its sub-elements) as arguments
  sub section
    { my( $t, $section)= @_;      # arguments for all twig_handlers
      $section->set_tag( 'div');  # change the tag name, my favourite method...
      # let's use the attribute nb as a prefix to the title
      my $title= $section->first_child( 'title'); # find the title
      my $nb= $title->{'att'}->{'nb'};            # get the attribute
      $title->prefix( "$nb - ");                  # easy isn't it?
      $section->flush;            # outputs the section and frees memory
    }
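
For your statement-by-statement case, where you only want to pull data
out and never write the XML back, "purge" is probably a better fit than
"flush". Here is a minimal sketch of that. I am guessing at your layout:
the element names statements, statement and amount, the id attribute,
and the summarize_statements sub are all made up for illustration, so
adjust them to match your real file.

```perl
use strict;
use warnings;
use XML::Twig;

# Assumed (hypothetical) layout:
#   <statements>
#     <statement id="..."><amount>...</amount></statement>
#     ...
#   </statements>
sub summarize_statements {
    my ($xml) = @_;
    my @rows;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # called each time a </statement> end tag has been parsed
            statement => sub {
                my ( $t, $stmt ) = @_;
                # keep only the small pieces we care about
                push @rows,
                    [ $stmt->att('id'), $stmt->first_child_text('amount') ];
                # drop everything parsed so far; nothing is printed
                $t->purge;
            },
        },
    );

    # parse() takes a string; for a large file on disk use
    # $twig->parsefile($path) instead
    $twig->parse($xml);
    return @rows;
}

my $sample = <<'XML';
<statements>
  <statement id="1"><amount>10.50</amount></statement>
  <statement id="2"><amount>7.25</amount></statement>
</statements>
XML

printf "%s: %s\n", @$_ for summarize_statements($sample);
```

Only one statement's subtree is held at a time, so memory use should
stay roughly flat however many statements there are, provided what you
keep per statement (here @rows) stays small.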