On 6/15/07, Ken Foskey <[EMAIL PROTECTED]> wrote:
On Fri, 2007-06-15 at 14:08 -0700, [EMAIL PROTECTED] wrote:
> Sweet Chas!
>
> My two cents is to check out XML::Twig at http://www.xmltwig.org/ .
>
> It's easy to get a grip on and well suited to reading attributes as
> well as elements in elements.

Looks interesting. Would this reasonably handle 70,000 statements,
each containing LOTS of detail? I need to process the file statement
by statement, because loading it all at once would totally blow memory.

--
Ken Foskey
FOSS developer

Yes. So long as you only need to deal with one part of the document at
a time, you can flush the parts you are done with from memory. I have
used XML::Twig to deal with files in the gigabyte range.

From perldoc XML::Twig:
      Processing an XML document chunk by chunk

      One of the strengths of XML::Twig is that it lets you work with files
      that do not fit in memory (BTW storing an XML document in memory as a
      tree is quite memory-expensive, the expansion factor being often around
      10).

      To do this you can define handlers, that will be called once a specific
      element has been completely parsed. In these handlers you can access
      the element and process it as you see fit, using the navigation and the
      cut-n-paste methods, plus lots of convenient ones like "prefix".  Once
      the element is completely processed you can then "flush" it, which
      will output it and free the memory. You can also "purge" it if you
      don't need to output it (if you are just extracting some data from the
      document for example). The handler will be called again once the next
      relevant element has been parsed.

        my $t= XML::Twig->new( twig_handlers =>
                                { section => \&section,
                                  para    => sub { $_->set_tag( 'p'); },
                                },
                             );
        $t->parsefile( 'doc.xml');
        $t->flush; # don't forget to flush one last time in the end or anything
                   # after the last </section> tag will not be output

        # the handler is called once a section is completely parsed, ie when
        # the end tag for section is found, it receives the twig itself and
        # the element (including all its sub-elements) as arguments
        sub section
          { my( $t, $section)= @_;      # arguments for all twig_handlers
            $section->set_tag( 'div');  # change the tag name, my favourite method...
            # let's use the attribute nb as a prefix to the title
            my $title= $section->first_child( 'title'); # find the title
            my $nb= $title->{'att'}->{'nb'}; # get the attribute
            $title->prefix( "$nb - ");  # easy isn't it?
            $section->flush;            # outputs the section and frees memory
          }
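
For the statement-by-statement case asked about above, here is a minimal
sketch that is not taken from the perldoc. It assumes a hypothetical layout:
a <statements> root holding <statement> elements, each with an <amount>
child. Those element names and the inline sample data are made up for
illustration. It uses purge rather than flush because nothing needs to be
output, only extracted.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data; a real run would read a large file instead.
my $xml = <<'XML';
<statements>
  <statement id="1"><amount>10.50</amount></statement>
  <statement id="2"><amount>4.25</amount></statement>
  <statement id="3"><amount>7.00</amount></statement>
</statements>
XML

my $count = 0;
my $total = 0;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a </statement> end tag has been parsed.
        statement => sub {
            my ( $t, $statement ) = @_;
            $count++;
            $total += $statement->first_child_text('amount');

            # Discard everything parsed so far without printing it,
            # so only one statement is held in memory at a time.
            $t->purge;
        },
    },
);

$twig->parse($xml);    # for a real file: $twig->parsefile('statements.xml')

printf "%d statements, total %.2f\n", $count, $total;
# prints: 3 statements, total 21.75
```

If the statements had to be written back out after being modified, the
handler would call flush instead of purge, as in the perldoc example above.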
