XML::Reader

Klaus Wed, 28 Apr 2010 01:52:55 -0700

Hi everybody,

I have already posted on comp.lang.perl.modules, but if the moderator
agrees, I think this would be on topic for this newsgroup as well.


I have released XML::Reader (ver 0.34)
http://search.cpan.org/~keichner/XML-Reader-0.34/lib/XML/Reader.pm

To explain the module, I have created a small demonstration program
that extracts XML-subtrees (for example, any path that ends with
'/.../a') in a memory-efficient way.

An XML document can be very large (possibly many gigabytes), but it is
composed of XML-subtrees, each of which is only a few kilobytes in
size. The demonstration program reads the XML-subtrees one by one, so
only the memory for a single subtree is held at any time. Each subtree
can then be processed further at your convenience (for example with
regular expressions, or with other XML modules such as XML::Simple).
In principle, XML::Reader has no event-driven callback functions: you
loop over the XML document yourself, and each resulting XML-subtree is
represented in text format.
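
As a small illustration of the regular-expression route, here is a
minimal sketch. It does not call XML::Reader at all: it assumes the
subtree text has already been obtained and has the same shape as the
<dt> subtrees shown in the output extract further down. The helper
name extract_tag_name and the sample strings are my own assumptions,
chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Reader): return the text of the
# first <code class='tag'> element found in one subtree string, or
# undef if the subtree contains no such element.
sub extract_tag_name {
    my ($xml) = @_;
    return $xml =~ m{<code \s+ class=(['"])tag\1 \s*> ([^<]*) </code>}xms
        ? $2
        : undef;
}

# Sample subtree strings, modelled on the <dt1> lines of the output.
my @subtrees = (
    q{<dt><code class='tag'>a</code></dt>},
    q{<dt><code class='tag'>pre</code></dt>},
    q{<dt>no code element here</dt>},
);

for my $xml (@subtrees) {
    my $name = extract_tag_name($xml);
    printf "%-45s => %s\n", $xml, defined $name ? $name : '(no match)';
}
```

A regular expression like this is only reasonable because each subtree
is small and has a known, simple shape; for anything nested or
irregular, handing the subtree to a module such as XML::Simple is
probably the safer choice.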

Any questions, suggestions, or feedback are most welcome!

Here is my demonstration program:

use strict;
use warnings;
use XML::Reader 0.34;

use LWP::Simple;
use XML::Simple;
use Data::Dumper;

my $addr = 'http://www.w3.org/TR/xhtml1';

print "reading $addr...\n";
my $content = get $addr
  or die "Error-0010: Can't get address '$addr'";

print "\n";

# Part 1: print each <dt> subtree below /html/body/dl as XML text.
{
    my $rdr = XML::Reader->newhd(\$content,
      { filter => 5 },
      { root => '/html/body/dl/dt', branch => '*' },
    ) or die "Error-0030: Can't X::R->new() because $!";

    my $i;
    while ($rdr->iterate) { $i++;
        my $xml = $rdr->rval;

        printf "<dt1> %3d. %s\n", $i, $xml;
    }
    print "\n";
}

# Part 2: same subtrees, but parsed with XML::Simple and dumped compactly.
{
    my $rdr = XML::Reader->newhd(\$content,
      { filter => 5 },
      { root => '/html/body/dl/dt', branch => '*' },
    ) or die "Error-0020: Can't X::R->new() because $!";

    my $i;
    while ($rdr->iterate) { $i++;
        my $xml = $rdr->rval;
        my $ref = XMLin($xml);
        my $dmp = Dumper($ref);

        $dmp =~ s{\s}''xmsg;
        $dmp =~ s{\$VAR1=}''xms;

        printf "<dt2> %3d. %s\n", $i, $dmp;
    }
    print "\n";
}

# Part 3: for every <a> anywhere in the document, get its text and href.
{
    my $rdr = XML::Reader->newhd(\$content,
      { filter => 5 },
      { root => '//a', branch => ['/', '/@href'] },
    ) or die "Error-0040: Can't X::R->new() because $!";

    my $i;
    while ($rdr->iterate) {
        my ($text, $href) = $rdr->rval;
        next unless defined $href;

        my $stem = $rdr->rstem;

        $i++;
        for ($text, $href) {
            $_ = '' unless defined $_;
        }

        printf "<a>   %3d. %-35s: %-18.18s href=%s\n",
          $i, $stem, $text, $href;
    }
    print "\n";
}

# Part 4: for every <img>, get the src, height and width attributes.
{
    my $rdr = XML::Reader->newhd(\$content,
      { filter => 5 },
      { root   => '//img',
        branch => ['/@src', '/@height', '/@width'] },
    ) or die "Error-0050: Can't X::R->new() because $!";

    my $i;
    while ($rdr->iterate) {
        my ($src, $height, $width) = $rdr->rval;

        $i++;
        for ($src, $height, $width) {
            $_ = '' unless defined $_;
        }

        printf "<img> %3d. src=%-40s h=%-4s w=%s\n",
          $i, $src, $height, $width;
    }
    print "\n";
}

And here is an extract from the output:

reading http://www.w3.org/TR/xhtml1 ...
[...]
<dt1>  20. <dt><code class='tag'>a</code></dt>
<dt1>  21. <dt><code class='tag'>pre</code></dt>
<dt1>  22. <dt><code class='tag'>button</code></dt>
<dt1>  23. <dt><code class='tag'>label</code></dt>
[...]
<dt2>  20. {'code'=>{'content'=>'a','class'=>'tag'}};
<dt2>  21. {'code'=>{'content'=>'pre','class'=>'tag'}};
<dt2>  22. {'code'=>{'content'=>'button','class'=>'tag'}};
<dt2>  23. {'code'=>{'content'=>'label','class'=>'tag'}};
[...]
<a>    43. /html/body/div/ul/li/a             : Acknowledgements   href=#acks
<a>    44. /html/body/div/ul/li/a             : References         href=#refs
<a>    45. /html/body/div/ul/li/a             : What is XHTML?     href=#xhtml
<a>    46. /html/body/div/ul/li/ul/li/a       : What is HTML 4?    href=#html4
<a>    47. /html/body/div/ul/li/ul/li/a       : What is XML?       href=#xml
[...]
<img>   1. src=http://www.w3.org/Icons/w3c_home         h=48   w=72
<img>   2. src=http://www.w3.org/WAI/wcag1AAA.png       h=32   w=88
