Hi everybody, I have already posted on comp.lang.perl.modules, but if the moderator agrees, I think this would be on topic for this newsgroup as well.
I have released XML::Reader (ver 0.34)
http://search.cpan.org/~keichner/XML-Reader-0.34/lib/XML/Reader.pm

To explain the module, I have created a small demonstration program that extracts XML-subtrees (for example, any path that ends with '/.../a') in a memory-efficient way.

An XML document can be very large (possibly many gigabytes) but is composed of XML-subtrees, each of which is only a few kilobytes in size. The demonstration program reads the XML-subtrees one by one, so only the memory for one subtree is held at a time. Each subtree can then be processed further at your convenience (for example, by using regular expressions or other XML modules such as XML::Simple).

In principle, XML::Reader has no event-driven callback functions: you have to loop over the XML document yourself, and each resulting XML-subtree is represented in text format.

Any questions, suggestions, or feedback are most welcome!

Here is my demonstration program:

use strict;
use warnings;

use XML::Reader 0.34;
use LWP::Simple;
use XML::Simple;
use Data::Dumper;

my $addr = 'http://www.w3.org/TR/xhtml1';

print "reading $addr...\n";
my $content = get $addr
  or die "Error-0010: Can't get address '$addr'";
print "\n";

# 1. Print each <dt> subtree as XML text
{
    my $rdr = XML::Reader->newhd(\$content, { filter => 5 },
        { root => '/html/body/dl/dt', branch => '*' },
    ) or die "Error-0030: Can't X::R->new() because $!";

    my $i;
    while ($rdr->iterate) {
        $i++;
        my $xml = $rdr->rval;
        printf "<dt1> %3d. %s\n", $i, $xml;
    }
    print "\n";
}

# 2. Same subtrees, parsed with XML::Simple and dumped on one line
{
    my $rdr = XML::Reader->newhd(\$content, { filter => 5 },
        { root => '/html/body/dl/dt', branch => '*' },
    ) or die "Error-0020: Can't X::R->new() because $!";

    my $i;
    while ($rdr->iterate) {
        $i++;
        my $xml = $rdr->rval;
        my $ref = XMLin($xml);
        my $dmp = Dumper($ref);
        $dmp =~ s{\s}''xmsg;         # remove all whitespace
        $dmp =~ s{\$VAR1=}''xms;     # remove the leading '$VAR1='
        printf "<dt2> %3d. %s\n", $i, $dmp;
    }
    print "\n";
}

# 3. Text and href attribute of every <a> element
{
    my $rdr = XML::Reader->newhd(\$content, { filter => 5 },
        { root => '//a', branch => ['/', '/@href'] },
    ) or die "Error-0040: Can't X::R->new() because $!";

    my $i;
    while ($rdr->iterate) {
        my ($text, $href) = $rdr->rval;
        next unless defined $href;   # skip <a> elements without href
        my $stem = $rdr->rstem;
        $i++;
        for ($text, $href) { $_ = '' unless defined $_; }
        printf "<a> %3d. %-35s: %-18.18s href=%s\n", $i, $stem, $text, $href;
    }
    print "\n";
}

# 4. src, height and width attributes of every <img> element
{
    my $rdr = XML::Reader->newhd(\$content, { filter => 5 },
        { root => '//img', branch => ['/@src', '/@height', '/@width'] },
    ) or die "Error-0050: Can't X::R->new() because $!";

    my $i;
    while ($rdr->iterate) {
        my ($src, $height, $width) = $rdr->rval;
        $i++;
        for ($src, $height, $width) { $_ = '' unless defined $_; }
        printf "<img> %3d. src=%-40s h=%-4s w=%s\n", $i, $src, $height, $width;
    }
    print "\n";
}

And here is an extract from the output:

reading http://www.w3.org/TR/xhtml1 ...

[...]
<dt1>  20. <dt><code class='tag'>a</code></dt>
<dt1>  21. <dt><code class='tag'>pre</code></dt>
<dt1>  22. <dt><code class='tag'>button</code></dt>
<dt1>  23. <dt><code class='tag'>label</code></dt>
[...]
<dt2>  20. {'code'=>{'content'=>'a','class'=>'tag'}};
<dt2>  21. {'code'=>{'content'=>'pre','class'=>'tag'}};
<dt2>  22. {'code'=>{'content'=>'button','class'=>'tag'}};
<dt2>  23. {'code'=>{'content'=>'label','class'=>'tag'}};
[...]
<a>  43. /html/body/div/ul/li/a             : Acknowledgements   href=#acks
<a>  44. /html/body/div/ul/li/a             : References         href=#refs
<a>  45. /html/body/div/ul/li/a             : What is XHTML?     href=#xhtml
<a>  46. /html/body/div/ul/li/ul/li/a       : What is HTML 4?    href=#html4
<a>  47. /html/body/div/ul/li/ul/li/a       : What is XML?       href=#xml
[...]
<img>   1. src=http://www.w3.org/Icons/w3c_home         h=48   w=72
<img>   2. src=http://www.w3.org/WAI/wcag1AAA.png       h=32   w=88