Re: xml::twig help

2008-03-20 Thread mirod
ken Foskey wrote:
> Updated to fix memory problem, you have to purge.   Takes over 30
> minutes for 120K records.
> 
> I am sure that  the whole process can be done better with a good
> understanding of the module.  Will benchmark XML::Rules though.


Not knowing the structure of the XML you are processing makes it a bit
hard to give you advice, but descendants is an expensive method. If the
elements you are interested in (member, add1...) are in a known position
within the mem element, use first_child and the like to get to them. For
example, if member is a direct child of mem, then

my $mem_no = get_value( $mem_ref, 'member' );

can be replaced by

my $mem_no = $mem_ref->first_child( 'member' )->text;

or, even shorter:

# field doesn't die if there is no member child
my $mem_no = $mem_ref->field( 'member' );

If you don't know where the elements are, at least you can use
first_descendant, to avoid going through the entire list of descendants
(it will stop once the first matching descendant is found).
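
To make the difference concrete, here is a minimal sketch of the three
accessors side by side. The sample XML, its root element name and the
nesting of add1 inside an address element are assumptions made up for
illustration; they are not the structure of the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample record: the root name and the <address> wrapper
# are assumptions for illustration only.
my $xml = <<'XML';
<members>
  <mem>
    <member>1001</member>
    <address><add1>1 Example St</add1></address>
  </mem>
</members>
XML

my $twig = XML::Twig->new;
$twig->parse($xml);
my $mem = $twig->root->first_child('mem');

# Direct child: first_child returns the element, text returns its content.
my $direct = $mem->first_child('member')->text;

# Same lookup via field, which returns '' instead of dying on a missing child.
my $short   = $mem->field('member');
my $missing = $mem->field('add2');

# Element at an unknown depth: first_descendant stops at the first match.
my $nested = $mem->first_descendant('add1')->text;

print "direct=$direct short=$short nested=$nested missing='$missing'\n";
```

Note that first_child returns undef when nothing matches, so calling
->text on the result dies for a missing element, which is why field is
the safer choice for optional children.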

HTH

-- 
mirod

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/




Re: xml::twig help

2008-03-18 Thread Rob Dixon

Ken Foskey wrote:
> For the record on a more complex script than the address one...
>
> xml:simple   7 hours plus on very quick machine, still running and
> absolutely hammering the system, 1.3 Gig of memory used.
>
> xml::twig  1 hour on laptop (underpowered and not much memory), Linux
> still usable while running.
>
> Definitely worth the time learning XML::Twig.


I have never found XML::Simple to be either simple or efficient. It's a
shame that it earns its reputation by seeming to be approachable.

My vote would be with XML::LibXML, but XML::Twig is pretty good.

Rob





Re: xml::twig help

2008-03-18 Thread ken Foskey
For the record, on a more complex script than the address one:

XML::Simple: 7 hours plus on a very quick machine, still running and
absolutely hammering the system, 1.3 GB of memory used.

XML::Twig: 1 hour on a laptop (underpowered and not much memory), Linux
still usable while running.

Definitely worth the time learning XML::Twig.

-- 
Ken Foskey
FOSS developer






Re: xml::twig help

2008-03-18 Thread Rob Dixon

Ken Foskey wrote:


>
> Updated to fix memory problem, you have to purge.   Takes over 30
> minutes for 120K records.
>
> I am sure that  the whole process can be done better with a good
> understanding of the module.  Will benchmark XML::Rules though.

Hi Ken

OK so your query was about how to make your code run faster. 15ms per
record isn't bad, but I'm sure it would be faster if you junk your
get_value and no_pipe functions and rewrite member something like
this:

sub member {

my $mem_ref = pop;

$member++;

my @data = map $mem_ref->first_child_text($_),
qw/member add1 add2 add3 suburb state pcode/;

tr/|//d foreach @data;

print $sort join('|', @data), "\n";
}
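
A fuller sketch of the same handler follows, for anyone who wants to run
it. It is not Rob's exact code: it puts back the running record counter
as the first output column and the $twig->purge call from Ken's updated
script, since both are absent from the short version above. The sample
XML (including the root element name) is made up, and lines are
collected in an array here instead of being printed to $sort.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my $member = 0;
my @lines;    # illustration only; the real script prints to $sort

sub member {
    my ($twig, $mem_ref) = @_;

    $member++;

    # first_child_text returns '' when the child is missing
    my @data = map { $mem_ref->first_child_text($_) }
        qw/member add1 add2 add3 suburb state pcode/;

    # strip the delimiter from every field, in place
    tr/|//d foreach @data;

    push @lines, join('|', $member, @data);

    # release what has been parsed so far, so memory stays flat
    $twig->purge;
    return 1;
}

# Hypothetical input; element names follow the thread, values are invented.
my $xml = '<members>'
    . '<mem><member>1001</member><add1>1 Main|St</add1><add2/><add3/>'
    . '<suburb>Sydney</suburb><state>NSW</state><pcode>2000</pcode></mem>'
    . '<mem><member>1002</member><add1>2 High St</add1>'
    . '<suburb>Perth</suburb><state>WA</state><pcode>6000</pcode></mem>'
    . '</members>';

XML::Twig->new( twig_handlers => { mem => \&member } )->parse($xml);

print "$_\n" for @lines;
```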

Oh, and there's no point in the die call on XML::Twig->new as something
dire is wrong if this returns false, and anyway it's certainly not
because the input file can't be opened. The parsefile method will throw
its own die if it can't open the file, so all bases are covered already.
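
If you would rather report the failure yourself than let the script die,
a plain eval around parsefile is enough. This is only a sketch; the file
name below is hypothetical and is assumed not to exist.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my $file = 'no_such_file.xml';    # hypothetical name, assumed not to exist
my $twig = XML::Twig->new;

# parsefile dies on an unreadable file or malformed XML; eval traps that.
my $ok    = eval { $twig->parsefile($file); 1 };
my $error = $ok ? '' : $@;

print $ok ? "parsed $file\n" : "could not parse $file: $error\n";
```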

HTH,

Rob






Re: xml::twig help

2008-03-17 Thread ken Foskey
Updated to fix memory problem, you have to purge.   Takes over 30
minutes for 120K records.

I am sure that  the whole process can be done better with a good
understanding of the module.  Will benchmark XML::Rules though.

On Tue, 2008-03-18 at 00:55 +1100, Ken Foskey wrote:
> I am extracting addresses from an XML file to process through other
> programs using pipe delimiter the following code works but this is going
> to get 130,000 records through it it must be very efficient and I cannot
> follow the documentation on the best way to do this.
> 
> After this simple one is programmed I have to change a much more complex
> version of this program.
> 
> #!/usr/bin/perl -w
> # vi:set sw=4 ts=4 et cin:
> # $Id:$
> 
> =head1 SYNOPSIS
> 
> Extract addresses from an XML file into pipe delimited file.
> 
>usage: address_extract.pl  xml_file
> 
> =cut
> 
> use warnings;
> use strict;
> 
> use XML::Twig qw(:strict);
> 
> sub no_pipe
> {
> my $value = shift;
> 
> $value =~ s/\|//g;
> return $value;
> }
> 
> if( ! -f $ARGV[0] ) {
> print "$ARGV[0] is not a filename, requires filename as first
> parameter!\n";
> }
> 
> my $sort;
> my $sort_file = $ARGV[0].'.unsorted';
> unlink $sort_file; # in case of rerun
> open( $sort, '>', $sort_file  ) 
> or die "Unable to open $sort_file for output $!";
> 
> my $ref = XML::Twig->new( twig_handlers=>{mem=>\&member} ) 
> or die "Unable to open $ARGV[0] $!";
> 
> my $member = 0;
> 
> $ref->parsefile( $ARGV[0] );
> 
> sub get_value
> {
> my ($mem_ref, $key) = @_;
> my @array = $mem_ref->descendants( $key );
> return $array[0]->text();
> }
> 
> sub member {
my ($twig, $mem_ref) = @_;
> $member++;
> 
> my $mem_no = get_value( $mem_ref, 'member' );
> my $add1   = get_value( $mem_ref, 'add1' );
> my $add2   = get_value( $mem_ref, 'add2' );
> my $add3   = get_value( $mem_ref, 'add3' );
> my $suburb = get_value( $mem_ref, 'suburb' );
> my $state  = get_value( $mem_ref, 'state' );
> my $pcode  = get_value( $mem_ref, 'pcode' );
> 
> print $sort join( '|', $member,
>  $mem_no,
>  no_pipe( $add1 ),
>  no_pipe( $add2 ),
>  no_pipe( $add3 ),
>  no_pipe( $suburb),
>  $state,
>  $pcode,
> ) ."\n";
$twig->purge;

> return 1;
> }
> 
-- 
Ken Foskey
FOSS developer






Re: xml::twig help

2008-03-17 Thread Jenda Krynicky
From:   Ken Foskey <[EMAIL PROTECTED]>
> I am extracting addresses from an XML file to process through other
> programs using pipe delimiter the following code works but this is going
> to get 130,000 records through it it must be very efficient and I cannot
> follow the documentation on the best way to do this.
> 
> After this simple one is programmed I have to change a much more complex
> version of this program.
> 
> 
>
> my $ref = XML::Twig->new( twig_handlers=>{mem=>\&member} ) 
> or die "Unable to open $ARGV[0] $!";
> 
> my $member = 0;
> 
> $ref->parsefile( $ARGV[0] );
> 
> sub get_value
> {
> my ($mem_ref, $key) = @_;
> my @array = $mem_ref->descendants( $key );
> return $array[0]->text();
> }
> 
> sub member {
> my ($t, $mem_ref) = @_;
> $member++;
> 
> my $mem_no = get_value( $mem_ref, 'member' );
> my $add1   = get_value( $mem_ref, 'add1' );
> my $add2   = get_value( $mem_ref, 'add2' );
> my $add3   = get_value( $mem_ref, 'add3' );
> my $suburb = get_value( $mem_ref, 'suburb' );
> my $state  = get_value( $mem_ref, 'state' );
> my $pcode  = get_value( $mem_ref, 'pcode' );
> 
> print $sort join( '|', $member,
>  $mem_no,
>  no_pipe( $add1 ),
>  no_pipe( $add2 ),
>  no_pipe( $add3 ),
>  no_pipe( $suburb),
>  $state,
>  $pcode,
> ) ."\n";
> return 1;
> }

Looks like a perfect task for XML::Rules:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Rules;

my $member;
my $parser = XML::Rules->new(
    stripspaces => 7,
    rules => {
        _default => 'content',
        mem => sub {
            # $_[1] holds the attributes plus the content returned by
            # the rules of the child elements
            print join( '|', ++$member,
                map { (my $s = $_[1]->{$_}) =~ s/\|//g; $s }
                    qw(member add1 add2 add3 suburb state pcode) ), "\n";
            return;
        }
    },
);

$parser->parse(\*DATA);

__DATA__

   
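   <mem>
     <member>member</member>
     <add1>add1</add1>
     <add2>add2</add2>
     <add3>add3</add3>
     <suburb>suburb</suburb>
     <state>state</state>
     <pcode>pcode</pcode>
   </mem>
   <mem>
     <member>other</member>
     <add1>ADD1</add1>
     <add2>ADD2</add2>
     <add3>ADD3</add3>
     <suburb>suburb 2</suburb>
     <state>state</state>
     <pcode>pcode</pcode>
   </mem>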
   




I would expect this to be quicker than your XML::Twig solution, 
though I have to leave the benchmarking to you.

HTH, Jenda
= [EMAIL PROTECTED] === http://Jenda.Krynicky.cz =
When it comes to wine, women and song, wizards are allowed 
to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery






Re: xml::twig help

2008-03-17 Thread Jay Savage
On Mon, Mar 17, 2008 at 9:55 AM, Ken Foskey <[EMAIL PROTECTED]> wrote:
>
>  I am extracting addresses from an XML file to process through other
>  programs using pipe delimiter the following code works but this is going
>  to get 130,000 records through it it must be very efficient and I cannot
>  follow the documentation on the best way to do this.
>
>  After this simple one is programmed I have to change a much more complex
>  version of this program.
>
>  #!/usr/bin/perl -w
>  # vi:set sw=4 ts=4 et cin:
>  # $Id:$
>
>  =head1 SYNOPSIS
>
>  Extract addresses from an XML file into pipe delimited file.
>
>usage: address_extract.pl  xml_file
>
>  =cut
>
>  use warnings;
>  use strict;
>
>  use XML::Twig qw(:strict);
>
>  sub no_pipe
>  {
> my $value = shift;
>
> $value =~ s/\|//g;
> return $value;
>  }
>
>  if( ! -f $ARGV[0] ) {
> print "$ARGV[0] is not a filename, requires filename as first
>  parameter!\n";
>  }
>
>  my $sort;
>  my $sort_file = $ARGV[0].'.unsorted';
>  unlink $sort_file; # in case of rerun
>  open( $sort, '>', $sort_file  )
> or die "Unable to open $sort_file for output $!";
>
>  my $ref = XML::Twig->new( twig_handlers=>{mem=>\&member} )
> or die "Unable to open $ARGV[0] $!";
>
>  my $member = 0;
>
>  $ref->parsefile( $ARGV[0] );
>
>  sub get_value
>  {
> my ($mem_ref, $key) = @_;
> my @array = $mem_ref->descendants( $key );
> return $array[0]->text();
>  }
>
>  sub member {
> my ($t, $mem_ref) = @_;
> $member++;
>
> my $mem_no = get_value( $mem_ref, 'member' );
> my $add1   = get_value( $mem_ref, 'add1' );
> my $add2   = get_value( $mem_ref, 'add2' );
> my $add3   = get_value( $mem_ref, 'add3' );
> my $suburb = get_value( $mem_ref, 'suburb' );
> my $state  = get_value( $mem_ref, 'state' );
> my $pcode  = get_value( $mem_ref, 'pcode' );
>
> print $sort join( '|', $member,
>  $mem_no,
>  no_pipe( $add1 ),
>  no_pipe( $add2 ),
>  no_pipe( $add3 ),
>  no_pipe( $suburb),
>  $state,
>  $pcode,
> ) ."\n";
> return 1;
>  }
>
>

Ken,

If you're really worried about performance, then I would say two
places to look first would be all the temporary variables and
subroutine invocations. I don't know the ins and outs of XML::Twig, so
I don't really have any concrete advice--for instance, does
'$mem_ref->descendants( $key )->text()' work? some modules will parse
a structure like that, some won't--but in general, think about
creative ways you might use map to avoid three subroutine invocations
and (5? 6?) temporary variables for each element you process.
Something like the following should get you started down the path:

## Untested!

sub member {
my $t = shift;

print join "|",
map { s/\|//g }
map { $_[0]->descendants( $_ )->text() } qw/ member add1 add2 add3
suburb state pcode /;
}

That may not work out of the box depending on how deeply nested
XML::Twig's refs are, but hopefully you can see where I'm headed with
it.
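
One pitfall with the untested sketch above, independent of XML::Twig:
s/// returns the number of substitutions it made, not the modified
string, so a bare map { s/\|//g } yields counts. The following
plain-Perl sketch shows the difference, using invented sample values in
place of the element text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample values standing in for the text of each element.
my @raw = ('12 High|St', 'Unit|4|B', 'Sydney');

# Pitfall: map collects the return value of s///, i.e. the number of
# substitutions (or an empty string when nothing matched). It also
# modifies the list it iterates over, hence the copy.
my @copy   = @raw;
my @counts = map { s/\|//g } @copy;

# Working form: substitute on a copy and return the copy.
my @clean = map { (my $s = $_) =~ s/\|//g; $s } @raw;

print "counts: ", join(',', @counts), "\n";
print "clean:  ", join('|', @clean),  "\n";
```

The copy-then-substitute idiom works on old Perls as well; on Perl 5.14
or later, map { s/\|//gr } does the same thing in fewer characters.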

Also, string concatenation is less efficient than adding another term
to print's argument list, i.e. use ',' instead of '.' in print when
you can. Lastly, efficiency for matching is, IME, largely a matter
of system configuration and specific input data, but it's worth
benchmarking to see if substr() performs better than s/// in your
case. Sometimes you can get significant savings that way. Sometimes
you can't, of course.
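
A rough way to run that kind of comparison is the core Benchmark module.
The sketch below compares s///g against tr///d (the operator used
elsewhere in this thread) and not substr(), and the field values are
invented, so treat it as a template to adapt. The numbers will differ
from machine to machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Invented field values; substitute a sample of your real data.
my @fields = ('12 High|St', 'Unit|4|B', 'Sydney', 'NSW', '2000');

# Each candidate works on its own copy so the runs do not affect each other.
sub strip_subst { my @f = @fields; s/\|//g for @f; return join '|', @f }
sub strip_tr    { my @f = @fields; tr/|//d for @f; return join '|', @f }

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, {
    's///g'  => \&strip_subst,
    'tr///d' => \&strip_tr,
} );
```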

HTH,

-- jay
--
This email and attachment(s): [ ] blogable; [ x ] ask first; [ ]
private and confidential

daggerquill [at] gmail [dot] com
http://www.tuaw.com http://www.downloadsquad.com http://www.engatiki.org

values of β will give rise to dom!


Re: xml::twig help

2008-03-17 Thread Rob Dixon

Ken Foskey wrote:

I am extracting addresses from an XML file to process through other
programs using pipe delimiter the following code works but this is going
to get 130,000 records through it it must be very efficient and I cannot
follow the documentation on the best way to do this.

After this simple one is programmed I have to change a much more complex
version of this program.

#!/usr/bin/perl -w
# vi:set sw=4 ts=4 et cin:
# $Id:$

=head1 SYNOPSIS

Extract addresses from an XML file into pipe delimited file.

   usage: address_extract.pl  xml_file

=cut

use warnings;
use strict;

use XML::Twig qw(:strict);

sub no_pipe
{
my $value = shift;

$value =~ s/\|//g;
return $value;
}

if( ! -f $ARGV[0] ) {
print "$ARGV[0] is not a filename, requires filename as first
parameter!\n";
}

my $sort;
my $sort_file = $ARGV[0].'.unsorted';
unlink $sort_file; # in case of rerun
open( $sort, '>', $sort_file  ) 
or die "Unable to open $sort_file for output $!";


my $ref = XML::Twig->new( twig_handlers=>{mem=>\&member} ) 
or die "Unable to open $ARGV[0] $!";


my $member = 0;

$ref->parsefile( $ARGV[0] );

sub get_value
{
my ($mem_ref, $key) = @_;
my @array = $mem_ref->descendants( $key );
return $array[0]->text();
}

sub member {
my ($t, $mem_ref) = @_;
$member++;

my $mem_no = get_value( $mem_ref, 'member' );
my $add1   = get_value( $mem_ref, 'add1' );
my $add2   = get_value( $mem_ref, 'add2' );
my $add3   = get_value( $mem_ref, 'add3' );
my $suburb = get_value( $mem_ref, 'suburb' );
my $state  = get_value( $mem_ref, 'state' );
my $pcode  = get_value( $mem_ref, 'pcode' );

print $sort join( '|', $member,
 $mem_no,
 no_pipe( $add1 ),
 no_pipe( $add2 ),
 no_pipe( $add3 ),
 no_pipe( $suburb),
 $state,
 $pcode,
) ."\n";
return 1;
}


What is your question? I ran your program against XML data like


  
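  <mem>
    <member>member</member>
    <add1>add1</add1>
    <add2>add2</add2>
    <add3>add3</add3>
    <suburb>suburb</suburb>
    <state>state</state>
    <pcode>pcode</pcode>
  </mem>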
  


and it seemed to work fine.

I would only change it cosmetically, for instance it would be nicer to
pull the first command-line parameter off into a variable, and you need
to die if there isn't one, not just print a message and carry on.

my $file = shift;

unless ($file and -f $file) {
  die "$file is not a filename, requires filename as first parameter!\n";
}

also, you could replace all your calls to get_value with

my $mem_no = $mem_ref->first_child('member')->text;

and so on, but there seems to be no problem with the basic
functionality. Let us know if you need any further help.

HTH,

Rob







xml::twig help

2008-03-17 Thread Ken Foskey

I am extracting addresses from an XML file into a pipe-delimited file
for processing by other programs. The following code works, but it is
going to have 130,000 records put through it, so it must be very
efficient, and I cannot follow the documentation on the best way to do
this.

After this simple one is programmed I have to change a much more complex
version of this program.

#!/usr/bin/perl -w
# vi:set sw=4 ts=4 et cin:
# $Id:$

=head1 SYNOPSIS

Extract addresses from an XML file into pipe delimited file.

   usage: address_extract.pl  xml_file

=cut

use warnings;
use strict;

use XML::Twig qw(:strict);

sub no_pipe
{
my $value = shift;

$value =~ s/\|//g;
return $value;
}

if( ! -f $ARGV[0] ) {
print "$ARGV[0] is not a filename, requires filename as first
parameter!\n";
}

my $sort;
my $sort_file = $ARGV[0].'.unsorted';
unlink $sort_file; # in case of rerun
open( $sort, '>', $sort_file  ) 
or die "Unable to open $sort_file for output $!";

my $ref = XML::Twig->new( twig_handlers=>{mem=>\&member} ) 
or die "Unable to open $ARGV[0] $!";

my $member = 0;

$ref->parsefile( $ARGV[0] );

sub get_value
{
my ($mem_ref, $key) = @_;
my @array = $mem_ref->descendants( $key );
return $array[0]->text();
}

sub member {
my ($t, $mem_ref) = @_;
$member++;

my $mem_no = get_value( $mem_ref, 'member' );
my $add1   = get_value( $mem_ref, 'add1' );
my $add2   = get_value( $mem_ref, 'add2' );
my $add3   = get_value( $mem_ref, 'add3' );
my $suburb = get_value( $mem_ref, 'suburb' );
my $state  = get_value( $mem_ref, 'state' );
my $pcode  = get_value( $mem_ref, 'pcode' );

print $sort join( '|', $member,
 $mem_no,
 no_pipe( $add1 ),
 no_pipe( $add2 ),
 no_pipe( $add3 ),
 no_pipe( $suburb),
 $state,
 $pcode,
) ."\n";
return 1;
}