Dennis Bourn wrote:

I'm trying to hack together a Perl script to screen-scrape some data from
a table on a web page and enter that data into a MySQL database.
This would be my first attempt at using Perl and HTML::TableContentParser.

The following script was created using bits and pieces I've found in
various Perl examples on the web:
-------------------------------
#!/usr/bin/perl
#use strict;
use lib '/opt/local/lib/perl5/vendor_perl/5.8.6/';
use HTML::TableContentParser;
use LWP::Simple;

my $url =
'http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_wsn=8383';

my $content = get $url;
die "Couldn't get $url" unless defined $content;

$p = HTML::TableContentParser->new();
my $tables = $p->parse($content);

for $t (@$tables) {
  for $r (@{$t->{rows}}) {
    print "Row: ";
    for $c (@{$r->{cells}}) {
      print "[$c->{data}] ";
    }
    print "\n";
  }
}
----------------------------------
My question is: how do I refer to a specific entry, such as table 1, row
2, cell 2, without the loop?

If you were to look at the web page I'm scraping from, you would see it is
data on an oil well. I am only interested in the first four tables. I
want to assign each entry to a variable (my $serial = ...) so I can easily
get them into a database.

Does anyone have any insight that might help me out?

Hi Dennis

Well, I was in the process of writing code to use HTML::TableContentParser, just
to prove that you shouldn't use it, when the module shot itself in the foot. If
you write

my $tables = $p->parse($content);

use Data::Dumper;
print Dumper $tables;

then you will see that it has lost the data for the third table altogether (the
latitude and longitude). Checking the HTML reveals that this is because that
table has a missing <tr> tag, which confuses the parser. It is much better to use
HTML::TableExtract, which, although not perfect, has a better pedigree and is
fine for this purpose. It is also much better at handling incorrect HTML.

The program below parses the HTML, then dumps the data with the tables_dump
method. The output from this alone may be adequate for you. It then goes on to
push all the headers and data from the first four tables onto two arrays and
then print those formatted in parallel. It seems to do what you want.
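As for referring to a specific entry without the loop: each table object that
HTML::TableExtract returns has a row() method (used in the program below), so
once the page is parsed you can index straight into a row. A minimal sketch,
assuming 'table 1 row 2 tabledata 2' means the second cell of the second row of
the first table (the indices are zero-based, so adjust them to the cell you
actually want):

my @tables = $htex->tables;       # the tables, in document order
my @row    = $tables[0]->row(1);  # second row of the first table
my $serial = $row[1];             # second cell in that row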

Some comments on your own code, though. *Never* give up and comment out 'use
strict' - it exists to help you by saving you from yourself, and removing it is
much like disabling your smoke alarm so that it doesn't make a noise when you
burn the toast. Secondly, your 'use lib' statement looks suspicious. It's
occasionally necessary to point to a separate directory for a development
version of a library, but this is a public release which should have been
installed in one of the standard include paths. Again, fix the problem rather
than working around it.

I hope this helps.

Rob



use strict;
use warnings;

use LWP::Simple;
use HTML::TableExtract;
use List::Util qw/max/;

my $url = 
'http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_wsn=8383';

my $content = get $url or die "Couldn't get $url";

my $htex = HTML::TableExtract->new;
$htex->parse($content);

# tables_dump() writes its report directly to STDOUT, so it doesn't
# need to be wrapped in print
$htex->tables_dump(1);
print "\n\n";

my @tables = $htex->tables;

my (@header, @data);

# Collect the header row and data row from each of the first four tables
foreach my $table (@tables[0..3]) {
  push @header, $table->row(0);
  push @data,   $table->row(1);
}

# Width of the longest header, so the output lines up
my $len = max map length, @header;

# Print each header alongside its corresponding data value
my $i = 0;
foreach my $head (@header) {
  printf "%-*s = %s\n", $len, $head, $data[$i++];
}
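
Once you have the headers and values paired up, getting them into MySQL is a
job for DBI. The sketch below is only a rough outline - the connection details,
table name and column names are all invented, so substitute your own schema:

use DBI;

# connection details are placeholders - use your own database and credentials
my $dbh = DBI->connect('DBI:mysql:database=wells;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# 'well_info', 'serial' and 'well_name' are made-up names - use columns
# that match the headers you extracted above
my $sth = $dbh->prepare('INSERT INTO well_info (serial, well_name) VALUES (?, ?)');
$sth->execute($data[0], $data[1]);

$dbh->disconnect;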


