Dennis Bourn wrote:
I'm trying to hack together a Perl script to screen-scrape some data from a table on a webpage and enter that data into a MySQL database. This would be my first attempt using Perl and HTML::TableContentParser. The following script was created using bits and pieces I've found in various Perl examples on the web:

-------------------------------
#!/usr/bin/perl
#use strict;
use lib '/opt/local/lib/perl5/vendor_perl/5.8.6/';
use HTML::TableContentParser;

my $url = 'http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_wsn=8383';

use LWP::Simple;
my $content = get $url;
die "Couldn't get $url" unless defined $content;

$p = HTML::TableContentParser->new();
my $tables = $p->parse($content);

for $t (@$tables) {
    for $r (@{$t->{rows}}) {
        print "Row: ";
        for $c (@{$r->{cells}}) {
            print "[$c->{data}] ";
        }
        print "\n";
    }
}
----------------------------------

My question is: how do I refer to a specific entry, such as table 1, row 2, cell 2, without the loop? If you look at the web page I'm scraping, you can see it's data on an oil well; I am only interested in the first four tables. I want to assign each entry to its own variable (my $serial = ...) so I can easily get them into the database. Does anyone have any insight that might help me out?
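From the loop it looks as though parse() returns a reference to an array of table hashrefs, each holding a 'rows' arrayref whose rows in turn hold a 'cells' arrayref. So I'm guessing a single cell could be indexed directly, something like this (untested):

    # second cell of the second row of the first table - guessed
    # from the structure the loop above walks
    my $serial = $tables->[0]{rows}[1]{cells}[1]{data};

but I haven't been able to confirm it.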
Hi Dennis

Well, I was in the process of writing code to use HTML::TableContentParser, just to prove that you shouldn't use it, when the module shot itself in the foot. If you write

    my $tables = $p->parse($content);

    use Data::Dumper;
    print Dumper $tables;

then you will see that it has lost the data for the third table altogether (the latitude and longitude). Checking the HTML reveals that this is because that table has a missing <tr> tag, which confuses the parser.

Much better to use HTML::TableExtract, which, although not perfect, has a better pedigree and is fine for this purpose. It is also much better at handling incorrect HTML.

The program below parses the HTML, then dumps the data with the tables_dump method. The output from this alone may be adequate for you. It then goes on to push all the headers and data from the first four tables onto two arrays, and prints those formatted in parallel. It seems to do what you want.

Some comments on your own code, though. *Never* give up and comment out 'use strict' - it exists to help you by saving you from yourself, and removing it is much like disabling your smoke alarm so that it doesn't make a noise when you burn the toast. Secondly, your 'use lib' statement looks suspicious. It is occasionally necessary to point to a separate directory for a development version of a library, but this is a public release, which should have been installed somewhere in one of the include paths. Again, fix the problem rather than working around it.

I hope this helps.

Rob

use strict;
use warnings;

use LWP::Simple;
use HTML::TableExtract;
use List::Util qw/max/;

my $url = 'http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellinfo2?p_wsn=8383';

my $content = get $url or die "Couldn't get $url";

my $htex = HTML::TableExtract->new;
$htex->parse($content);

print $htex->tables_dump(1);
print "\n\n";

my @tables = $htex->tables;
my (@header, @data);

foreach my $table (@tables[0..3]) {
    push @header, $table->row(0);
    push @data,   $table->row(1);
}

my $len = max map length, @header;

my $i = 0;
foreach my $head (@header) {
    printf "%-*s = %s\n", $len, $head, $data[$i++];
}
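PS. Since the end goal is a MySQL database, here is an untested DBI sketch showing one way to load the @header/@data pairs built above. The database name, table name, column names and credentials are all made up, so substitute your own:

    use DBI;

    # connect to a hypothetical 'wells' database on localhost
    my $dbh = DBI->connect('DBI:mysql:database=wells;host=localhost',
        'user', 'password', { RaiseError => 1 });

    # insert one row per header/value pair, using placeholders so
    # the scraped values are quoted safely
    my $sth = $dbh->prepare(
        'INSERT INTO well_info (field_name, field_value) VALUES (?, ?)');

    for my $i (0 .. $#header) {
        $sth->execute($header[$i], $data[$i]);
    }

    $dbh->disconnect;

This assumes a simple two-column name/value table; if you would rather have one column per field, build the column list from @header and insert a single row instead.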