Re: HTML::Parser walkthrough?

Ed Summers Wed, 28 Jan 2004 21:09:13 -0800

Hi Ben:

On Mon, Jan 26, 2004 at 04:47:17PM -0500, Ben Ostrowsky wrote:
> HTML > BODY > TABLE[3] > TR[7] > TD[1]


HTML::Parser is quite complex and, as Chuck said, you need to get your
head around callbacks before you can start working with it. Callbacks are
a really handy technique, and as Chuck also said they are not confined to
parsing HTML.
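
To make the callback idea concrete, here is a minimal sketch using HTML::Parser's version 3 API. The sample HTML string and the @events array are made up for illustration; the point is just that you hand the parser code references, and it calls them back as it meets start tags, end tags and text.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical sample document, inlined so nothing is fetched.
my $html = '<html><body><p>Hello <b>world</b></p></body></html>';

my @events;    # filled in by the callbacks as the parser runs

# Each handler is a code reference plus an "argspec" string that tells
# HTML::Parser which values to pass to that code reference.
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub { push @events, "start:$_[0]" }, 'tagname' ],
    end_h       => [ sub { push @events, "end:$_[0]" },   'tagname' ],
    text_h      => [ sub { push @events, "text:$_[0]" },  'dtext' ],
);

$parser->parse($html);
$parser->eof;    # signal end of input so any buffered text is flushed

print "$_\n" for @events;
```

The parser never builds a tree here; it just reports events in document order, so it is up to your callbacks to remember where they are in the page.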

You may also want to look at HTML::Tree [1] and its associated modules.
HTML::Tree allows you to build an in-memory data structure of the HTML
page...kinda like a Perlish Document Object Model (DOM). Once you've
got the page in memory you can dig in to the place that you are interested
in, and extract the value. Here's an example, retrieving a fictitious page
from NOAA.

    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use LWP::Simple;

    ## get() returns undef if the fetch fails
    my $html = get( 'http://www.noaa.gov/ben.html' )
        or die "couldn't fetch the page\n";
    my $tree = HTML::TreeBuilder->new_from_content( $html );
    my $body = $tree->look_down( _tag => 'body' );

    ## get the third <table> 
    my $count = 0;
    my $table;
    foreach my $element ( $body->content_list() ) {
        ## text children are plain strings, so only call tag() on objects
        $count++ if ( ref $element and $element->tag() eq 'table' );
        if ( $count == 3 ) {
            $table = $element;
            last;
        }
    }

    ## get the 7th <tr> 
    $count = 0;
    my $row;
    foreach my $element ( $table->content_list() ) {
        ## skip text children here too
        $count++ if ( ref $element and $element->tag() eq 'tr' );
        if ( $count == 7 ) { 
            $row = $element;
            last;
        }
    }

    ## extract the first <td>, skipping any text or <th> children
    my ( $td ) = grep { ref $_ and $_->tag() eq 'td' } $row->content_list();

    ## and print it!
    print $td->as_text();

Of course, there would need to be some error checking in here to make sure
that we really got the table, tr and td elements before calling methods
on them :)
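
One way that error checking might look is sketched below. The nth_child() helper is hypothetical (it is not part of HTML::Tree), and the inlined HTML with the TABLE[2] > TR[2] > TD[1] path is made up so the example runs without fetching anything. The helper returns undef instead of dying when a step of the path is missing, so the steps can be chained and checked once at the end.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return the $n-th (1-based) child element of
# $parent whose tag is $tag, or undef if there is no such child.
# Text children are plain strings, so they are skipped with ref().
sub nth_child {
    my ( $parent, $tag, $n ) = @_;
    return unless ref $parent and $n >= 1;
    my @matches = grep { ref $_ and $_->tag eq $tag } $parent->content_list;
    return $matches[ $n - 1 ];
}

# Made-up sample page with two tables.
my $html = '<html><body>'
    . '<table><tr><td>a</td></tr></table>'
    . '<table><tr><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>'
    . '</body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);
my $body = $tree->look_down( _tag => 'body' );

# Walk the path one step at a time; a missing step just yields undef.
my $table = nth_child( $body,  'table', 2 );
my $row   = nth_child( $table, 'tr',    2 );
my $cell  = nth_child( $row,   'td',    1 );
my $text  = $cell ? $cell->as_text : undef;

# A path that does not exist in the sample page.
my $missing = nth_child( nth_child( $body, 'table', 3 ), 'tr', 7 );

print defined $text ? "$text\n" : "path not found\n";

$tree->delete;    # free the tree when done with it
```

This only looks at direct children, just like the loops above, so it assumes each tr sits directly under its table in the parsed tree.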

Sean Burke is the current maintainer of HTML::Tree, and has written some
good articles on parsing HTML, a few of which are included in the
HTML::Tree distribution [2,3,4]. If you really get interested you can
buy (or perhaps check out :) his book Perl & LWP [5], which has lots of good
info on parsing HTML. Strongly recommended!

//Ed

[1] http://search.cpan.org/perldoc?HTML::TreeBuilder
[2] http://search.cpan.org/perldoc?HTML::Tree::AboutObjects
[3] http://search.cpan.org/perldoc?HTML::Tree::AboutTrees
[4] http://search.cpan.org/perldoc?HTML::Tree::Scanning
[5] http://www.oreilly.com/catalog/perllwp/

-- 
Ed Summers
aim: inkdroid
web: http://www.inkdroid.org

The best writing is rewriting. [E. B. White]