Re: parsing HTML

Randy W. Sims Wed, 21 Jul 2004 21:00:29 -0700

On 7/21/2004 11:24 PM, Andrew Gaffney wrote:

Randy W. Sims wrote:
On 7/21/2004 10:42 PM, Andrew Gaffney wrote:
I am trying to build an HTML editor for use with my HTML::Mason site. I intend for it to support nested tables, SPANs, and anchors. I am looking for a module that can help me parse existing HTML (custom or generated by my scripts) into a tree structure similar to:

my $html = [
    { tag => 'table', id => 'maintable', width => 300, content => [
        { tag => 'tr', content => [
            { tag => 'td', width => 200, content => "some content" },
            { tag => 'td', width => 100, content => "more content" },
        ] },
    ] },
]; # Not tested, but you get the idea


[snip]

I'd rather generate a structure similar to what I have above instead of having a large tree of class objects that takes up more RAM and is probably slower. How would I go about generating a structure such as that above using HTML::Parser?

Parsers like HTML::Parser scan a document and fire off events when they encounter certain tokens. In the case of HTML::Parser, events are fired for a start tag, for the text between tags, and for an end tag. Because a document like HTML can nest arbitrarily deep, you can keep track of the structure using a stack:

#!/usr/bin/perl
package SampleParser;

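use strict;
use warnings;

# 'use base' loads HTML::Parser and sets it up as the parent class.
use base qw(HTML::Parser);

# Start tag: print it indented by the current depth, then push a
# marker so that anything nested inside is indented one level deeper.
sub start {
    my($self, $tagname, $attr, $attrseq, $origtext) = @_;
    my $stack = $self->{_stack} ||= [];
    print ' ' x scalar(@$stack), "<$tagname>\n";
    push @$stack, $tagname;
}

# End tag: pop the marker first so the closing tag lines up with
# its opening tag.
sub end {
    my($self, $tagname, $origtext) = @_;
    my $stack = $self->{_stack} ||= [];
    pop @$stack;
    print ' ' x scalar(@$stack), "</$tagname>\n";
}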

1;

package main;

use strict;
use warnings;

my $p = SampleParser->new();
$p->parse_file(\*DATA);

__DATA__
<html>
<head>
<title>Title</title>
</head>
<body>
The body.
</body>
</html>
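
The same stack idea can be pushed a bit further to build the nested array/hash structure from the original question instead of just printing tags. What follows is only a rough sketch under some assumptions: the build_tree name and the '#root' sentinel node are made up for illustration, every element gets 'tag' and 'content' keys (so an HTML attribute that is itself called "tag" or "content" would be overwritten), and the input is assumed to be well nested. It uses the version 3 handler interface of HTML::Parser rather than subclassing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parser;
use Data::Dumper;

# Hypothetical helper: parse a string of HTML into nested hashes/arrays.
sub build_tree {
    my ($html) = @_;

    # Sentinel root (illustrative); its content becomes the result.
    my $root  = { tag => '#root', content => [] };
    my @stack = ($root);    # the innermost open element is always last

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $attr) = @_;
            # Attributes first, so 'tag' and 'content' win on a name clash.
            my $node = { %$attr, tag => $tagname, content => [] };
            push @{ $stack[-1]{content} }, $node;   # attach to current parent
            push @stack, $node;                     # descend into the new node
        }, 'tagname, attr' ],
        text_h => [ sub {
            my ($text) = @_;
            # Skip whitespace-only text between tags.
            push @{ $stack[-1]{content} }, $text if $text =~ /\S/;
        }, 'dtext' ],
        end_h => [ sub {
            my ($tagname) = @_;
            # Only pop when the end tag matches the innermost open element,
            # and never pop the sentinel root.
            pop @stack if @stack > 1 && $stack[-1]{tag} eq $tagname;
        }, 'tagname' ],
    );
    $p->unbroken_text(1);   # deliver each text run as a single chunk
    $p->parse($html);
    $p->eof;

    return $root->{content};
}

my $tree = build_tree(
    '<table id="maintable" width="300"><tr>'
  . '<td width="200">some content</td>'
  . '<td width="100">more content</td>'
  . '</tr></table>'
);
print Dumper($tree);
```

Note that 'content' here is always an array ref, even for a single piece of text, which keeps the handlers simple. The sketch makes no attempt to cope with elements that are never closed (br, img, an unclosed head, and so on); those stay open on the stack and swallow whatever follows. Handling that kind of thing is a large part of what HTML::TreeBuilder does for you, so it may be worth measuring before ruling it out.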
