tracking where I am in a tree structure (was: Re: parsing HTML)

Andrew Gaffney Thu, 22 Jul 2004 11:58:36 -0700

Andrew Gaffney wrote:

Andrew Gaffney wrote:
Randy W. Sims wrote:
On 7/21/2004 11:24 PM, Andrew Gaffney wrote:
Randy W. Sims wrote:
On 7/21/2004 10:42 PM, Andrew Gaffney wrote:
I am trying to build an HTML editor for use with my HTML::Mason site. I intend for it to support nested tables, SPANs, and anchors. I am looking for a module that can help me parse existing HTML (custom or generated by my scripts) into a tree structure similar to:

my $html = [
  { tag => 'table', id => 'maintable', width => 300,
    content => [
      { tag => 'tr',
        content => [
          { tag => 'td', width => 200, content => "some content" },
          { tag => 'td', width => 100, content => "more content" }
        ]
      }
    ]
  }
]; # Not tested, but you get the idea
[snip]
I'd rather generate a structure similar to what I have above instead of having a large tree of class objects that takes up more RAM and is probably slower. How would I go about generating a structure such as that above using HTML::Parser?
Parsers like HTML::Parser scan a document and upon encountering certain tokens fire off events. In the case of HTML::Parser, events are fired when encountering a start tag, the text between tags, and at the end tag. If you have an arbitrarily deep document structure like HTML, you can store the structure using a stack:
<SNIP>
Thanks. In the time it took you to put that together, I came up with the following to figure out how HTML::Parser works. I'll use your code to expand upon it.
<SNIP>
Here is my current working code. Please take a look at it and see if there are any obvious (or not so obvious) problems. I thought this would end up being far more difficult.
parsehtml.pl
============
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();
my $htmltree = [ { tag => 'document', content => [] } ];
my $node = $htmltree->[0]->{content};
my @prevnodes = ($htmltree);
sub start {
  my $tagname = shift;
  my $attr = shift;
  my $newnode = {};
  $newnode->{tag} = $tagname;
  foreach my $key(keys %{$attr}) {
    $newnode->{$key} = $attr->{$key};
  }
  $newnode->{content} = [];
  push @prevnodes, $node;
  push @{$node}, $newnode;
  $node = $newnode->{content};
}
sub end {
  my $tagname = shift;
  $node = pop @prevnodes;
}
sub text {
  my $text = shift;
  chomp $text;
  if($text ne '') {
    push @{$node}, $text;
  }
}
my $p = HTML::Parser->new( api_version => 3,
                           start_h => [\&start, "tagname, attr"],
                           end_h   => [\&end,   "tagname"],
                           text_h  => [\&text,  "dtext"] );
$p->parse_file("test.html");
use Data::Dumper;
print Dumper $htmltree;
test.html
=========
<table id="maintable" width="300">
<tr>
<td width="200">some content</td>
<td width="100">more content</td>
</tr>
</table>

Now for the next challenge. I need to be able to know where I am in the tree structure for any node that I am in while I am walking it. I will pass along a value via CGI in the form of '0.0.2.1.2' which another script will translate as '$htmltree->[0]->{content}->[0]->{content}->[2]->{content}->[1]->{content}->[2]'. Using the above code, and the following code I wrote for walking the tree and generating HTML from it, how can I mark each outputted HTML tag with its position in the tree?
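As an aside before the tree-walking code below, the path translation described above could be sketched roughly like this. The name `node_at_path` and the hand-built `$sample` tree are assumptions made for illustration. In a real parse the indexes depend on which text nodes survive the `text` handler.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: resolve a dotted path such as '0.0.0.1' against a
# tree shaped like the parser output. The first index selects from the
# top-level array; every later index selects from the node's {content}.
sub node_at_path {
  my ($tree, $path) = @_;

  # The path arrives via CGI, so accept only digits separated by dots.
  return unless defined $path && $path =~ /^\d+(?:\.\d+)*$/;

  my @indexes = split /\./, $path;
  my $node = $tree->[ shift @indexes ];
  foreach my $i (@indexes) {
    # Text nodes are plain strings and have no children to descend into.
    return unless ref($node) eq 'HASH';
    $node = $node->{content}->[$i];
  }
  return $node;
}

# Small hand-built tree in the same shape as the parser output.
my $sample = [ { tag => 'document', content => [
  { tag => 'table', id => 'maintable', content => [
    { tag => 'tr', content => [
      { tag => 'td', width => 200, content => [ 'some content' ] },
      { tag => 'td', width => 100, content => [ 'more content' ] },
    ] },
  ] },
] } ];

my $td = node_at_path($sample, '0.0.0.1');
print "$td->{tag} width=$td->{width}\n";   # prints "td width=100"
```

An invalid or out-of-range path yields undef here, so the caller can reject it instead of dereferencing a bogus node.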

sub descend_htmltree {
  my $node = shift;
  my $withclickiness = shift || 0;

  foreach my $tmpnode (@{$node}) {
    if(ref($tmpnode) eq 'HASH') {
      my $nodeid = "";
      # Magic code to generate node's position in tree
      $htmloutput .= "<div style='border: thin solid #bbbbbb' onDblClick=\"alert('you clicked $nodeid')\">" if($withclickiness);
      $htmloutput .= "<$tmpnode->{tag}";
      foreach(keys %{$tmpnode}) {
        $htmloutput .= " $_=\"$tmpnode->{$_}\"" if($_ ne 'tag' && $_ ne 'content');
      }
      $htmloutput .= ">";
      # Pass the flag down so nested tags are wrapped as well
      descend_htmltree($tmpnode->{content}, $withclickiness);
      $htmloutput .= "</$tmpnode->{tag}>";
      $htmloutput .= "</div>" if($withclickiness);
    } else {
      $htmloutput .= "$tmpnode";
    }
  }
}

sub htmltree_to_html {
  my $filename = shift || '';
  my $withclickiness = shift || 0;

  descend_htmltree($htmltree->[0]->{content}, $withclickiness);
  if($filename ne '') {
    open HTML, "> $filename" or die "Can't open $filename for HTML output: $!";
    print HTML $htmloutput;
    close HTML;
  }

  return $htmloutput;
}
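
One possible way to get each node's position, offered only as a sketch: pass the parent's dotted path into the recursive walk and append the child's array index at each level. The name `render_with_paths` and the sample tree are hypothetical, and this version returns the HTML string instead of appending to a global. Text nodes are counted too, so the indexes match the array positions in the tree.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant of descend_htmltree. $path is the dotted position of
# the parent node ('0' for the document node); each child appends its index.
sub render_with_paths {
  my ($nodes, $path, $withclickiness) = @_;
  my $html  = '';
  my $index = 0;

  foreach my $child (@{$nodes}) {
    my $nodeid = "$path.$index";   # position of this child in the tree
    $index++;

    if (ref($child) eq 'HASH') {
      $html .= "<div style='border: thin solid #bbbbbb' onDblClick=\"alert('you clicked $nodeid')\">" if $withclickiness;
      $html .= "<$child->{tag}";
      # Sorted only so the attribute order is stable between runs.
      foreach my $key (sort keys %{$child}) {
        next if $key eq 'tag' || $key eq 'content';
        $html .= " $key=\"$child->{$key}\"";
      }
      $html .= ">";
      $html .= render_with_paths($child->{content}, $nodeid, $withclickiness);
      $html .= "</$child->{tag}>";
      $html .= "</div>" if $withclickiness;
    } else {
      $html .= $child;   # plain text node
    }
  }
  return $html;
}

# Small hand-built tree in the same shape as the parser output.
my $tree = [ { tag => 'document', content => [
  { tag => 'p', id => 'x', content => [
    'hi',
    { tag => 'b', content => [ 'there' ] },
  ] },
] } ];

# Start below the document node, whose own path is '0'.
print render_with_paths($tree->[0]->{content}, '0', 1), "\n";
```

In this sample the `<p>` tag is reported as 0.0 and the `<b>` tag as 0.0.1, because the text 'hi' occupies index 0 inside the paragraph.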

--
Andrew Gaffney
Network Administrator
Skyline Aeronautics, LLC.
636-357-1548

