For a while now, people have been asking for a way to deal with
comments (at least) in Element and TreeBuilder; and so in the latest
HTML::Element, I added experimental and undocumented support for
comments and other non-element non-text things that can appear in an
HTML document.

I'll now describe the current state of these for the benefit of the
list, in the hopes that folks on the list might experiment with it,
and point out changes that should be made before I actually document
this (and thereby commit to the way this works).

HTML::Element as of 1.53 (most recent release of HTML-Tree) supports
storing comments, as elements of tagname "~comment"; declarations
with tagname "~declaration"; and PIs with tagname "~pi". Moreover,
literal content (i.e., things you don't want ampersand encoded or
any such thing) can be included thru an element of tagname
"~literal". (You can tell I've chosen "~" as the prefix character
because, as far as I can tell, it's quite impossible for that character
to occur in the tagname/GI of any HTML/SGML/XML element.)

The code for Element's starttag method should illustrate how this
works:

sub starttag
{
  my($self, $entities) = @_;
  my $name = $self->{'_tag'};

  # TODO: document these...
  return $self->{'text'} if $name eq '~literal';
  return "<!" . $self->{'text'} . ">" if $name eq '~declaration';
  return "<?" . $self->{'text'} . "?>" if $name eq '~pi';
  if($name eq '~comment') {
    if(ref($self->{'text'} || '') eq 'ARRAY') {
      # several comments inside a single <! ... > declaration
      return
        "<!" .
        join(' ', map("--$_--", @{$self->{'text'}}))
        . ">"
      ;
    } else {
      return "<!--" . $self->{'text'} . "-->"
    }
  }
  # [end of new special cases]

  my $tag = $html_uc ? "<\U$name" : "<\L$name";
  my $val;
  for (sort keys %$self) { # predictable ordering
    next if m/^_/s;
    $val = $self->{$_};
    # Hm -- what to do if val is undef?
    # I suppose that shouldn't ever happen.
    if ($_ eq $val && # if attribute is boolean, for this element
        exists($boolean_attr{$name}) &&
        (ref($boolean_attr{$name}) ? $boolean_attr{$name}{$_} :
                                     $boolean_attr{$name} eq $_)
       ) {
      $tag .= $html_uc ? " \U$_" : " \L$_";
    } else { # non-boolean attribute
      if ($val !~ m/^[0-9]+$/s) { # quote anything not purely numeric
        # Might as well double-quote everything, for simplicity's sake
        HTML::Entities::encode_entities($val, $entities);
        $val = qq{"$val"};
      }
      $tag .= $html_uc ? qq{ \U$_\E=$val} : qq{ \L$_\E=$val};
    }
  }
  "$tag>";
}
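
To try the special cases above in isolation, here is a minimal
standalone sketch. It does not use HTML::Element at all: render_pseudo
is a hypothetical helper name, and the nodes are plain hashes that
merely borrow the '_tag' and 'text' keys. It only mirrors the four
"~" branches of starttag, so treat it as an illustration and not as
the module's actual behavior.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Element): renders a pseudo-element
# given as a plain hash, following the same special cases as starttag above.
sub render_pseudo {
  my($node) = @_;
  my $name = $node->{'_tag'};
  my $text = $node->{'text'};

  return $text               if $name eq '~literal';      # emitted as-is
  return "<!" . $text . ">"  if $name eq '~declaration';
  return "<?" . $text . "?>" if $name eq '~pi';
  if($name eq '~comment') {
    # An array ref means several comments inside one <! ... > declaration
    return "<!" . join(' ', map("--$_--", @$text)) . ">"
      if ref($text) eq 'ARRAY';
    return "<!--" . $text . "-->";
  }
  return;  # not a pseudo-element; a real starttag would build "<tag ...>"
}

print render_pseudo({ '_tag' => '~comment',     'text' => 'hi' }), "\n";
# prints: <!--hi-->
print render_pseudo({ '_tag' => '~comment',     'text' => ['a', 'b'] }), "\n";
# prints: <!--a-- --b-->
print render_pseudo({ '_tag' => '~declaration', 'text' => 'DOCTYPE html' }), "\n";
# prints: <!DOCTYPE html>
print render_pseudo({ '_tag' => '~literal',     'text' => 'a < b' }), "\n";
# prints: a < b
```

Note that the array form produces one declaration holding several
comments, and that '~literal' is the only case where no delimiters
are added at all.
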

Now, this-all explains how comments, PIs, and declarations are stored
and written back out. I've also fiddled with TreeBuilder to support
catching the signals that Parser sends when it actually sees these
things (instead of just ignoring them, as it still does by default).

TreeBuilder's new code (v2.96) for doing that looks like this
(omitting some $Debug code):

# TODO: test whether comment(), declaration(), and process(), do the right
# thing as far as tightening and whatnot.
# Also, currently, doctypes and comments that appear before head or body
# show up in the tree in the wrong place. Something should be done about
# this. Tricky. Maybe this whole business of pre-making the body and
# whatnot is wrong.
sub comment {
  #TODO: document this
  return unless $_[0]->{'_store_comments'};
  my($self, $text) = @_;
  my $pos = $self->{'_pos'} || $self;
  (my $e = HTML::Element->new('~comment'))->{'text'} = $text;
  $pos->push_content($e);
  return;
}
#==========================================================================
sub declaration {
  #TODO: document this
  return unless $_[0]->{'_store_declarations'};
  my($self, $text) = @_;
  my $pos = $self->{'_pos'} || $self;
  (my $e = HTML::Element->new('~declaration'))->{'text'} = $text;
  $pos->push_content($e);
  return;
}
#==========================================================================
sub process {
  #TODO: document this
  return unless $_[0]->{'_store_pis'};
  my($self, $text) = @_;
  my $pos = $self->{'_pos'} || $self;
  (my $e = HTML::Element->new('~pi'))->{'text'} = $text;
  $pos->push_content($e);
  return;
}
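
The three handlers above all follow one pattern: return immediately
unless a flag is set, otherwise append a pseudo-element at the
current position. Here is a rough, self-contained sketch of just that
pattern. ToyBuilder and _store are made-up names, and the "tree" is
only a hash with a '_content' list, so none of this is TreeBuilder's
real code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a tree builder: a blessed hash with a content
# list. Only the flag check and the node creation mirror the handlers above.
package ToyBuilder;

sub new { return bless { '_content' => [] }, shift }

# Shared body of the three handlers: do nothing unless the named flag is
# true, otherwise append a pseudo-element node at the current position.
sub _store {
  my($self, $flag, $tag, $text) = @_;
  return unless $self->{$flag};
  my $pos = $self->{'_pos'} || $self;
  push @{ $pos->{'_content'} }, { '_tag' => $tag, 'text' => $text };
  return;
}

sub comment     { $_[0]->_store('_store_comments',     '~comment',     $_[1]) }
sub declaration { $_[0]->_store('_store_declarations', '~declaration', $_[1]) }
sub process     { $_[0]->_store('_store_pis',          '~pi',          $_[1]) }

package main;

my $builder = ToyBuilder->new;
$builder->comment('dropped');          # flag not set yet, so this is ignored
$builder->{'_store_comments'} = 1;     # switch comment storing on
$builder->comment('kept');
$builder->declaration('DOCTYPE html'); # still off, so this is ignored too

print scalar(@{ $builder->{'_content'} }), "\n";    # prints: 1
print $builder->{'_content'}[0]{'text'}, "\n";      # prints: kept
```

Because a missing hash key is false, nothing is stored until the
caller opts in, one kind of node at a time.
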

Those '_store_comments', '_store_declarations', and
'_store_pis' attributes don't exist (and therefore are false)
by default in TreeBuilder objects, but they can be turned on
with the attr() method. Here is some code that shows off using
all of these new pseudoelements:

use strict;
use HTML::Element 1.53;
use HTML::TreeBuilder 2.96;
print "Parser version: ", $HTML::Parser::VERSION, "\n";

{
  my $x = HTML::Element->new('p');
  my $c = HTML::Element->new('~comment');
  $x->push_content($c);
  $c->attr('text', 'I like potatoes!');
  print "{\n", $x->as_HTML, "}\n\n";

  $c->attr('text', ['I like potatoes!', 'carrots are good too', 'mmm']);
  print "{\n", $x->as_HTML, "}\n\n";

  $x->push_content(
    # a ~literal allows anything. as_HTML just inserts whatever's
    # in its 'text' attribute, unquestioningly
    # maybe this should change to '_text' for all these ~-tags?
    HTML::Element->new('~literal', 'text', "FNAR <>!<>!<\n\n> ZAZ KRAAA"),
    'hoo<>boy!'
  );
  print "{\n", $x->as_HTML, "}\n\n";
  $x->delete;
}

{
  # And now, getting TreeBuilder to make some of these elements
  # (Except ~literal -- currently nothing TreeBuilder does would
  # call for making a ~literal node)
  my $y = HTML::TreeBuilder->new();
  $y->attr('_store_comments',1);
  $y->attr('_store_declarations',1);
  $y->attr('_store_pis',1); # PIs will work only under recent versions of
    # HTML::Parser ... or maybe under some ancient ones, too. Not sure.
  $y->parse(<<'EOTHING');
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"stuff">
<? xml-stylesheet href="lala" type="text/css" ?>
<p>hmmm <!-- lala
stuff things -->ohyeah!
EOTHING
  $y->eof;
  print "{", $y->as_HTML, "}\n\n";
  $y->delete;
}
exit;
The above code outputs this:
Parser version: 2.20
{
<p><!--I like potatoes!-->
}
{
<p><!--I like potatoes!-- --carrots are good too-- --mmm-->
}
{
<p><!--I like potatoes!-- --carrots are good too-- --mmm-->FNAR <>!<>!<
> ZAZ KRAAAhoo<>boy!
}
{<html><head></head><body><? xml-stylesheet href="lala"
type="text/css" ?> <p>hmmm <!-- lala
stuff things -->ohyeah! </body><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"stuff"></html>
}

Note that old versions of Parser, such as the one I used when I ran
the above code, do not recognize PIs -- that's why the PI in the source
ended up getting munged into "<? xml-stylesheet", etc.

Now, I don't see PIs as exactly crucial to HTML -- and if what
someone has is XHTML, they should be feeding it to an XML processor,
not TreeBuilder. However, since Parser supports recognizing them
(in recent versions at least), I thought I'd at least have a hack
at dealing with them. Same story with directives.

--
Sean M. Burke http://www.netadventure.net/~sburke/