Dear list,

This message is from a new LWP user; to be specific, I'm trying to use
HTML::TreeBuilder for an HTML rewriting proxy.

In http://www.ics.uci.edu/pub/websoft/libwww-perl/archive/2000h1/0021.html
Sean Burke asked some questions about HTML::TreeBuilder; my answers:

| * Is there a desire on the part of users of HTML::Element (whether
| directly or via HTML::TreeBuilder) to have it be more
| memory-efficient?

Not yet (I'm nowhere near the production stage).

| * Does anyone write applications using HTML::Element that break
| encapsulation on HTML::Element objects? That is, by accessing object
| contents directly (like $node->{"id"}) instead of using accessors,
| like $node->attr("id")?

I would rather not do that, but the interface requires some
extensions/modifications (as explained below).

| * Does anyone have any applications that actually /move/ nodes around
| in a tree of HTML::Element objects? As opposed to simply taking the
| structure HTML::TreeBuilder gives you, and traversing it, but never
| changing it?

Not yet, but this ability is the reason I'm using HTML::TreeBuilder
in the first place (I could stick with HTML::Parser otherwise).

| * Does anyone do /anything/ with HTML::Element trees, aside from
| traversing the tree and read attributes off of nodes? If so, do tell.

(Same answer.) At the moment I'm only deleting or renaming things,
which is already a problem.

| * In short, what do you all use TreeBuilder for?

(An HTML rewriting proxy.)

Actually, I started from Abiprox,

  http://www.foad.org/~abigail/Perl/

which relies on HTML::Parser, but doesn't use HTML::TreeBuilder.
The reason I switched to using HTML::TreeBuilder is the need for
tree-wide selection criteria and transformations, which requires
a full syntax tree.

But some unexpected limitations have appeared:

1) how do I delete element attributes?
There seems to be no way, except by breaking encapsulation, to remove
attributes from elements (for instance, the background attribute from
the body element). My current solution is to redefine attr(), making
it delete an attribute if a value of undef or '' is supplied:

  sub HTML::Element::attr
  {
      my $self = shift;
      my $attr = lc shift;
      if (@_) {  # set
          #warn("a value of '$_[0]' was supplied for the '$attr' attribute\n");
          my $old = $self->{$attr};
          #--- new code by rp: ---
          #
          if (!defined $_[0] || !length $_[0]) {  # deletion requested
              # we have to include '', undef is transformed into ''
              # on its way here, somehow
              #warn("deleting the '$attr' attribute\n");
              return delete $self->{$attr};
          }
          #
          #--- end of new code by rp ---#
          $self->{$attr} = $_[0];
          return $old;
      } else {  # get
          return $self->{$attr};
      }
  }

2) how do I modify a tree while traversing it?

traverse() gets confused if an element is deleted during traversal.
My current solution is to redefine it to use a copy of, rather than
a reference to, the array of children; the diff:

  16,19c16,19
  <     my $children_r = $self->{'_content'};
  <     for(my $i = 0; $i < @$children_r; ++$i) {
  <       if(ref $children_r->[$i]) { # a real node
  <         $sub->( $children_r->[$i] ); # recurse.
  ---
  >     my @children_r = @{$self->{'_content'}};
  >     for(my $i = 0; $i < @children_r; ++$i) {
  >       if(ref $children_r[$i]) { # a real node
  >         $sub->( $children_r[$i] ); # recurse.
  22c22
  <         $callback->( $children_r->[$i], 1, $depth, $self, $i )
  ---
  >         $callback->( $children_r[$i], 1, $depth, $self, $i )

I'm not aware of any real drawbacks. In the average case, having to
copy the array won't induce much overhead.

3) what's the matter with framesets?

<frameset>s are wrapped within <body>, which causes framed pages to
turn up empty, at least in my version of Netscape. Why does this
happen? Is it a design decision or just a byproduct of the
implementation? (I haven't worked around this yet.)
If these problems can be addressed within HTML::TreeBuilder,

Despite these problems, I must say both LWP and its documentation
are a joy to use.

-- 
Reinier Post
[EMAIL PROTECTED]
