breaking encapsulation on HTML::TreeBuilder|Element objects

Sean M. Burke Mon, 27 Mar 2000 19:31:21 -0800
At 06:41 PM 2000-03-27 +0200, Reinier Post wrote:
>[...] 
>| * Does anyone write applications using HTML::Element that break
>| encapsulation on HTML::Element objects?  That is, by accessing object
>| contents directly (like $node->{"id"}) instead of using accessors,
>| like $node->attr("id")?
>
>I would rather not do that, but the interface requires
>some extensions/modifications (as explained below).
>[...]

Okay, everyone, be aware that:
* $node->attr('attrname', undef) now deletes the attribute.
* $node->all_attr() returns a list of the key/value pairs of all
attributes.  (Example return value: ('_tag', 'br', 'clear', 'all', 'id',
'123').)
* $node->all_external_attr() returns a list of the key/value pairs of all
"external" attributes (i.e., things corresponding to actual attributes in
the SGML element, excluding things like _tag, _implicit, _content, etc.).
Example return value: ('clear', 'all', 'id', '123').
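
For illustration, here is a rough model of those semantics using a plain
hash rather than HTML::Element itself.  The helper names
(external_attr_pairs, set_or_delete_attr) are hypothetical, and two things
are assumptions of the sketch: that every internal key starts with an
underscore, and that keys come back sorted (which only makes the example
output predictable).

```perl
use strict;
use warnings;

# Hypothetical stand-in for an element's underlying hash; this is NOT
# HTML::Element, only a model of the behaviour described above.
my %node = (
    _tag    => 'br',
    _parent => undef,
    clear   => 'all',
    id      => '123',
);

# Assumption: internal keys all begin with an underscore, so anything
# else counts as an "external" attribute.  Sorting is for the demo only.
sub external_attr_pairs {
    my ($h) = @_;
    return map { ($_, $h->{$_}) } sort grep { !/^_/ } keys %$h;
}

# Models attr('name', undef): an undefined value deletes the attribute.
sub set_or_delete_attr {
    my ($h, $name, $value) = @_;
    if (defined $value) { $h->{$name} = $value }
    else                { delete $h->{$name} }
    return;
}

my @pairs = external_attr_pairs(\%node);
set_or_delete_attr(\%node, 'clear', undef);
my @after = external_attr_pairs(\%node);

print "@pairs\n";   # clear all id 123
print "@after\n";   # id 123
```

The point is that calling code never needs to look inside the hash in
order to list or delete attributes.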

I suppose I could add something returning just the names of all attributes,
or of just the external attributes.  Would anyone find that useful?

In any case, I hope that the new methods, above, mean that no-one ever has
to break encapsulation on an HTML::Element object.  Am I right, or are
there cases I've not thought of?

>| * Does anyone do /anything/ with HTML::Element trees, aside from
>| traversing the tree and read attributes off of nodes?  If so, do tell.
>
>(Same answer.)  At the moment I'm only deleting or renaming things,
>which is already a problem.

Aside from the problem of deleting attributes (now solved, I hope),
modifying while traversing (fundamentally intractable, I think), and the
frameset problem (I'm working on it), what problems are you running into?


If you want advice on how to have your traverse callback make notes on
modifying a tree when the traversal is finished, do say so.  There are
several ways to do it.
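
One such way, sketched very roughly: have the callback only record what it
wants changed, and apply those notes after the walk is over.  The tree
below is a hypothetical structure of plain hashes (make_node, walk and
tags are made-up helpers), not HTML::Element and not its traverse()
method; it only shows the two-pass idea.

```perl
use strict;
use warnings;

# Hypothetical minimal tree: each node is a hash with a tag and a list
# of children.  This models the idea only.
sub make_node {
    my ($tag, @kids) = @_;
    return { tag => $tag, kids => [@kids] };
}

my $tree = make_node('body',
    make_node('p', make_node('font'), make_node('b')),
    make_node('font'),
);

# Simple pre-order walk that calls the callback on every node.
sub walk {
    my ($node, $callback) = @_;
    $callback->($node);
    walk($_, $callback) for @{ $node->{kids} };
    return;
}

# Pass 1: read-only traversal; just make notes of [parent, child] pairs.
my @to_delete;
walk($tree, sub {
    my ($node) = @_;
    for my $kid (@{ $node->{kids} }) {
        push @to_delete, [$node, $kid] if $kid->{tag} eq 'font';
    }
});

# Pass 2: the traversal is finished, so modifying the tree is now safe.
for my $note (@to_delete) {
    my ($parent, $child) = @$note;
    @{ $parent->{kids} } = grep { $_ != $child } @{ $parent->{kids} };
}

# Flatten the tree to a list of tag names, to show the result.
sub tags {
    my ($n) = @_;
    return ($n->{tag}, map { tags($_) } @{ $n->{kids} });
}

print join(' ', tags($tree)), "\n";   # body p b
```

Because nothing is removed until the walk has returned, the traversal
never sees a child list changing underneath it.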


BTW, another thing I'm quite likely to add in the next release: an option
for controlling which elements as_HTML will emit close-tags for.  (I.e.,
an ability to override the current set, which currently suppresses
such things as </p> and </li>.)

I've talked about this before, but I should really get around to having
TreeBuilder and Element get their information about HTML from an external
module, instead of each poking around at each other's package variables.
That parametrics module would be available separately in CPAN for all to use.

I've been puzzling around with the idea of changing the underlying type of
HTML::Element objects to be arrays.  True, hashes seem the obvious choice
for representing elements (since an element's attributes are a key=value
mapping), and hashes are nice and fast.  But since the average element has
no external attributes at all, and it's almost unheard-of for an element to
have more than three external attributes, the additional memory overhead of
having a hash (instead of an array) for every element in a potentially very
large parse tree is pretty significant.

The most radical version of this idea would be to change from this
representation:
  {'_tag' => 'foo',
   '_parent' => $some_node,
   '_content' => [$node2, $node3],
   'id' => 'stuff',
  }
to something like:
  [
   'foo',       # 0: always the tag name
   $some_node,  # 1: always the parent node
   5,           # 2: index of the start of contents
   'id',        # 3 .. $self->[2]-1: attribute keys and values
     'stuff',
   $node2,      # $self->[2] .. $#$self: contents (children)
   $node3,
  ]

This would basically mean breaking any code that uses the return value of
$node->content to do things like:
  $last = pop @{$node->content};
Worse, it would break silently: the pop would successfully modify the
temporary listref that I could have the new content() method return, but
that would have no effect on the actual content of $node.
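
To make that failure mode concrete, here is a rough sketch of an
array-based node following the layout above.  The content() and
external_attrs() subs are hypothetical stand-ins written for this example
(not the module's actual methods), and the children are plain strings to
keep it short.

```perl
use strict;
use warnings;

# Hypothetical array-based node, using the layout sketched above:
# [tag, parent, content_start, attr key/value pairs..., children...]
my $node = ['foo', undef, 5, 'id', 'stuff', 'node2', 'node3'];

# Hypothetical accessor: builds and returns a NEW listref on every call,
# i.e. a copy of the children slice, not a reference into the node.
sub content {
    my ($self) = @_;
    return [ @{$self}[ $self->[2] .. $#$self ] ];
}

# Hypothetical accessor: attribute pairs sit between index 3 and the
# start of the contents.
sub external_attrs {
    my ($self) = @_;
    return @{$self}[ 3 .. $self->[2] - 1 ];
}

# Old-style code: this pops from the temporary copy only.
my $last = pop @{ content($node) };

# The node itself is untouched, and nothing warned about it.
my $count = scalar @{ content($node) };

print "popped: $last\n";               # popped: node3
print "children remaining: $count\n";  # children remaining: 2
```

The pop returns a sensible value, so code written against the hash-based
representation would appear to work while changing nothing.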

And while it's tempting to say that people who break encapsulation get what
they deserve when their code breaks (whether horribly or silently), the fact
is that Element's interface was so limited until recently that developers
had to break encapsulation to get much of anything done.

--
Sean M. Burke [EMAIL PROTECTED] http://www.netadventure.net/~sburke/