Re: Encoding angle brackets in HTML text nodes

Webley Silvernail Fri, 24 Feb 2012 08:03:00 -0800


On 23/02/2012 00:59, Webley Silvernail wrote:
>>
>> I have an HTML page that is updated automatically each day. I am
>> using  HTML::TreeBuilder to create and insert the new content.
>>
>> Most of the time, this works fine, but I've hit a snag when existing
>> text nodes on the page include a gt or lt symbol.
>>
>>
>> For example, I might have an existing element on the page that looks
>> like this:
>>
>> <td>&lt;B</td>
>>
>> When the page is updated, depending on how I print the output, this
>> may cause problems.
>>
>> Some techniques I use to print the output work OK for the new part
>> but affect the existing content adversely. Other techniques work well
>> with the existing content but cause problems with the new content.
>>
>> Here are some of the output approaches I have tried:
>>
>> I.
>>
>> print OUT $root->as_HTML('', '', {});
>>
>>
>> Results: new content looks good, but the existing content is affected:
>>
>> <td><B</td>     # The browser won't render this and generally just blanks out the text node.
>>
>>
>> II.
>> print OUT $root->as_HTML('<>&', '', {});
>>
>> Results: existing content looks good; new content is output with all
>> of the < > in the HTML source encoded as entity references (i.e. raw
>> HTML is rendered by the browser).
>>
>> III.
>> use Encode qw(encode decode);
>> ...
>> my $string_rep = $root->as_HTML('<>&', '', {});
>> print OUT encode('UTF-8', $string_rep);
>>
>>
>> Results: same as test II.
>>
>> IV.
>> use HTML::Entities;
>> ...
>> my $string_rep = $root->as_HTML('<>&', '', {});
>> print OUT encode_entities($string_rep);
>>
>>
>> Results: Entire page is output with all of the < > in the HTML source
>> encoded as entity references (i.e. raw HTML is rendered by the browser).
>>
>>
>> V.
>> Various iterations of the above approaches using a subsequent call to
>> HTML Tidy to attempt to clean up the HTML.

>Hey Webley
>
>Approach II is the correct one. The problem is with the way you are
>adding your new content, which is presumably as a text string (in which
>case HTML::Element is correct to render it as text!).
>
>The correct way is to build an HTML::Element tree with calls like
>
>  my $tree = HTML::TreeBuilder->new_from_content($content);
>
>   my $new = HTML::Element->new('b');
>   $new->push_content('This text in BOLD');
>
>   my $place = $tree->look_down(_tag => 'div', id => 'insertion');
>   $place->push_content($new);
>
>all depending on what you want to insert and how you locate the place in
>the document to insert it. The above will build content like
>
>  <b>This text in BOLD</b>
>
>and insert it under an element marked
>
>  <div id="insertion">
>
>An alternative is to pass your new string to HTML::TreeBuilder to build
>a new HTML fragment from your string and then insert that into your
>document.
>
>HTH,
>
>Rob
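
Rob's alternative (parsing the new markup and grafting the result into the
document) might look roughly like the sketch below. The page and fragment
strings are made up for illustration, and pulling the parsed nodes out of the
fragment's <body> with detach_content is just one way to do it, not
necessarily what Rob had in mind.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical page and fragment, for illustration only
my $page     = '<html><body><div id="insertion"></div></body></html>';
my $new_html = '<b>This text in BOLD</b>';

my $tree = HTML::TreeBuilder->new_from_content($page);
my $frag = HTML::TreeBuilder->new_from_content($new_html);

# The parser wraps the fragment in html/head/body, so take the nodes
# that ended up inside body
my $body  = $frag->look_down(_tag => 'body');
my @nodes = $body->detach_content;

# Graft the parsed nodes into the target element of the main document
my $place = $tree->look_down(_tag => 'div', id => 'insertion');
$place->push_content(@nodes);
$frag->delete;

my $out = $tree->as_HTML('<>&', '', {});
print $out, "\n";
$tree->delete;
```

Because the inserted nodes are real elements rather than a string, as_HTML
with '<>&' should leave their tags alone and only encode text content.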

Hi, Rob -
Thanks for the response.  I *think* I'm already inserting my content in the
way you describe, but perhaps I am not. I should have been less generic in my
original message.

My script connects to a database to retrieve the current day's updates. It
uses these results to update the HTML page either with a table summarizing
the new data or a message indicating that no new records were added.

I am using HTML::Element->new() to create new 
elements and then using either push_content() or unshift_content() to 
insert the new content back into my tree object.

Here's my tree object:
# Is new_from_content different? It doesn't seem so from Perldoc, but I could be wrong.
my $root = HTML::TreeBuilder->new;

And a fragment within the tree:
my $content = $root->look_down('id', 'fmsbody');

Here's an example of new content being inserted:
    my $div_date = HTML::Element->new('div','class'=>'date');
    $content->unshift_content($div_date);

The table is being constructed from a couple of subroutines. One creates
the header row, the other cycles through the resultset to create the
data rows.

Here's the part where the table is created:
      my $scn7_table = HTML::Element->new('table', 'class'=>'fmtable');
      my @scn7_col_heads = qw(EX FC TYPE DESCRIPTION);
      my $scn7_table_head = create_heading_row(\@scn7_col_heads);
      $scn7_table->push_content($scn7_table_head);

And here's the subroutine that builds a detail row, which is the only place
the bad content will wind up:

sub create_detail_row {
    my $data = shift;
    my @recs = @$data;

    my $row = HTML::Element->new('tr');
    my $cell = HTML::Element->new('td', 'class'=>'fmdata');

    for (my $i=0; $i <= $#recs; $i++) {
        $row->push_content($cell->starttag().$recs[$i].$cell->endtag());
    }
    return $row;
}

It is called like this:
      # Create detail rows and insert into table
      while (my @recs = $sth7->fetchrow_array) {
          my $row = create_detail_row(\@recs);
          $scn7_table->push_content($row);
      }
      # Insert table into document tree
      $div_date->push_content($scn7_table);

Since the detail rows are where the problem lies, maybe that's the spot I
need to check?  The
$row->push_content($cell->starttag().$recs[$i].$cell->endtag());
part?  I was thinking that by the time I get to the print OUT
$root->as_HTML part, everything should be good to go, but alas that
is not the case.
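
Going by Rob's point about text strings, that line does look like the place
to check: starttag() and endtag() return plain strings, so the concatenation
hands push_content() one long piece of text rather than a <td> element. A
minimal sketch of a detail-row routine that adds real elements instead is
below. It assumes each record value should become the text of its own cell,
and the sample values are made up.

```perl
use strict;
use warnings;
use HTML::Element;

# Sketch of create_detail_row that adds real <td> elements instead of
# concatenated tag strings
sub create_detail_row {
    my ($recs) = @_;

    my $row = HTML::Element->new('tr');
    for my $value (@$recs) {
        # A fresh cell per value; the value itself goes in as a text node
        my $cell = HTML::Element->new('td', 'class' => 'fmdata');
        $cell->push_content($value);
        $row->push_content($cell);
    }
    return $row;
}

# Made-up sample record, including the troublesome '<B' code
my $row  = create_detail_row(['<B', '<3', 'TYPE', 'A & B']);
my $html = $row->as_HTML('<>&', '', {});
print $html, "\n";
```

With the cells built this way, as_HTML('<>&', '', {}) should encode only the
cell text (so '<B' comes out as &lt;B) and leave the <td> tags themselves
untouched, which is the combination approach II was missing.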

The problem data is a 2-char code that can be any combination of the
alphanumerics plus a handful of special characters. In my original message,
I used '<B' as an example, but that could also have been '<3' or '<%' or
'<I'.  I guess HTML rendering agents would be more prone to choking on
<B, <I, <A, etc., though.

Thanks again,
Webley
