RE: Parse HTML

Jeffrey Joh Tue, 26 Jul 2011 13:13:22 -0700
Hey Rob,

This is awesome!  However, let's say I have an unknown number of
floor plans in a table that looks like this:

<tr> 
 <td align="right" valign="top"><b>Floor plan:</b></td> 
 <td>Ranch #1</td> 
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed" 
value="04/01/2004" size="10" disabled></td>
<td><input type="text" name="ID" id="ID453" value="453" size="10"></td>
 </tr>
<tr> 
 <td align="right" valign="top"><b>Floor plan:</b></td> 
 <td>Mission #3</td> 
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed" 
value="08/01/2009" size="10" disabled></td>
<td><input type="text" name="ID" id="ID986" value="986" size="10"></td>
 </tr>
<tr> 
 <td align="right" valign="top"><b>Floor plan:</b></td> 
 <td>Big house #9</td> 
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed" 
value="last summer" size="10" disabled></td>
<td><input type="text" name="ID" id="ID354" value="354" size="10"></td>
 </tr>

I would like to retrieve all of the plan/date/IDs, AND discard all those
plans that do not have a proper date_constructed, such as "last summer".
How could I do that?

Jeff

> Date: Tue, 26 Jul 2011 16:48:41 +0100
> From: rob.di...@gmx.com
> To: beginners@perl.org
> CC: johjeff...@hotmail.com
> Subject: Re: Parse HTML
> 
> On 25/07/2011 21:17, Jeffrey Joh wrote:
> > 
> > Hello, I'm trying to parse HTML files.  I want to extract values from
> > tables (1) and from text fields (2).
> >
> > (1)
> > <tr><td><img src="/image.gif" alt="" width="1" height="1" border="0"></td></tr>
> >
> > <tr>
> >   <td align="right" valign="top"><b>Floor plan:</b></td>
> >   <td>
> >     Ranch #1</td>
> > </tr>
> >
> > (2)
> > <input type="text" name="date_constructed" id="date_constructed"
> > value="04/01/2004" size="10" disabled>
> >
> > I would want to retrieve the floor plan (Ranch #1) and the date
> > constructed (04/01/2004) from each HTML file (along with many other
> > text boxes).  What is an easy way of doing that?
> >
> > Jeff
> 
> Hello Jeff
> 
> I am unclear what you want to do. The HTML fragments you have shown are
> syntactically incorrect, and in any case are irrelevant out of the
> context of a complete HTML document.
> 
> However I think I can help a little. The HTML::TreeBuilder module will
> build an HTML::Element object for you that you can navigate, modify, and
> extract data from. It is very forgiving of incorrect syntax, and will
> try to build a complete HTML document from any fragment that you offer it.
> 
> The program below seems to do what you want, but without testing against
> the complete data that you are dealing with I cannot vouch for its
> correctness. In particular you should add checks to verify that the HTML
> you are working with looks as you expect it to. I have written a couple
> of such checks, but only you can improve on those.
> 
> HTH,
> 
> Rob
> 
> 
> use strict;
> use warnings;
> 
> use HTML::TreeBuilder;
> 
> my $tree = HTML::TreeBuilder->new_from_file(*DATA);
> 
> print "Working from HTML:\n\n";
> print $tree->as_HTML(undef, '  '), "\n\n";
> 
> # Find an <input> element with an 'id' attribute of 'date_constructed'
> # (there should be only one). The date required comes from the 'value'
> # attribute of that element.
> #
> my $date_tr = $tree->look_down(
>   _tag => 'input',
>   id   => 'date_constructed',
> )
> or die "No construction date";
> my $plan_date = $date_tr->attr('value');
> 
> # Now look up the tree to the containing <tr> element, and find its previous
> # sibling <tr> which contains the floor plan text in the second <td> child
> # element
> #
> my $plan_tr = $date_tr->look_up(_tag => 'tr')->left;
> my @tds = $plan_tr->look_down(_tag => 'td');
> die "Unexpected format" unless @tds == 2;
> 
> my $plan_text = $tds[1]->as_trimmed_text;
> 
> print "Plan found: $plan_text on $plan_date\n";
> 
> __DATA__
> <tr>
>  <td align="right" valign="top"><b>Floor plan:</b></td>
>  <td>
>    Ranch #1  </td> 
> </tr>
> <input type="text" name="date_constructed" id="date_constructed" 
> value="04/01/2004" size="10" disabled>
> 
> **OUTPUT**
> 
> Working from HTML:
> 
> <html>
>   <head>
>   </head>
>   <body>
>     <table>
>       <tr>
>         <td align="right" valign="top"><b>Floor plan:</b></td>
>         <td> Ranch #1 </td>
>       </tr>
>       <tr>
>         <td><input disabled id="date_constructed" name="date_constructed" 
> size="10" type="text" value="04/01/2004" /></td>
>       </tr>
>     </table>
>   </body>
> </html>
> 
> Plan found: Ranch #1 on 04/01/2004
> 
> 
>
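
For the multi-plan table in the question at the top of this message, Rob's approach could be extended along the following lines. This is only a sketch, not a tested solution for the real files: it assumes every date field has name="date_constructed", that the ID field sits in the same <tr>, that the plan name is in the second <td> of the preceding <tr>, and that a "proper" date means MM/DD/YYYY. The extract_plans sub and the $DATE_RE pattern are hypothetical names introduced here for illustration.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Assumption: a "proper" date is MM/DD/YYYY. Adjust the pattern as needed.
my $DATE_RE = qr!^\d{2}/\d{2}/\d{4}\z!;

# Returns a list of [plan, date, id] array refs, one per valid floor plan.
sub extract_plans {
  my ($html) = @_;
  my $tree = HTML::TreeBuilder->new_from_content($html);
  my @plans;

  # In list context look_down returns every matching element.
  for my $date_input ($tree->look_down(_tag => 'input', name => 'date_constructed')) {

    # Skip plans whose date does not look like a date, e.g. "last summer".
    my $date = $date_input->attr('value');
    next unless defined $date and $date =~ $DATE_RE;

    # The ID field is expected in the same <tr> as the date field.
    my $data_tr  = $date_input->look_up(_tag => 'tr') or next;
    my $id_input = $data_tr->look_down(_tag => 'input', name => 'ID') or next;

    # The plan name is expected in the second <td> of the preceding <tr>.
    my $plan_tr = $data_tr->left;
    next unless ref $plan_tr;
    my @tds = $plan_tr->look_down(_tag => 'td');
    next unless @tds == 2;

    push @plans, [ $tds[1]->as_trimmed_text, $date, $id_input->attr('value') ];
  }

  $tree->delete;
  return @plans;
}

# Sample rows copied from the question.
my $html = <<'END_HTML';
<tr>
 <td align="right" valign="top"><b>Floor plan:</b></td>
 <td>Ranch #1</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed"
value="04/01/2004" size="10" disabled></td>
<td><input type="text" name="ID" id="ID453" value="453" size="10"></td>
</tr>
<tr>
 <td align="right" valign="top"><b>Floor plan:</b></td>
 <td>Mission #3</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed"
value="08/01/2009" size="10" disabled></td>
<td><input type="text" name="ID" id="ID986" value="986" size="10"></td>
</tr>
<tr>
 <td align="right" valign="top"><b>Floor plan:</b></td>
 <td>Big house #9</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed"
value="last summer" size="10" disabled></td>
<td><input type="text" name="ID" id="ID354" value="354" size="10"></td>
</tr>
END_HTML

printf "%s, %s, %s\n", @$_ for extract_plans($html);
```

With the sample rows above, this should keep Ranch #1 and Mission #3 and skip Big house #9. Rows that do not match the expected layout are skipped silently here; in real use it is probably better to warn or die, as Rob's program does, so that unexpected HTML is noticed.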