Hey Rob,

This is awesome! However, let's say I have an unknown number of floor
plans in a table that looks like this:

<tr>
<td align="right" valign="top"><b>Floor plan:</b></td>
<td>Ranch #1</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed"
value="04/01/2004" size="10" disabled></td>
<td><input type="text" name="ID" id="ID453" value="453" size="10"></td>
</tr>
<tr>
<td align="right" valign="top"><b>Floor plan:</b></td>
<td>Mission #3</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed"
value="08/01/2009" size="10" disabled></td>
<td><input type="text" name="ID" id="ID986" value="986" size="10"></td>
</tr>
<tr>
<td align="right" valign="top"><b>Floor plan:</b></td>
<td>Big house #9</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed"
value="last summer" size="10" disabled></td>
<td><input type="text" name="ID" id="ID354" value="354" size="10"></td>
</tr>

I would like to retrieve all of the plan/date/ID values, and discard any
plans whose date_constructed is not a proper date (for example, "last
summer"). How could I do that?

Jeff

> Date: Tue, 26 Jul 2011 16:48:41 +0100
> From: rob.di...@gmx.com
> To: beginners@perl.org
> CC: johjeff...@hotmail.com
> Subject: Re: Parse HTML
>
> On 25/07/2011 21:17, Jeffrey Joh wrote:
> >
> > Hello, I'm trying to parse HTML files. I want to extract values from
> > tables (1) and from text fields (2).
> >
> > (1)
> > <tr><td><img src="/image.gif" alt="" width="1" height="1" border="0"></td></tr>
> >
> > <tr>
> > <td align="right" valign="top"><b>Floor plan:</b></td>
> > <td>
> > Ranch #1</td>
> > </tr>
> >
> > (2)
> > <input type="text" name="date_constructed" id="date_constructed"
> > value="04/01/2004" size="10" disabled>
> >
> > I want to retrieve the floor plan (Ranch #1) and the date constructed
> > (04/01/2004) from each HTML file (along with many other text boxes).
> > What is an easy way of doing that?
> >
> > Jeff
>
> Hello Jeff
>
> I am unclear what you want to do. The HTML fragments you have shown are
> syntactically incorrect, and in any case are irrelevant out of the
> context of a complete HTML document.
>
> However I think I can help a little. The HTML::TreeBuilder module will
> build an HTML::Element object for you that you can navigate, modify, and
> extract data from. It is very forgiving of incorrect syntax, and will
> try to build a complete HTML document from any fragment that you offer it.
>
> The program below seems to do what you want, but without testing against
> the complete data that you are dealing with I cannot vouch for its
> correctness. In particular you should add checks to verify that the HTML
> you are working with looks as you expect it to. I have written a couple
> of such checks, but only you can improve on them.
>
> HTH,
>
> Rob
>
>
> use strict;
> use warnings;
>
> use HTML::TreeBuilder;
>
> my $tree = HTML::TreeBuilder->new_from_file(*DATA);
>
> print "Working from HTML:\n\n";
> print $tree->as_HTML(undef, ' '), "\n\n";
>
> # Find an <input> element with an 'id' attribute of 'date_constructed'
> # (there should be only one). The date required comes from the 'value'
> # attribute of that element.
> #
> my $date_input = $tree->look_down(
>     _tag => 'input',
>     id   => 'date_constructed',
> )
>     or die "No construction date";
> my $plan_date = $date_input->attr('value');
>
> # Now look up the tree to the containing <tr> element, and find its previous
> # sibling <tr> which contains the floor plan text in the second <td> child
> # element
> #
> my $plan_tr = $date_input->look_up(_tag => 'tr')->left
>     or die "No floor plan row";
> my @tds = $plan_tr->look_down(_tag => 'td');
> die "Unexpected format" unless @tds == 2;
>
> my $plan_text = $tds[1]->as_trimmed_text;
>
> print "Plan found: $plan_text on $plan_date\n";
>
> __DATA__
> <tr>
> <td align="right" valign="top"><b>Floor plan:</b></td>
> <td>
> Ranch #1 </td>
> </tr>
> <input type="text" name="date_constructed" id="date_constructed"
> value="04/01/2004" size="10" disabled>
>
> **OUTPUT**
>
> Working from HTML:
>
> <html>
> <head>
> </head>
> <body>
> <table>
> <tr>
> <td align="right" valign="top"><b>Floor plan:</b></td>
> <td> Ranch #1 </td>
> </tr>
> <tr>
> <td><input disabled id="date_constructed" name="date_constructed"
> size="10" type="text" value="04/01/2004" /></td>
> </tr>
> </table>
> </body>
> </html>
>
> Plan found: Ranch #1 on 04/01/2004
>
>
>
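One way the quoted program might be extended to the multi-plan table at
the top of this message is sketched below. It is untested against real
data and rests on several assumptions: a "proper" date is taken to mean
MM/DD/YYYY with a plausible month and day; the inputs are matched on
their 'name' attribute, because the 'id' values repeat or vary from row
to row; each date row is assumed to follow its floor plan row directly;
and the rows are wrapped in a <table> element here for completeness.
The names is_proper_date and extract_plans are hypothetical helpers,
not part of HTML::TreeBuilder.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Assumption: a "proper" date looks like MM/DD/YYYY with a plausible
# month and day. Adjust the pattern if the real data differs.
sub is_proper_date {
    my ($text) = @_;
    return 0 unless defined $text;
    my ($month, $day) = $text =~ m{\A(\d{2})/(\d{2})/\d{4}\z} or return 0;
    return ($month >= 1 && $month <= 12 && $day >= 1 && $day <= 31) ? 1 : 0;
}

# Hypothetical helper: returns one hash ref per plan with a proper date.
sub extract_plans {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @plans;

    # In list context, look_down returns every matching element.
    for my $date_input (
        $tree->look_down(_tag => 'input', name => 'date_constructed')
    ) {
        my $date = $date_input->attr('value');
        next unless is_proper_date($date);

        # The row holding the date, and the row just before it.
        my $date_tr = $date_input->look_up(_tag => 'tr') or next;
        my $plan_tr = $date_tr->left;
        next unless ref $plan_tr && $plan_tr->tag eq 'tr';

        # The plan name is expected in the second <td> of the plan row.
        my @tds = $plan_tr->look_down(_tag => 'td');
        next unless @tds == 2;

        # The ID field shares a row with the date field.
        my $id_input = $date_tr->look_down(_tag => 'input', name => 'ID')
            or next;

        push @plans, {
            plan => $tds[1]->as_trimmed_text,
            date => $date,
            id   => $id_input->attr('value'),
        };
    }

    $tree->delete;
    return @plans;
}

my $html = <<'END_HTML';
<table>
<tr>
<td align="right" valign="top"><b>Floor plan:</b></td>
<td>Ranch #1</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed"
value="04/01/2004" size="10" disabled></td>
<td><input type="text" name="ID" id="ID453" value="453" size="10"></td>
</tr>
<tr>
<td align="right" valign="top"><b>Floor plan:</b></td>
<td>Mission #3</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed"
value="08/01/2009" size="10" disabled></td>
<td><input type="text" name="ID" id="ID986" value="986" size="10"></td>
</tr>
<tr>
<td align="right" valign="top"><b>Floor plan:</b></td>
<td>Big house #9</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed"
value="last summer" size="10" disabled></td>
<td><input type="text" name="ID" id="ID354" value="354" size="10"></td>
</tr>
</table>
END_HTML

for my $plan (extract_plans($html)) {
    print "Plan found: $plan->{plan} on $plan->{date} (ID $plan->{id})\n";
}
```

With the sample table, "Big house #9" is skipped because "last summer"
fails the date check. Rows that do not match the expected layout are
skipped silently in this sketch; replacing those 'next' statements with
'die' or 'warn' would surface unexpected HTML instead of hiding it.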