Ya know, I'm sure there's a place for all of these. However, Web::Scraper
works great with the XPath expressions that browser element inspectors
return. It's really easy to use, and you can return whatever data structure
suits your output best, e.g. a hash keyed by field name for each table
element, ready to hand to DBIx::Class (DBIC).
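
For the multi-plan table in the question quoted below, here is a minimal
sketch that sticks with the HTML::TreeBuilder approach from Rob's message
rather than Web::Scraper. It is untested against the real pages. The helper
name extract_plans, the sample HTML, and the MM/DD/YYYY rule for a "proper"
date are all assumptions made for illustration; the pattern checks only the
shape of the date, not whether it is a real calendar date.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper (not from the thread): returns one hash ref per
# floor plan whose date_constructed has the shape MM/DD/YYYY.
sub extract_plans {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @plans;

    # In list context look_down returns every matching element
    for my $date_input ($tree->look_down(_tag => 'input', name => 'date_constructed')) {
        my $date = $date_input->attr('value');
        $date = '' unless defined $date;

        # Assumption: anything that is not MM/DD/YYYY is discarded
        next unless $date =~ m!\A\d\d/\d\d/\d{4}\z!;

        # The <tr> holding the inputs, then its nearest preceding
        # sibling element, which should be the "Floor plan:" row
        my $data_tr = $date_input->look_up(_tag => 'tr') or next;
        my $plan_tr = (grep { ref $_ } $data_tr->left)[-1];
        next unless $plan_tr and $plan_tr->tag eq 'tr';

        my @tds = $plan_tr->look_down(_tag => 'td');
        next unless @tds == 2;

        my $id_input = $data_tr->look_down(_tag => 'input', name => 'ID');

        push @plans, {
            plan => $tds[1]->as_trimmed_text,
            date => $date,
            id   => $id_input ? $id_input->attr('value') : undef,
        };
    }

    $tree->delete;    # free the tree explicitly
    return @plans;
}

# Sample data modelled on the table in the question
my $html = <<'HTML';
<table>
<tr><td align="right" valign="top"><b>Floor plan:</b></td><td>Ranch #1</td></tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed" value="04/01/2004" size="10" disabled></td>
<td><input type="text" name="ID" id="ID453" value="453" size="10"></td></tr>
<tr><td align="right" valign="top"><b>Floor plan:</b></td><td>Mission #3</td></tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed" value="08/01/2009" size="10" disabled></td>
<td><input type="text" name="ID" id="ID986" value="986" size="10"></td></tr>
<tr><td align="right" valign="top"><b>Floor plan:</b></td><td>Big house #9</td></tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed" value="last summer" size="10" disabled></td>
<td><input type="text" name="ID" id="ID354" value="354" size="10"></td></tr>
</table>
HTML

for my $p (extract_plans($html)) {
    my $id = defined $p->{id} ? $p->{id} : 'unknown';
    print "Plan $p->{plan} (ID $id) built on $p->{date}\n";
}
```

Each returned hash has the keys plan, date and id, so it could be passed
more or less directly to a DBIC create call. The "Big house #9" entry is
skipped because "last summer" does not match the date pattern.
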
On Jul 26, 2011 4:15 PM, "Jeffrey Joh" <johjeff...@hotmail.com> wrote:
>
> Hey Rob,
>
> This is awesome! However, let's say I have an unknown number of
> floorplans in a table that looks like this:
>
> <tr>
> <td align="right" valign="top"><b>Floor plan:</b></td>
> <td>Ranch #1</td>
> </tr>
> <tr><td><input type="text" name="date_constructed" id="date_constructed" value="04/01/2004" size="10" disabled></td>
> <td><input type="text" name="ID" id="ID453" value="453" size="10"></td>
> </tr>
> <tr>
> <td align="right" valign="top"><b>Floor plan:</b></td>
> <td>Mission #3</td>
> </tr>
> <tr><td><input type="text" name="date_constructed" id="date_constructed" value="08/01/2009" size="10" disabled></td>
> <td><input type="text" name="ID" id="ID986" value="986" size="10"></td>
> </tr>
> <tr>
> <td align="right" valign="top"><b>Floor plan:</b></td>
> <td>Big house #9</td>
> </tr>
> <tr><td><input type="text" name="date_constructed" id="date_constructed" value="last summer" size="10" disabled></td>
> <td><input type="text" name="ID" id="ID354" value="354" size="10"></td>
> </tr>
>
> I would like to retrieve all of the plan/date/IDs, AND discard all
> those plans that do not have a proper date_constructed, such as "last
> summer". How could I do that?
>
> Jeff
>
>> Date: Tue, 26 Jul 2011 16:48:41 +0100
>> From: rob.di...@gmx.com
>> To: beginners@perl.org
>> CC: johjeff...@hotmail.com
>> Subject: Re: Parse HTML
>>
>> On 25/07/2011 21:17, Jeffrey Joh wrote:
>> >
>> > Hello, I'm trying to parse HTML files. I want to extract values from
>> > tables (1) and from text fields (2).
>> >
>> > (1)
>> > <tr><td><img src="/image.gif" alt="" width="1" height="1" border="0"></td></tr>
>> >
>> > <tr>
>> > <td align="right" valign="top"><b>Floor plan:</b></td>
>> > <td>
>> > Ranch #1</td>
>> > </tr>
>> >
>> > (2)
>> > <input type="text" name="date_constructed" id="date_constructed" value="04/01/2004" size="10" disabled>
>> >
>> > I would want to retrieve the floor plan (Ranch #1) and the date
>> > constructed (04/01/2004) from each HTML file (along with many other
>> > text boxes). What is an easy way of doing that?
>> >
>> > Jeff
>>
>> Hello Jeff
>>
>> I am unclear what you want to do. The HTML fragments you have shown are
>> syntactically incorrect, and in any case are irrelevant out of the
>> context of a complete HTML document.
>>
>> However I think I can help a little. The HTML::TreeBuilder module will
>> build an HTML::Element object for you that you can navigate, modify, and
>> extract data from. It is very forgiving of incorrect syntax, and will
>> try to build a complete HTML document from any fragment that you offer it.
>>
>> The program below seems to do what you want, but without testing against
>> the complete data that you are dealing with I cannot vouch for its
>> correctness. In particular you should add checks to verify that the HTML
>> you are working with looks as you expect it to. I have written a couple
>> such checks, but only you can improve on those.
>>
>> HTH,
>>
>> Rob
>>
>>
>> use strict;
>> use warnings;
>>
>> use HTML::TreeBuilder;
>>
>> my $tree = HTML::TreeBuilder->new_from_file(*DATA);
>>
>> print "Working from HTML:\n\n";
>> print $tree->as_HTML(undef, ' '), "\n\n";
>>
>> # Find an <input> element with an 'id' attribute of 'date_constructed'
>> # (there should be only one). The date required comes from the 'value'
>> # attribute of that element.
>> #
>> my $date_tr = $tree->look_down(
>>     _tag => 'input',
>>     id   => 'date_constructed',
>> )
>>     or die "No construction date";
>> my $plan_date = $date_tr->attr('value');
>>
>> # Now look up the tree to the containing <tr> element, and find its
>> # previous sibling <tr>, which contains the floor plan text in the
>> # second <td> child element
>> #
>> my $plan_tr = $date_tr->look_up(_tag => 'tr')->left;
>> my @tds = $plan_tr->look_down(_tag => 'td');
>> die "Unexpected format" unless @tds == 2;
>>
>> my $plan_text = $tds[1]->as_trimmed_text;
>>
>> print "Plan found: $plan_text on $plan_date\n";
>>
>> __DATA__
>> <tr>
>> <td align="right" valign="top"><b>Floor plan:</b></td>
>> <td>
>> Ranch #1 </td>
>> </tr>
>> <input type="text" name="date_constructed" id="date_constructed" value="04/01/2004" size="10" disabled>
>>
>> **OUTPUT**
>>
>> Working from HTML:
>>
>> <html>
>> <head>
>> </head>
>> <body>
>> <table>
>> <tr>
>> <td align="right" valign="top"><b>Floor plan:</b></td>
>> <td> Ranch #1 </td>
>> </tr>
>> <tr>
>> <td><input disabled id="date_constructed" name="date_constructed" size="10" type="text" value="04/01/2004" /></td>
>> </tr>
>> </table>
>> </body>
>> </html>
>>
>> Plan found: Ranch #1 on 04/01/2004
>>
>> Tool completed successfully
>>
>>
>
