On 26/07/2011 21:12, Jeffrey Joh wrote:
> On 26 Jul 2011 16:48, Rob Dixon wrote:
>> On 25/07/2011 21:17, Jeffrey Joh wrote:
>>>
>>> Hello, I'm trying to parse HTML files. I want to extract values from
>>> tables (1) and from text fields (2). (1)<tr><td><img
>>> src="/image.gif" alt="" width="1" height="1" border="0"></td></tr>
>>> <tr>
>>> <td align="right" valign="top"><b>Floor plan:</b></td>
>>> <td>Ranch #1</td>
>>> </tr> (2)
>>> <input type="text" name="date_constructed" id="date_constructed" 
>>> value="04/01/2004" size="10" disabled>
>>>
>>> I would want to retrieve the floor plan (Ranch #1) and the date
>>> constructed (04/01/2004) from each HTML file (along with many
>>> other text boxes). What is an easy way of doing that? Jeff
>>
>> I am unclear what you want to do. The HTML fragments you have shown are
>> syntactically incorrect, and in any case are irrelevant out of the
>> context of a complete HTML document.
>>
>> However I think I can help a little. The HTML::TreeBuilder module will
>> build an HTML::Element object for you that you can navigate, modify, and
>> extract data from. It is very forgiving of incorrect syntax, and will
>> try to build a complete HTML document from any fragment that you offer it.
>>
>> The program below seems to do what you want, but without testing against
>> the complete data that you are dealing with I cannot vouch for its
>> correctness. In particular you should add checks to verify that the HTML
>> you are working with looks as you expect it to. I have written a couple
>> such checks, but only you can improve on those.
>>
>>
>> use strict;
>> use warnings;
>>
>> use HTML::TreeBuilder;
>>
>> my $tree = HTML::TreeBuilder->new_from_file(*DATA);
>>
>> print "Working from HTML:\n\n";
>> print $tree->as_HTML(undef, ' '), "\n\n";
>>
>> # Find an <input> element with an 'id' attribute of 'date_constructed'
>> # (there should be only one). The date required comes from the 'value'
>> # attribute of that element.
>> #
>> my $date_tr = $tree->look_down(
>> _tag => 'input',
>> id => 'date_constructed',
>> )
>> or die "No construction date";
>> my $plan_date = $date_tr->attr('value');
>>
>> # Now look up the tree to the containing <tr> element, and find its previous
>> # sibling <tr> which contains the floor plan text in the second <td> child
>> # element
>> #
>> my $plan_tr = $date_tr->look_up(_tag => 'tr')->left;
>> my @tds = $plan_tr->look_down(_tag => 'td');
>> die "Unexpected format" unless @tds == 2;
>>
>> my $plan_text = $tds[1]->as_trimmed_text;
>>
>> print "Plan found: $plan_text on $plan_date\n";
>>
>> __DATA__
>> <tr>
>> <td align="right" valign="top"><b>Floor plan:</b></td>
>> <td> Ranch #1 </td>
>> </tr>
>> <input type="text" name="date_constructed" id="date_constructed" 
>> value="04/01/2004" size="10" disabled>
>>
>> **OUTPUT**
>>
>> Plan found: Ranch #1 on 04/01/2004
>
> This is awesome! However, let's say I have an unknown number of 
> floorplans in a table that looks like this:
>
> <tr>
> <td align="right" valign="top"><b>Floor plan:</b></td>
> <td>Ranch #1</td>
> </tr>
> <tr><td><input type="text" name="date_constructed" id="date_constructed" 
> value="04/01/2004" size="10" disabled></td>
> <td><input type="text" name="ID" id="ID453" value="453" size="10"></td>
> </tr>
> <tr>
> <td align="right" valign="top"><b>Floor plan:</b></td>
> <td>Mission #3</td>
> </tr>
> <tr><td><input type="text" name="date_constructed" id="date_constructed" 
> value="08/01/2009" size="10" disabled></td>
> <td><input type="text" name="ID" id="ID986" value="986" size="10"></td>
> </tr>
> <tr>
> <td align="right" valign="top"><b>Floor plan:</b></td>
> <td>Big house #9</td>
> </tr>
> <tr><td><input type="text" name="date_constructed" id="date_constructed" 
> value="last summer" size="10" disabled></td>
> <td><input type="text" name="ID" id="ID354" value="354" size="10"></td>
> </tr>

Hi Jeff

Please bottom-post your replies here. It is the standard for the list,
and long and complex threads can quickly become incomprehensible if
posts are made at both ends of the quoted message. Thank you.

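To handle any number of plans, all you need to do is find all of the
<input> elements with an id attribute of 'date_constructed'. Called in
list context, look_down returns every matching element rather than just
the first. The plan name can be found from each of these as before.
Take a look at the program below.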

HTH,

Rob



use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file(*DATA);

print "Working from HTML:\n\n";
print $tree->as_HTML(undef, '  '), "\n\n";

# Find all <input> elements with an 'id' attribute of 'date_constructed'.
#
my @date_tr = $tree->look_down(
  _tag => 'input',
  id   => 'date_constructed',
)
or die "No construction dates";

# Look at each <input> element found, taking the date string from its 'value'
# attribute
#
for my $date_tr (@date_tr) {

  my $plan_date = $date_tr->attr('value');

  # Now look up the tree to the containing <tr> element, and find its previous
  # sibling <tr> which contains the floor plan text in the second <td> child
  # element
  #
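  my $date_row = $date_tr->look_up(_tag => 'tr')
      or die "Date field is not inside a table row";
  my $plan_tr = $date_row->left or die "No row before the date row";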
  my @tds = $plan_tr->look_down(_tag => 'td');
  die "Unexpected format" unless @tds == 2;

  my $plan_text = $tds[1]->as_trimmed_text;

  print "Plan found: $plan_text on $plan_date\n";
}

__DATA__
<tr>
 <td align="right" valign="top"><b>Floor plan:</b></td>
 <td>Ranch #1</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed" 
value="04/01/2004" size="10" disabled></td>
<td><input type="text" name="ID" id="ID453" value="453" size="10"></td>
 </tr>
<tr>
 <td align="right" valign="top"><b>Floor plan:</b></td>
 <td>Mission #3</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed" 
value="08/01/2009" size="10" disabled></td>
<td><input type="text" name="ID" id="ID986" value="986" size="10"></td>
 </tr>
<tr>
 <td align="right" valign="top"><b>Floor plan:</b></td>
 <td>Big house #9</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed" 
value="last summer" size="10" disabled></td>
<td><input type="text" name="ID" id="ID354" value="354" size="10"></td>
 </tr>

**OUTPUT**

Working from HTML:

<html>
  <head>
  </head>
  <body>
    <table>
      <tr>
        <td align="right" valign="top"><b>Floor plan:</b></td>
        <td>Ranch #1</td>
      </tr>
      <tr>
        <td><input disabled id="date_constructed" name="date_constructed" size="10" type="text" value="04/01/2004" /></td>
        <td><input id="ID453" name="ID" size="10" type="text" value="453" /></td>
      </tr>
      <tr>
        <td align="right" valign="top"><b>Floor plan:</b></td>
        <td>Mission #3</td>
      </tr>
      <tr>
        <td><input disabled id="date_constructed" name="date_constructed" size="10" type="text" value="08/01/2009" /></td>
        <td><input id="ID986" name="ID" size="10" type="text" value="986" /></td>
      </tr>
      <tr>
        <td align="right" valign="top"><b>Floor plan:</b></td>
        <td>Big house #9</td>
      </tr>
      <tr>
        <td><input disabled id="date_constructed" name="date_constructed" size="10" type="text" value="last summer" /></td>
        <td><input id="ID354" name="ID" size="10" type="text" value="354" /></td>
      </tr>
    </table>
  </body>
</html>

Plan found: Ranch #1 on 04/01/2004
Plan found: Mission #3 on 08/01/2009
Plan found: Big house #9 on last summer
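
Your table also has an ID field in the same row as each date, and one of
the dates ("last summer") is not really a date. The sketch below is one
possible way to collect the plan name, date and ID together and to flag
values that do not look like nn/nn/nnnn. It is untested against your real
files and makes a few assumptions of mine: the fields are matched by their
'name' attribute rather than 'id', the ID field is an <input name="ID"> in
the same <tr> as the date, and the names @plans, $date_row and
date_looks_ok are only illustrative. The sample HTML is a shortened copy
of the table from your message.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Sample markup: two of the plans from the table in this thread
my $html = <<'END_HTML';
<table>
<tr>
 <td align="right" valign="top"><b>Floor plan:</b></td>
 <td>Ranch #1</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed" value="04/01/2004" size="10" disabled></td>
<td><input type="text" name="ID" id="ID453" value="453" size="10"></td>
</tr>
<tr>
 <td align="right" valign="top"><b>Floor plan:</b></td>
 <td>Big house #9</td>
</tr>
<tr><td><input type="text" name="date_constructed" id="date_constructed" value="last summer" size="10" disabled></td>
<td><input type="text" name="ID" id="ID354" value="354" size="10"></td>
</tr>
</table>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my @plans;

for my $date_input ($tree->look_down(_tag => 'input', name => 'date_constructed')) {

  # The <tr> holding the date field, and the <tr> immediately before it,
  # which is expected to hold the plan name
  my $date_row = $date_input->look_up(_tag => 'tr')
      or die "Date field is not inside a table row";
  my $plan_row = $date_row->left;
  die "No row before the date row"
      unless ref $plan_row and $plan_row->tag eq 'tr';

  my @tds = $plan_row->look_down(_tag => 'td');
  die "Unexpected format" unless @tds == 2;

  # Assumption: the ID is an <input name="ID"> in the same row as the date
  my $id_input = $date_row->look_down(_tag => 'input', name => 'ID');

  my $date = $date_input->attr('value');

  push @plans, {
    plan => $tds[1]->as_trimmed_text,
    date => $date,
    id   => $id_input ? $id_input->attr('value') : undef,
    # Checks the shape nn/nn/nnnn only, not whether it is a real calendar date
    date_looks_ok => scalar($date =~ m{\A\d{2}/\d{2}/\d{4}\z}),
  };
}

$tree->delete;

for my $p (@plans) {
  my $id   = defined $p->{id} ? $p->{id} : 'unknown';
  my $note = $p->{date_looks_ok} ? '' : ' (does not look like a date)';
  print "Plan found: $p->{plan} (ID $id) on $p->{date}$note\n";
}
```

Keeping the results in an array of hashes means you can sort, filter or
write them out later instead of printing inside the loop. Whether to die
or just skip a row when the format is unexpected is up to you.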


