Re: Regex(?)

John W. Krahn Sat, 26 Jul 2003 14:35:37 -0700

Douglass Franklin wrote:
> 
> I'm trying to transform this html table to a colon-delimited flat-file


Why colon separated?  What if one of the fields has a colon?
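
To make the concern concrete, here is a minimal sketch. The record is made up for illustration (its second field contains a colon), and the tab delimiter is only one possible alternative, not something the original page or script requires.

```perl
use warnings;
use strict;

# Hypothetical record: the second field contains a colon.
my @fields = ( 'ACCOUNTANT', 'Closes 5:00 PM', 'INDEFINITE' );

# Colon-joined: splitting it back yields one field too many.
my $colon_line = join ':', @fields;
my @colon_back = split /:/, $colon_line, -1;

# Collapse whitespace runs to a single space so that no field can
# contain a tab, then join on tabs instead.
my @clean = @fields;
s/\s+/ /g for @clean;
my $tab_line = join "\t", @clean;
my @tab_back = split /\t/, $tab_line, -1;

printf "colon: %d fields\n", scalar @colon_back;
printf "tab:   %d fields\n", scalar @tab_back;
```

The colon version comes back as 4 fields, the tab version as the original 3. If you really need colons, you would have to escape them inside the fields and unescape them when reading.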


> database.  This is what I have so far:
> 
> HTML:
> <tr><td class='bodyblack' width='50%'><a
> href='http://jsearch.usajobs.opm.gov/summary.asp?OPMControl=IC9516'
> class='jobrlist'><font size='2'>ACCOUNTANT
> </font></a></td><td class='bodyblack' width='40%'>$24,701.00
>  - $51,971.00
> </td><td class='bodyblack'>INDEFINITE</td></tr>
> <tr><td class='bodyblack'>CONTINENTAL U.S., US</td>
> </tr><td class='bodyblack' colspan='3'>&nbsp </td></tr>
> 
> Database Record (wanted):
> Accountant:$24,701.00 - $51,971.00:INDEFINITE:CONTINENTAL U.S., US
> 
> Regex I have:
> $jobrecord =~ ^(<tr>)(<td class='bodyblack' width='50%'>)(.+)(&nbsp
> </td></tr>)$
> 
> However, this doesn't seem to be working.  Please help.

Rather than a regex, this uses HTML::TokeParser.  It works on the attached page
from http://jsearch.usajobs.opm.gov/, which is what I tested it on.

#!/usr/bin/perl
use warnings;
use strict;

use HTML::TokeParser;

my $p = HTML::TokeParser->new( 'page1.html' ) or die "Cannot open page1.html: $!";

my @data;

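# Each table of interest starts with <center><table><tr><td><strong>;
# the nested ifs below check for that run of start tags one token at a time.
TABLE:
while ( my $token = $p->get_token() ) {
    my @table;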
    if ( $token->[ 0 ] eq 'S' and $token->[ 1 ] eq 'center' ) {
        $token = $p->get_token();
        if ( $token->[ 0 ] eq 'S' and $token->[ 1 ] eq 'table' ) {
            $token = $p->get_token();
            if ( $token->[ 0 ] eq 'S' and $token->[ 1 ] eq 'tr' ) {
                $token = $p->get_token();
                if ( $token->[ 0 ] eq 'S' and $token->[ 1 ] eq 'td' ) {
                    $token = $p->get_token();
                    if ( $token->[ 0 ] eq 'S' and $token->[ 1 ] eq 'strong' ) {
                        while ( $token = $p->get_token() ) {
                            push @table, $token->[ 1 ] if $token->[ 0 ] eq 'T';
                            if ( $token->[ 0 ] eq 'S' and $token->[ 1 ] eq 'center' ) {
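                                # Put <center> back so the outer loop sees the start of the next table.
                                $p->unget_token( $token );
                                # Strip &nbsp (with or without the semicolon) and trim each field.
                                s/&nbsp;?/ /g, s/^\s+//, s/\s+$// for @table;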
                                push @data, join ':', @table;
                                next TABLE;
                                }
                            }
                        }
                    }
                }
            }
        }
    }

print "$_\n" for @data;

__END__
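
The nested ifs above check for a fixed run of start tags (center, table, tr, td, strong). As a rough sketch only, the same check can be written as a loop over the expected tag names. The tokens here are built by hand and abbreviated to their first two elements, following the [ 'S', $tag, ... ] and [ 'T', $text, ... ] layout that HTML::TokeParser documents; starts_with_tags is a hypothetical helper, not part of the module or of the script above.

```perl
use warnings;
use strict;

# Hand-built, abbreviated tokens in the shape HTML::TokeParser returns:
# start tags are [ 'S', $tag, ... ] and text is [ 'T', $text, ... ].
my @tokens = (
    [ 'S', 'center' ], [ 'S', 'table' ],  [ 'S', 'tr' ],
    [ 'S', 'td' ],     [ 'S', 'strong' ], [ 'T', 'ACCOUNTANT' ],
    );

# The run of start tags the nested ifs test for, outermost first.
my @expected = qw( center table tr td strong );

# Hypothetical helper: true if the tokens beginning at index $start
# are exactly the expected start tags, in order.
sub starts_with_tags {
    my ( $tokens, $start, @tags ) = @_;
    for my $i ( 0 .. $#tags ) {
        my $token = $tokens->[ $start + $i ] or return 0;
        return 0 unless $token->[ 0 ] eq 'S' and $token->[ 1 ] eq $tags[ $i ];
        }
    return 1;
    }

my $at_0 = starts_with_tags( \@tokens, 0, @expected );
my $at_1 = starts_with_tags( \@tokens, 1, @expected );
print "match at 0: $at_0\n", "match at 1: $at_1\n";
```

This prints "match at 0: 1" and "match at 1: 0". It works on an array of tokens rather than pulling them from the parser one at a time, so it illustrates the test itself and is not a drop-in replacement for the loop in the script.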


HTH

John
-- 
use Perl;
program
fulfillment
