Re: Html::tokeparser::simple

drieux Wed, 26 Nov 2003 17:51:24 -0800

On Wednesday, Nov 26, 2003, at 12:30 US/Pacific, Paul Kraus wrote:

Someone want to show me how this module can help parse out HTML?
I want to grab the text between <td>text</td> and be able to apply a regexp to
get what I want.
The problem is that my text is among 10,000 td tags, with the only difference being what the <th> tag above has in it.

So if th tag = then store text between <td> into an array.

my first concern here is: did you mean <th> or <tr>?

a simple table would look like:
        <table>
                <tr>
                        <th>header1</th>
                        <th>header2</th>
                        <th>header3</th>
                </tr>
                <tr>
                        <td>_Row_1_Cell_1_</td>
                        <td>_Row_1_Cell_2_</td>
                        <td>_Row_1_Cell_3_</td>
                </tr>
                <tr>
                        <td>_Row_2_Cell_1_</td>
                        <td>_Row_2_Cell_2_</td>
                        <td>_Row_2_Cell_3_</td>
                </tr>
                <tr>
                        <td>_Row_3_Cell_1_</td>
                        <td>_Row_3_Cell_2_</td>
                        <td>_Row_3_Cell_3_</td>
                </tr>
        </table>

You have almost written your algorithm:

        # $p is an HTML::TokeParser::Simple object for the page
        while( my $token = $p->get_token)
        {
                last if ($token->is_start_tag('table'));
        }

        # there is a Table opening Tag, our hope now is that
        # we can get our Keys from the headers

        my $count = 0;
        my $header = {};

        while( my $token = $p->get_token)
        {
                next if ($token->is_start_tag( qr/^t[rhd]$/)); # don't care
                last if ($token->is_end_tag('/tr'));  # finished with headers
                # header cells are normally <th>, but allow <td> as well
                if ($token->is_end_tag('/th') || $token->is_end_tag('/td'))
                {
                        $count++;
                        next;
                }
                if ( $token->is_text())
                {
                        my $text = $token->as_is();
                        $header->{$count} = $text
                                if ( $text =~ /<some_pattern>/ ); # your regexp here
                }
        }

        #
        # read the first row of headers, now to meander forward
        #
At this point we know that, with $count reset to 0 at the start of
each data row and bumped at each </td>, IF

        if(defined($header->{$count}))
                this is a column we have to grot data from
                into the storage set up

and that would be basically like the way that we
grotted out the header sections, which is left as
an exercise for the reader.
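
For anyone who wants a starting point on that exercise, here is one way the two passes might be folded into a single loop. It is only a sketch under some assumptions: the sub name cells_under_headers, the $wanted regexp, and the inline sample HTML are all made up for illustration, the table is assumed not to be nested, and every cell is assumed to have an explicit closing tag. It uses only get_token, is_start_tag, is_end_tag, is_text and as_is from HTML::TokeParser::Simple.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical sample input; real HTML would come from a file or a fetch.
my $html = <<'HTML';
<table>
  <tr><th>header1</th><th>header2</th><th>header3</th></tr>
  <tr><td>_Row_1_Cell_1_</td><td>_Row_1_Cell_2_</td><td>_Row_1_Cell_3_</td></tr>
  <tr><td>_Row_2_Cell_1_</td><td>_Row_2_Cell_2_</td><td>_Row_2_Cell_3_</td></tr>
</table>
HTML

# Returns a hash ref: header text => array ref of <td> text in that column.
# $wanted is a regexp that picks the headers we care about.
sub cells_under_headers {
    my ( $html, $wanted ) = @_;
    my $p = HTML::TokeParser::Simple->new( \$html );

    my ( %header, %data );    # column index => header text; header => cells
    my $col  = 0;             # column position within the current row
    my $in   = '';            # 'th' or 'td' while inside a cell, else ''
    my $text = '';            # text gathered for the current cell

    while ( my $token = $p->get_token ) {
        if ( $token->is_start_tag('tr') ) {
            $col = 0;         # new row, start counting columns again
        }
        elsif ( $token->is_start_tag('th') || $token->is_start_tag('td') ) {
            $in   = $token->is_start_tag('th') ? 'th' : 'td';
            $text = '';
        }
        elsif ( $token->is_text && $in ) {
            $text .= $token->as_is;    # raw text, entities not decoded
        }
        elsif ( $token->is_end_tag('th') || $token->is_end_tag('td') ) {
            if ( $in eq 'th' ) {
                # remember this column only if its header matches
                $header{$col} = $text if $text =~ $wanted;
            }
            elsif ( $in eq 'td' && defined $header{$col} ) {
                push @{ $data{ $header{$col} } }, $text;
            }
            $in = '';
            $col++;
        }
    }
    return \%data;
}

my $data = cells_under_headers( $html, qr/header2/ );
print "$_\n" for @{ $data->{header2} || [] };
```

With the sample table above this should collect the two cells from the second column. The collected text can then be run through whatever further regexp is needed, one array element at a time.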

CAVEAT: simply because it looks like Perl,
does not mean that I have written Perl, or that
the code will actually work. It is merely a demonstration
in algorithm creation.

ciao
drieux
