Siegfried Heintze wrote:
> I'm trying to screen scrape some information off the web.
> 
> I anticipate that I'll want to have it multi-threaded. 
> 
> As per Lincoln Stein's book, I'm using HTML::Parser and passing a function
> pointer (you can tell I'm a C programmer) to $parser->handler(start=>
> \&start, 'self,tagname, attr,text,skipped_text');
> 
> The problem is that I'm using a lot of non-local variables (what are they
> called, global?) in function start.
>
> As per Lincoln's example, start is a non-member function (not a method).
> It's just a stand alone function.
> 
> I wish I could pass some parameters to my start function. I want each thread
> to have its own copy of those global variables.
>

The problem is probably here, and maybe in your logic. Because 'start' is
an event handler, you don't get to dictate what is or isn't passed to it.
The parsing engine does, so it has to provide a facility through which
you can pass your values. To my knowledge, HTML::Parser doesn't provide
such a mechanism for arbitrary data of your own, beyond handing you the
parser object itself ('self'). Having said that, you should be able to
use globals in your start handler without any problem, so you shouldn't
have to pass them.
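
One way around the fixed argument list, sketched here rather than taken
from Lincoln's book, is to make the handler a closure over lexical
variables, so each parser carries its own state and no globals are
needed. The make_parser() helper, the %state hash and the sample HTML
are all made up for illustration; only new(), handler(), parse() and
eof() are HTML::Parser calls.

```perl
use strict;
use warnings;
use HTML::Parser;

# make_parser() is a hypothetical helper. Every call creates a fresh
# %state and a handler that closes over it, so two parsers never share data.
sub make_parser {
    my ($label) = @_;
    my %state = ( label => $label, links => [] );

    my $parser = HTML::Parser->new( api_version => 3 );
    $parser->handler(
        start => sub {
            my ( $tagname, $attr ) = @_;
            # Collect href values of <a> tags into this parser's own state.
            push @{ $state{links} }, $attr->{href}
                if $tagname eq 'a' && defined $attr->{href};
        },
        'tagname, attr'
    );
    return ( $parser, \%state );
}

my ( $p1, $s1 ) = make_parser('first');
my ( $p2, $s2 ) = make_parser('second');

$p1->parse('<a href="/one">1</a>');
$p1->eof;
$p2->parse('<a href="/two">2</a><a href="/three">3</a>');
$p2->eof;

print "$s1->{label}: @{ $s1->{links} }\n";
print "$s2->{label}: @{ $s2->{links} }\n";
```

If each thread builds its own parser this way, the state travels with
the handler instead of living in package variables.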

From the sounds of it I may not be understanding your setup, but you
should be able to pass the globals to a thread and then localize them
from there.  Have you actually set up the threading?

> The documentation at http://search.cpan.org/~gaas/HTML-Parser-3.45/Parser.pm
> says   
> 
> 
> "$p->handler(start =>  "start", 'self, attr, attrseq, text' );
> 
> This causes the "start" method of object $p to be called for 'start' events.
> The callback signature is $p->start(\%attr, \@attr_seq, $text)."
> 
> OK, that is news to me. Lincoln's example does not define start as a member
> function (method, I guess is the proper name). 
>

It can be set up either way. You can make it a normal subroutine and pass
a reference to the sub, or you can make it a method of an object that
subclasses the Parser. I believe the documentation is showing that both
will work. There is a third option too: if I recall correctly, the
handler can be an array reference, in which case the argspec values for
each event are pushed onto that array.
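
For what it's worth, here is a small sketch of the array-reference form
as I understand it from the HTML::Parser documentation: each event
pushes one array reference holding the argspec values. The @events
name and the sample HTML are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

my @events;    # hypothetical accumulator, one entry per start tag

my $parser = HTML::Parser->new( api_version => 3 );

# With an array reference as the handler, no code runs per event;
# the values named in the argspec are pushed onto @events instead.
$parser->handler( start => \@events, 'tagname, attr' );

$parser->parse('<p><img src="a.png"><img src="b.png"></p>');
$parser->eof;

# Walk the accumulated events after parsing has finished.
for my $event (@events) {
    my ( $tagname, $attr ) = @$event;
    print "$tagname $attr->{src}\n" if $tagname eq 'img';
}
```

Since the array is an ordinary lexical, each thread can own one without
touching any globals.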

> So if I could define start as a method, that would solve my problem. How do
> I do that? Do I have to inherit from HTML::Parser? Anyone got an example?
>

You would have to inherit from HTML::Parser; then, when you create your
subclassed parser, you provide a method on it that is called when start
events occur. I think you need to stop thinking of 'start' as a
function; it is really an 'event'.  So think about the corresponding
value as an 'event handler' and it should be easier to wrap your head
around.

For examples, check out HTML::TokeParser and HTML::PullParser; both
are subclasses of HTML::Parser.
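
A minimal sketch of the subclass route might look like the following.
The My::LinkParser package and the my_label / my_links keys are
hypothetical names, and it assumes (as HTML::LinkExtor does) that the
parser object is a hash you can keep your own data in.

```perl
package My::LinkParser;    # hypothetical class name
use strict;
use warnings;
use parent 'HTML::Parser';

sub new {
    my ( $class, %args ) = @_;
    my $self = $class->SUPER::new( api_version => 3 );

    # Per-object state lives in the parser hash, not in globals.
    $self->{my_label} = $args{label};
    $self->{my_links} = [];

    # A method *name* as handler, with 'self' first in the argspec,
    # makes the parser call $self->start(...) for start events.
    $self->handler( start => 'start', 'self, tagname, attr' );
    return $self;
}

sub start {
    my ( $self, $tagname, $attr ) = @_;
    push @{ $self->{my_links} }, $attr->{href}
        if $tagname eq 'a' && defined $attr->{href};
}

package main;

my $p = My::LinkParser->new( label => 'demo' );
$p->parse('<a href="/x">x</a><a name="n">n</a>');
$p->eof;
print "$p->{my_label}: @{ $p->{my_links} }\n";
```

Each object has its own my_links list, so one parser per thread gives
each thread its own copy of the data.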

I only have an example of the subroutine manner, which looks like:

--UNTESTED--

    use File::Basename;    # provides basename()

    # $label, @parsed and $Scratch come from the surrounding
    # application code (not shown).
    $parser->handler( start =>
    sub {
        my ($tagname, $attrs, $attrseq, $text) = @_;

        if ($tagname eq 'img') {
            # Rebuild the tag, pointing 'src' at the local image path.
            my $replaced = '<img';
            foreach my $attr (@$attrseq) {
                if ($attr eq 'src') {
                    my $src = $attrs->{$attr};
                    my $name = basename($src);

                    push @{$Scratch->{'parsed_html'}->{'image_list'}}, $name;

                    $attrs->{$attr} = "/images/lp/$label/$name";
                }
                $replaced .= " $attr=\"$attrs->{$attr}\"";
            }
            $replaced .= ' />';
            push @parsed, $replaced;
            # Scalar assignment stores the number of images seen so far.
            $Scratch->{'parsed_html_image_count'} =
                @{$Scratch->{'parsed_html'}->{'image_list'}};
        }
        elsif ($tagname eq 'area') {
            # Rebuild the tag, redirecting an 'rsvp' href to the form.
            my $replaced = '<area';
            foreach my $attr (@$attrseq) {
                if ($attr eq 'href') {
                    if (lc $attrs->{$attr} eq 'rsvp') {
                        $attrs->{$attr} = "/ic/lp/$label/res_form";
                    }
                }
                $replaced .= " $attr=\"$attrs->{$attr}\"";
            }
            $replaced .= ' />';
            push @parsed, $replaced;
        }
        else {
            # Every other start tag passes through unchanged.
            push @parsed, $text;
        }
    }, 'tagname, attr, attrseq, text' );

This is pretty specific to my application but should be a decent example
of how to manipulate the URL in the 'src' attribute of an 'img' tag, and
the 'href' attribute of an 'area' tag.

Ignore the 'Scratch' stuff.

Good luck,

http://danconia.org

> Thanks,
> Siegfried
> 
> 
