
At 9:57 AM -0400 4/5/01, Van Schooenderwoert, Nancy wrote:
>I'd be interested in seeing how to do the below... can you post a code
>snippet to the list, or is it on a webpage? (I've been away - am catching up
>on prev. PerlMonger mail).

I'm writing a screen-scraper right now that takes pages off of one 
web site and puts them on another (two sites at the same company, 
don't ask).

Anyway, someone asked me out of the blue how to write a crawler (and I 
do mean out of the blue--it was a random email from someone to 
[EMAIL PROTECTED]), and I sent them this.  I have no idea if 
it'll help you, but....

When I'm done with it I plan on cleaning it up a bit and releasing 
it.  The way I have it set up, you only have to write a few dozen 
lines to be able to get data from a new web page--but right now it's 
all specific to the client's web site.

At 3:22 AM +0100 3/30/01, Majid Ali wrote:
>i need to create a simple web crawler in perl and i am finding it difficult
>what steps need to be taken when creating one ..i dont have much programming
>background...will i be able to download one off the internet
>
>
>thanks

Well, I have no idea why you are asking [EMAIL PROTECTED], but....

Take a look at LWP::UserAgent and HTML::Parser.

In particular:

         use LWP::UserAgent;
         use HTTP::Cookies;
         use HTTP::Request;
         use URI::Escape;

         $UA = new LWP::UserAgent;
         $Cookies = new HTTP::Cookies;
         $UA->cookie_jar($Cookies);
         $UA->timeout(60);
         $UA->agent("Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)");

         # Build an application/x-www-form-urlencoded body from the
         # login form fields.
         foreach $key (keys %LoginData) {
             $content .= uri_escape($key) . '=' .
                         uri_escape($LoginData{$key}) . '&';
         }
         chop $content;    # drop the trailing '&'

         $req = new HTTP::Request POST => $LoginURL;
         $req->content($content);
         $req->content_type('application/x-www-form-urlencoded');
         $req->content_length(length($content));

         $res = $UA->request($req);
         if (!$res->is_redirect) {
             die("$url: Login Failed\n" . $res->error_as_HTML());
         }
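
As an aside, that uri_escape loop can be written as a standalone 
helper.  Here's a minimal sketch--the form_body name and the login 
fields are made up for illustration, and I sort the keys only so the 
output is deterministic:

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Build an application/x-www-form-urlencoded body from a hash of
# form fields, like the foreach/chop loop above.
sub form_body {
    my (%fields) = @_;
    return join '&',
        map { uri_escape($_) . '=' . uri_escape($fields{$_}) }
        sort keys %fields;
}

# Hypothetical login fields, just for illustration.
my $content = form_body(user => 'majid', pass => 'top secret');
print "$content\n";    # pass=top%20secret&user=majid
```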

     $req = new HTTP::Request GET => $url;
     $req->referer($DefaultURL);
     while (($head, $value) = each(%headers)) {
         next if ($AlwaysFetch && lc($head) eq 'if-modified-since');
         $req->header($head, $value);
     }
     $res = $UA->request($req);
     if ($res->is_success) {
         if (wantarray()) {
             return (IO::Scalar->new_tie($res->content_ref), $res->base());
         } else {
             return IO::Scalar->new_tie($res->content_ref);
         }
     }

You won't want all of that, but it gives you an idea of how to 
fetch the page.  The second part is to parse it:

For that, create a subclass of HTML::Parser:

use HTML::Parser 3 ();
@ISA = qw(HTML::Parser);

and define some handlers.  For crawling, all you really need is the 
start handler:

sub new {
     my $pkg = shift;
     my $this = $pkg->SUPER::new(api_version => 3, strict_comment => 0,
                                 unbroken_text => 1);
     $this->handler('start'   => 'start_h',   'self, tagname, attr, attrseq, text');
     $this->handler('end'     => 'end_h',     'self, tagname, text');
     $this->handler('text'    => 'text_h',    'self, text');
     $this->handler('comment' => 'comment_h', 'self, text');
     $this->handler('default' => 'default_h', 'self, tagname, text');
     return $this;
}

The start handler picks up its arguments and then has some code like this:

     } elsif ($tagname eq 'a' && $attr->{href} &&
              substr($attr->{href}, 0, 1) ne '#') {
         my $uri = URI->new_abs(lc($attr->{href}), $this->{urldir});
         my $href = $uri->canonical();
         my $base = quotemeta($this->{urldir});
         my $gotmatch;

         # we have an href, and it's not a self reference.
         #!! Note: currently we don't catch the case of a fully
         #   qualified self reference
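
To make that concrete, here's a minimal, self-contained sketch of the 
same idea--a parser subclass whose start handler just collects hrefs. 
The class name LinkGrabber is made up for illustration:

```perl
use strict;
use warnings;
use HTML::Parser ();

package LinkGrabber;
our @ISA = qw(HTML::Parser);

sub new {
    my $pkg = shift;
    my $this = $pkg->SUPER::new(api_version => 3);
    # Only the start handler is needed to collect links.
    $this->handler('start' => 'start_h', 'self, tagname, attr');
    $this->{links} = [];
    return $this;
}

sub start_h {
    my ($this, $tagname, $attr) = @_;
    # Collect hrefs, skipping bare fragment self-references like "#top".
    if ($tagname eq 'a' && $attr->{href} && substr($attr->{href}, 0, 1) ne '#') {
        push @{$this->{links}}, $attr->{href};
    }
}

package main;

my $p = LinkGrabber->new;
$p->parse('<a href="/one.html">one</a> <a href="#skip">skip</a>');
$p->eof;
print "$_\n" for @{$p->{links}};    # /one.html
```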


Of course, you are going to have to do things like remember the URL of 
the page you just fetched so that you can make the hrefs 
absolute.  Look at the URI module for that:

                     $uri = URI->new_abs(lc($attr->{href}), $this->{urldir});
                     $href = $uri->canonical();

That'll take the href we got and the directory of the current page, 
and produce an absolute URL for the href.
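
For instance (the URLs here are made-up examples):

```perl
use strict;
use warnings;
use URI;

# Resolve a relative href against the URL of the page it came from,
# then normalize it.
my $uri  = URI->new_abs('../images/logo.gif', 'http://example.com/docs/page.html');
my $href = $uri->canonical();
print "$href\n";    # http://example.com/images/logo.gif
```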



Hope that helps.  Of course, without much programming background it's tough.

The pieces I described are actually part of a project I'm 
currently working on for a client.  When it's done I plan on 
releasing it as a general web page parser, but it's nowhere near 
ready yet.
-- 

Kee Hinckley - Somewhere.Com, LLC - Cyberspace Architects
Now Playing - Folk, Rock, odd stuff - http://www.somewhere.com/playlist.cgi
Now Writing - Technosocial buzz - http://commons.somewhere.com/buzz/

I'm not sure which upsets me more: that people are so unwilling to accept
responsibility for their own actions, or that they are so eager to regulate
everyone else's.

