At 9:57 AM -0400 4/5/01, Van Schooenderwoert, Nancy wrote:
>I'd be interested in seeing how to do the below... can you post a code
>snippet to the list, or is it on a webpage? (I've been away - am catching up
>on prev. PerlMonger mail).
I'm writing a screen-scraper right now that takes pages off of one
web site and puts them on another (two sites at the same company,
don't ask).
Anyway, someone asked me how to do a crawler out of the blue (I mean
out of the blue, it was a random email from someone to
[EMAIL PROTECTED]), and I sent them this. I have no idea if
it'll help you, but....
When I'm done with it I plan on cleaning it up a bit and releasing
it. The way I have it set up, you only need to write a few dozen
lines to get data from a new web page--but right now it's
all specific to the client's web site.
At 3:22 AM +0100 3/30/01, Majid Ali wrote:
>i need to create a simple web crawler in perl and i am finding it difficult
>what steps need to be taken when creating one ..i dont have much programming
>background...will i be able to download one off the internet
>
>
>thanks
Well, I have no idea why you are asking [EMAIL PROTECTED], but....
Take a look at LWP::UserAgent and HTML::Parser. In particular:
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request;
use URI::Escape qw(uri_escape);

$UA = LWP::UserAgent->new;
$Cookies = HTTP::Cookies->new;
$UA->cookie_jar($Cookies);
$UA->timeout(60);
$UA->agent("Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)");

# Build the urlencoded POST body from the login form fields.
foreach $key (keys %LoginData) {
    $content .= uri_escape($key) . '=' . uri_escape($LoginData{$key}) . '&';
}
chop $content;   # drop the trailing '&'

$req = HTTP::Request->new(POST => $LoginURL);
$req->content($content);
$req->content_type('application/x-www-form-urlencoded');
$req->content_length(length($content));
$res = $UA->request($req);
# This particular site answers a successful login with a redirect.
if (!$res->is_redirect) {
    die("$LoginURL: Login Failed\n" . $res->error_as_HTML());
}
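As an aside, the append-and-chop loop above can be written more compactly with map and join. A standalone sketch (the field names in %LoginData are made up for illustration; sorting the keys just makes the output deterministic):

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Same idea as the foreach/chop loop: escape each key and value,
# then join the pairs with '&' -- no trailing '&' to chop.
my %LoginData = ( user => 'alice', pass => 'p w' );
my $content = join '&',
    map { uri_escape($_) . '=' . uri_escape($LoginData{$_}) }
    sort keys %LoginData;
print "$content\n";  # pass=p%20w&user=alice
```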
$req = HTTP::Request->new(GET => $url);
$req->referer($DefaultURL);
while (($head, $value) = each(%headers)) {
    # Skip If-Modified-Since when we want to force a full fetch.
    next if ($AlwaysFetch && lc($head) eq 'if-modified-since');
    $req->header($head, $value);
}
$res = $UA->request($req);
if ($res->is_success) {
    # Hand back a tied filehandle on the content (plus the base URL
    # in list context).
    if (wantarray()) {
        return (IO::Scalar->new_tie($res->content_ref), $res->base());
    } else {
        return IO::Scalar->new_tie($res->content_ref);
    }
}
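If it isn't obvious what the caller gets back: IO::Scalar ties a filehandle to an in-memory scalar, so the fetched page can be read line by line like a file. A standalone sketch, with a literal string standing in for $res->content_ref:

```perl
use strict;
use warnings;
use IO::Scalar;

# new_tie wraps a scalar ref in a filehandle; <$fh> then reads it
# line by line, just as it would a real file.
my $content = "<html>\n<body>hi</body>\n</html>\n";
my $fh = IO::Scalar->new_tie(\$content);
my $first = <$fh>;
print $first;   # <html>
```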
You won't want all of that, but it gives you an idea of how to
fetch the page; the second part is parsing it.
For that, create a subclass of HTML::Parser
use HTML::Parser 3 ();
@ISA = qw(HTML::Parser);
and define some handlers. For crawling, all you really need is the
start handler:
sub new {
    my $pkg = shift;
    my $this = $pkg->SUPER::new(api_version => 3, strict_comment => 0,
                                unbroken_text => 1);
    $this->handler('start'   => 'start_h',   'self, tagname, attr, attrseq, text');
    $this->handler('end'     => 'end_h',     'self, tagname, text');
    $this->handler('text'    => 'text_h',    'self, text');
    $this->handler('comment' => 'comment_h', 'self, text');
    $this->handler('default' => 'default_h', 'self, tagname, text');
    return $this;
}
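To see the event API work end to end without the subclass scaffolding, here is a minimal standalone sketch that registers a start handler as a closure and collects hrefs from anchor tags (the skip-fragments test mirrors the one in my start handler below; HTML::Parser 3.x assumed):

```perl
use strict;
use warnings;
use HTML::Parser ();

# Collect hrefs from <a> tags, skipping bare "#fragment" self references.
my @links;
my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tagname, $attr) = @_;
        push @links, $attr->{href}
            if $tagname eq 'a'
            && defined $attr->{href}
            && substr($attr->{href}, 0, 1) ne '#';
    }, 'tagname, attr' ],
);
$p->parse('<a href="http://example.com/">x</a> <a href="#top">y</a>');
$p->eof;
print "@links\n";  # http://example.com/
```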
The start handler picks up its arguments and then has some code like this:
} elsif ($tagname eq 'a' && $attr->{href} &&
         substr($attr->{href}, 0, 1) ne '#') {
    # we have an href, and it's not a self reference.
    #!! Note: currently we don't catch the case of a fully
    #   qualified self reference.
    my $uri = URI->new_abs(lc($attr->{href}), $this->{urldir});
    my $href = $uri->canonical();
    my $base = quotemeta($this->{urldir});
    my $gotmatch;
Of course, you are going to have to do things like remember the URL of
the page you just fetched so that you can make the hrefs all
absolute. Look at the URI module for that:
$uri = URI->new_abs(lc($attr->{href}), $this->{urldir});
$href = $uri->canonical();
That takes the href we got and the directory of the current page,
and produces an absolute URL for the href.
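A concrete standalone example of that resolution (the URLs are made up; canonical() lowercases the scheme and host, among other normalizations):

```perl
use strict;
use warnings;
use URI;

# Resolve a relative href against the base URL of the current page,
# then normalize it.
my $base = 'http://www.Example.COM/dir/page.html';
my $uri  = URI->new_abs('../other/doc.html', $base);
my $href = $uri->canonical;
print "$href\n";   # http://www.example.com/other/doc.html
```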
Hope that helps. Of course, without much programming background it's tough.
All the pieces I described are part of a project I'm
currently working on for a client. When it's done I plan on
releasing it as a general web page parser, but it's nowhere near
ready yet.
--
Kee Hinckley - Somewhere.Com, LLC - Cyberspace Architects
Now Playing - Folk, Rock, odd stuff - http://www.somewhere.com/playlist.cgi
Now Writing - Technosocial buzz - http://commons.somewhere.com/buzz/
I'm not sure which upsets me more: that people are so unwilling to accept
responsibility for their own actions, or that they are so eager to regulate
everyone else's.