Bruce,

Hehe... Gotta hand it to ya, you posted code this time :)

Here's what's going on. WWW::CheckSite::Spider instantiates whatever Mech
class you tell it to use, and you told it to use a class named BA_Mech. So
your BA_Mech class has to supply the login logic itself. Make BA_Mech a
subclass of WWW::Mechanize and override only the method(s) that need to
change. For standard HTTP server-based Basic Authentication, that would
just mean providing a get_basic_credentials() method that returns your
username and password. But since your site uses a form-based login, you'd
probably need to override new() to call SUPER::new(), perform the form
login, and then return the Mech object.
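
For illustration, a subclass along those lines might look like the sketch
below. The form name, button name, field names, and URL are copied from
your code further down; they're assumptions about the site, not something
I've tested:

```perl
# Sketch only: assumes the Form1/Button1/username/password names from
# your posted code, and that the spider constructs BA_Mech->new(...)
# with arguments WWW::Mechanize::new() understands.
package BA_Mech;
use strict;
use warnings;
use parent 'WWW::Mechanize';

sub new {
    my ( $class, %args ) = @_;
    my $self = $class->SUPER::new(%args);

    # Do the form-based login before handing the object back, so the
    # spider starts out with an authenticated session (cookies kept).
    $self->get('http://jobboardsoftware.biz/demo/admin/login.aspx');
    $self->submit_form(
        form_name => 'Form1',
        button    => 'Button1',
        fields    => { username => 'demo', password => 'demo' },
    );

    return $self;
}

# For standard HTTP Basic Authentication you would instead only need:
# sub get_basic_credentials { return ( 'demo', 'demo' ) }

1;
```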

If you're not familiar with OO and subclassing, you'll probably find it
easier to use my second suggestion: call $mech->extract_links on each page
you fetch, store the links in an array or hash to keep track of which ones
you still need to follow, and keep following them until none are left.
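
A rough sketch of that loop is below. Note that recent WWW::Mechanize
spells extract_links() as links(). The same-host check and the starting
URL are my assumptions; error handling is minimal:

```perl
# Hand-rolled breadth-first spider: fetch, scrape, queue new links,
# repeat until the queue is empty. Assumes $mech is already logged in.
use strict;
use warnings;
use WWW::Mechanize;

my $mech  = WWW::Mechanize->new();
my $start = 'http://jobboardsoftware.biz/demo/';
my %seen;                       # URLs already fetched (or queued)
my @queue = ($start);

while ( my $url = shift @queue ) {
    next if $seen{$url}++;      # skip anything we've seen before
    $mech->get($url);
    next unless $mech->success && $mech->is_html;

    # ... scrape $mech->content here, as you fetch ...

    for my $link ( $mech->links ) {
        my $abs = $link->url_abs->as_string;
        push @queue, $abs
            if $abs =~ m{^\Qhttp://jobboardsoftware.biz/demo/\E}
            && !$seen{$abs};    # stay on-site, avoid re-queuing
    }
}
```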

If the goal of all this is to mirror the pages and /then/ scrape them,
you're making it too complicated. Skip the mirroring and scrape as you
fetch.

- Mark.



> -----Original Message-----
> From: bruce [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, October 27, 2005 5:04 PM
> To: Thomas, Mark - BLS CTR; 'perl-win32-users mailing list'
> Subject: RE: spidering/crawling/scraping a site..
> 
> hi...
> 
> decided to try to use the www::checksite::spider to try to 
> create/write a
> quick spider for the http://jobboardsoftware.biz/demo/admin/login.aspx
> site...
> 
> i blew it!!!
> 
> the following code gives me some sort of hash, but i'm pretty 
> sure i haven't
> correctly filled in the login (user/passwd) form correctly...
> 
> any thoughts??
> 
> i'm not exactly sure what the BA_Mech is doing, or why it 
> might be needed. i
> tried to do the form submit directly from the spider, but the 
> perl code
> threw an error..
> 
> so, basically, i'm guessing!!!
> 
> 
> package BA_Mech;
> use base 'WWW::Mechanize';
> 
> $mech = WWW::Mechanize->new();
> my $start1 = "http://jobboardsoftware.biz/demo/admin/login.aspx";
> $mech->get($start1);
> $mech->submit_form(
>        form_name => 'Form1',
>        button    => 'Button1',
>        fields => {
>                username => 'demo',
>                password => 'demo'
>                  }
>        );
> 
> 
> package Main;
> use WWW::CheckSite::Spider;
> 
> my $start = "http://jobboardsoftware.biz/demo/admin/login.aspx";
> 
> my $sp = WWW::CheckSite::Spider->new(
>         ua_class => 'BA_Mech',
>         uri => $start,
> );
> 
> while (my $page = $sp->get_page)
> {
> print $page;
> print "\n";
> }
> die;
> 
> -bruce
> 
> 
> 
> 
> -----Original Message-----
> From: Thomas, Mark - BLS CTR [mailto:[EMAIL PROTECTED]
> Sent: Thursday, October 27, 2005 10:34 AM
> To: '[EMAIL PROTECTED]'
> Subject: RE: spidering/crawling/scraping a site..
> 
> 
> OK, there's a difference between using standard HTTP 
> authentication (the
> browser dialog box) and form-based authentication (which every site
> implements differently). If it is the former, most mirroring 
> tools can do it
> already. But if authentication is done through the web app like your
> example, you'll have to use Mech.
> 
> - Mark.
> 
> > -----Original Message-----
> > From: bruce [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, October 27, 2005 1:22 PM
> > To: Thomas, Mark - BLS CTR
> > Subject: RE: spidering/crawling/scraping a site..
> >
> > thanks!!!
> >
> > i would have thought that there would have been a bunch of
> > little/big apps
> > that were used to parse user/passwd protected login form sites...
> >
> > guess i was wrong..
> >
> >
> > but this isn't for some evil/take over the world project.
> > i've got a few
> > sites that i'm looking at, that are passwd/login protected.
> > rather than
> > login to the sites, i thought i'd scrape them, and then compare them
> > locally, when i wanted. which is why i was saying i didn't
> > think this was
> > going to be more than a few minutes....
> >
> > -bruce
> >
> >
> > -----Original Message-----
> > From: Thomas, Mark - BLS CTR [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, October 27, 2005 9:51 AM
> > To: '[EMAIL PROTECTED]'
> > Subject: RE: spidering/crawling/scraping a site..
> >
> >
> > Oh, you want a MIRRORING app that will mirror stuff that you
> > have to log in
> > to get! Correct terminology is everything. What nefarious
> > purposes do you
> > want that for?
> >
> > What you need is something that can use a Mech object to
> > spider with. You
> > get the Mech object logged in, then pass it to the spider,
> > which does its
> > dirty deed.
> >
> > You're probably the only person in the world that wants to do
> > that. So you
> > probably won't find anything that does it out-of-the-box, but
> > I found a
> > module that uses a Mech object to mirror a site:
> > WWW::CheckSite::Spider. It
> > only takes a start URI, but it's probably a small
> > modification to make it
> > accept a Mech object that is already logged in. Then you'd be
> > able to scrape
> > all the pages.
> >
> > Of course writing a Mech spider yourself is an option, and I
> > say it would be
> > simple, as there is already an extract_links() function. Just
> > do that on
> > every page you visit, push the links onto a @links_to_check
> > array, and keep
> > fetching until the array is empty! Probably a 5 minute task.
> >
> > - Mark.
> >
> > > -----Original Message-----
> > > From: bruce [mailto:[EMAIL PROTECTED]
> > > Sent: Thursday, October 27, 2005 12:27 PM
> > > To: Thomas, Mark - BLS CTR
> > > Subject: RE: spidering/crawling/scraping a site..
> > >
> > > mark,
> > >
> > > i already knew the mech part.. and i know i can write/create a
> > > crawler. but
> > > that wouldn't take the 5 mins i thought this task would take!!!
> > >
> > > i was looking for a solution that may have already been
> > > created, which was
> > > the initial post.
> > >
> > > i had thought wget would have been suitable, but it has no
> > > provision for the
> > > user/passwd form. i also thought about using a perl script
> > > with mech, and
> > > then calling wget to allow the rest of the site to be
> > > crawled.. didn't work
> > > either... so i was looking for an actual crawling app.. if i
> > > could find a
> > > quick/easy one, i can modify it for my needs, as opposed to
> > > writing one...
> > >
> > > this is what i was looking for, but i really do appreciate
> > > your assistance!!
> > >
> > > -bruce
> > >
> > >
> > > -----Original Message-----
> > > From: Thomas, Mark - BLS CTR [mailto:[EMAIL PROTECTED]
> > > Sent: Thursday, October 27, 2005 9:21 AM
> > > To: '[EMAIL PROTECTED]'
> > > Subject: RE: spidering/crawling/scraping a site..
> > >
> > >
> > >
> > > > but thanks for the laugh! i was referring to a generalizable
> > > > app that i
> > > > could modify to crawl through the site to get the underlying
> > > > information, as opposed to the one page.
> > >
> > > Bruce, I'm telling you, Mechanize can get what you want to
> > > get. Trust me.
> > >
> > > I've added ONE LINE and now it gets the statistics page. See
> > > below. Parse
> > > what you want out of it.
> > >
> > > You should try Mechanize! You'll like it! Seriously, it's
> > > like a browser
> > > with a remote control. You can do ANYTHING a browser can do.
> > >
> > > P.S.
> > > I recommend XML::LibXML to parse the HTML, because it makes
> > > extracting the
> > > information from HTML very easy. For example, grabbing the
> > > "Avg. Job Post
> > > Duration" would be this line:
> > >
> > > print $page->findvalue('//[EMAIL PROTECTED]'); #prints
> > > "57.5 days"
> > >
> > >
> > > P.P.S.
> > > It seems every few weeks you pop up on the list and ask a
> > > question for which
> > > the answer is WWW::Mechanize. And that's what I tell you. In
> > > the future,
> > > unless you post Mechanize code you need help with, I'M NOT
> > > GOING TO HELP.
> > > This is getting tedious.
> > >
> > > P.P.P.S.
> > >
> > > Here's the code with the added line you need to get to the
> > > statistics page.
> > > You can get to other pages too, so don't even think about
> > > asking *that*
> > > question! X-/
> > >
> > > #!/usr/bin/perl -w
> > >
> > > use WWW::Mechanize;
> > > my $start_url = 
> 'http://jobboardsoftware.biz/demo/admin/login.aspx';
> > >
> > > my $mech = WWW::Mechanize->new();
> > > $mech->get($start_url);
> > > $mech->submit_form(
> > >     form_name => 'Form1',
> > >     button    => 'Button1',
> > >     fields => {
> > >                username=>'demo',
> > >                password=>'demo',
> > >               },
> > >     );
> > > $mech->follow_link( url_regex => qr/statistics/ );
> > > print $mech->content;
> > >
> > >
> >
> >
> 
> 

_______________________________________________
Perl-Win32-Users mailing list
Perl-Win32-Users@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
