Very Basic Web Scrape

2006-04-07 Thread kc68
I'm trying to learn web scraping and am stopped at the basic point of
scraping a portion of a web page.  I'm able to scrape a full page and save
it as *.xml or *.htm, and I think I understand regex, but the following
fails:


# Prints a portion of a red cross web page to a new htm file.

use strict;
use warnings;
use LWP::Simple;
use WWW::Mechanize;

my $url = 'http://www.redcrossnca.org/ServiceCenters/montgomery.php3';

getstore( $url, 'c://redcross.htm' );

open PAGE, 'c://redcross.htm';
while( my $line = <PAGE> ) {
$line =~ /Health and Safety Classes/
print "$1\n";
}

close PAGE;


Once I get the syntax straight I'll go after more detailed scrapes.

Ken

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response




Re: Very Basic Web Scrape

2006-04-07 Thread Oliver Block
Hi,

> I understand regex, but the following fails:
> open PAGE, 'c://redcross.htm';
> while( my $line = <PAGE> ) {
> $line =~ /Health and Safety Classes/
> print "$1\n";
> }

What fails? You forgot a ';' after the regex, but I guess that's not what
you mean!? :)

cu,

Oliver







Re: Very Basic Web Scrape

2006-04-07 Thread kc68
On Fri, 07 Apr 2006 16:02:53 -0400, Oliver Block [EMAIL PROTECTED] wrote:

> Hi,
>
>> I understand regex, but the following fails:
>> open PAGE, 'c://redcross.htm';
>> while( my $line = <PAGE> ) {
>> $line =~ /Health and Safety Classes/
>> print "$1\n";
>> }
>
> What fails? You forgot a ';' after the regex, but I guess that's not
> what you mean!? :)
>
> cu,
>
> Oliver


Now that was pretty basic.  So now the script runs, but I get the full
page.  I was trying to limit the result to the words /Health and Safety
Classes/ that appear on the page.  How do I get there?


Ken







Re: Very Basic Web Scrape

2006-04-07 Thread Joshua Colson
On Fri, 2006-04-07 at 16:36 -0400, [EMAIL PROTECTED] wrote:
> On Fri, 07 Apr 2006 16:02:53 -0400, Oliver Block [EMAIL PROTECTED]
> wrote:
>
>> Hi,
>>
>>> I understand regex, but the following fails:
>>> open PAGE, 'c://redcross.htm';
>>> while( my $line = <PAGE> ) {
>>> # $line =~ /Health and Safety Classes/
>>> # print "$1\n";

print "$1\n" if $line =~ /Health and Safety Classes/;

>>> }
>>
>> What fails? You forgot a ';' after the regex, but I guess that's not
>> what you mean!? :)
>>
>> cu,
>>
>> Oliver
>
> Now that was pretty basic.  So now the script runs, but I get the full
> page.  I was trying to limit the result to the words /Health and Safety
> Classes/ that appear on the page.  How do I get there?
>
> Ken






Re: Very Basic Web Scrape

2006-04-07 Thread Oliver Block
On Friday, 7 April 2006 22:36, [EMAIL PROTECTED] wrote:

> I was trying to limit the result to the words /Health and Safety
> Classes/ that appear on the page.  How do I get there?

First you need to understand regex! :)

> open PAGE, 'c://redcross.htm';
> while( my $line = <PAGE> ) {
> $line =~ /Health and Safety Classes/
> print "$1\n";
> }

$1 has no value because your regex contains no capturing groups
(parentheses).

The easiest fix, based on your code, is to print each line that matches:

while( <PAGE> ) {
   print if /Health and Safety Classes/;
}
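
As a rough, self-contained sketch of the same line filter: the sample HTML
below is made up (it is not the real Red Cross page), and the lexical
filehandle with a three-argument open is an assumed variation on the
bareword PAGE handle used above, chosen so that a failed open is reported
rather than silently ignored.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the saved page; the real script would
# open 'c://redcross.htm' instead.
my $html = <<'END_HTML';
<html><body>
<h1>Montgomery County Service Center</h1>
<p>Health and Safety Classes are offered weekly.</p>
<p>Volunteer opportunities</p>
</body></html>
END_HTML

# Opening a scalar reference gives an in-memory filehandle, so the
# example runs without touching the disk or the network.
open my $page, '<', \$html or die "Cannot open page: $!";

my @matches;
while ( my $line = <$page> ) {
    # Keep only the lines that contain the phrase.
    push @matches, $line if $line =~ /Health and Safety Classes/;
}
close $page;

print @matches;
```

To run it against the saved file, replace \$html in the open call with the
file name.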

Have a look at:

perldoc perlopentut
perldoc perlretut
perldoc perlre
perldoc LWP

cu,

Oliver







Re: Very Basic Web Scrape

2006-04-07 Thread Dave Gray
On 4/7/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> On Fri, 07 Apr 2006 16:02:53 -0400, Oliver Block [EMAIL PROTECTED]
> wrote:
>>> I understand regex, but the following fails:
>>> open PAGE, 'c://redcross.htm';
>>> while( my $line = <PAGE> ) {
>>> $line =~ /Health and Safety Classes/
>>> print "$1\n";
>>> }
>>
>> What fails? You forgot a ';' after the regex, but I guess that's not
>> what you mean!? :)
>
> Now that was pretty basic.  So now the script runs, but I get the full
> page.  I was trying to limit the result to the words /Health and Safety
> Classes/ that appear on the page.  How do I get there?

$1 refers to the first parenthesized group in your regular expression.
So if you change your code to look like:

  if ($line =~ /(Health and Safety Classes)/) {
    print "found a match: [$1]\n";
  }

then $1 will hold the text that matched inside the parens. Obviously,
$1 captures more interesting things once you include some
meta-characters.
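
A small sketch of what that might look like, assuming a made-up line of
HTML (the markup and the "CPR, First Aid" text are invented for
illustration, not taken from the real page). The group captures whatever
sits inside the round brackets that follow the phrase:

```perl
use strict;
use warnings;

# Hypothetical line of HTML, not taken from the real page.
my $line = '<li><a href="classes.php3">Health and Safety Classes</a> (CPR, First Aid)</li>';

my $topics;
# .*?      skips as little as possible after the phrase
# \(       matches a literal opening parenthesis
# ([^)]+)  captures one or more characters that are not ')'
# \)       matches the literal closing parenthesis
if ( $line =~ /Health and Safety Classes.*?\(([^)]+)\)/ ) {
    $topics = $1;
    print "found topics: [$topics]\n";
}
```

Copying $1 into a named variable straight after the match is a defensive
habit, since the next successful match overwrites it.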

http://perldoc.perl.org/perlre.html





Re: Very Basic Web Scrape

2006-04-07 Thread Jaime Murillo
On Friday 07 April 2006 13:15, [EMAIL PROTECTED] wrote:
> I'm trying to learn web scraping and am stopped at the basic point of
> scraping a portion of a web page.  I'm able to scrape a full page and
> save it as *.xml or *.htm, and I think I understand regex, but the
> following fails:
>
> # Prints a portion of a red cross web page to a new htm file.
>
> use strict;
> use warnings;
> use LWP::Simple;
> use WWW::Mechanize;
>
> my $url = 'http://www.redcrossnca.org/ServiceCenters/montgomery.php3';
>
> getstore( $url, 'c://redcross.htm' );
>
> open PAGE, 'c://redcross.htm';
> while( my $line = <PAGE> ) {
> $line =~ /Health and Safety Classes/
> print "$1\n";
> }
>
> close PAGE;
>
> Once I get the syntax straight I'll go after more detailed scrapes.
>
> Ken

Have you looked into HTML::TokeParser? It might help with
your web scraping needs.  You can read a great article by
Chris Ball at:

http://www.perl.com/pub/a/2003/01/22/mechanize.html
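
For a rough idea of how HTML::TokeParser could be applied here, the sketch
below walks the links in a page and keeps the one whose text contains the
phrase. The markup and href values are hypothetical stand-ins, not the
real page, and this is only one possible approach, not the one from the
article.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup standing in for the saved page.
my $html = '<ul><li><a href="/classes.php3">Health and Safety Classes</a></li>'
         . '<li><a href="/volunteer.php3">Volunteer</a></li></ul>';

# Passing a scalar reference makes the parser read from the string;
# a file name could be passed instead to parse the saved page.
my $parser = HTML::TokeParser->new( \$html )
    or die "Cannot create parser";

my @links;
# get_tag('a') skips ahead to each opening <a> tag in turn.
while ( my $tag = $parser->get_tag('a') ) {
    my $href = $tag->[1]{href};                  # element 1 is the attribute hash
    my $text = $parser->get_trimmed_text('/a');  # text up to the closing </a>
    push @links, [ $text, $href ] if $text =~ /Health and Safety Classes/;
}

print "$_->[0] -> $_->[1]\n" for @links;
```

Unlike a line-by-line regex, this keeps working when a tag and its text
are split across several lines of the source.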
