Very Basic Web Scrape
I'm trying to learn web scraping and am stuck at the basic step of scraping a portion of a web page. I'm able to scrape a full page and save it as *.xml or *.htm, and I think I understand regex, but the following fails:

    # Prints a portion of a red cross web page to a new htm file.
    use strict;
    use warnings;
    use LWP::Simple;
    use WWW::Mechanize;

    my $url = 'http://www.redcrossnca.org/ServiceCenters/montgomery.php3';
    getstore( $url, 'c://redcross.htm' );

    open PAGE, 'c://redcross.htm';
    while( my $line = <PAGE> ) {
        $line =~ /Health and Safety Classes/
        print "$1\n";
    }
    close PAGE;

Once I get the syntax straight I'll go after more detailed scrapes.

Ken

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/ http://learn.perl.org/first-response
Re: Very Basic Web Scrape
Hi,

> I understand regex, but the following fails:
>
>     open PAGE, 'c://redcross.htm';
>     while( my $line = <PAGE> ) {
>         $line =~ /Health and Safety Classes/
>         print "$1\n";
>     }

What fails? You forgot a ';' after the regex, but I guess that's not what you mean!? :)

cu, Oliver
Re: Very Basic Web Scrape
On Fri, 07 Apr 2006 16:02:53 -0400, Oliver Block [EMAIL PROTECTED] wrote:
> Hi,
>
> > I understand regex, but the following fails:
> >
> >     open PAGE, 'c://redcross.htm';
> >     while( my $line = <PAGE> ) {
> >         $line =~ /Health and Safety Classes/
> >         print "$1\n";
> >     }
>
> What fails? You forgot a ';' after the regex, but I guess that's not what you mean!? :)
>
> cu, Oliver

Now that was pretty basic. So now the script runs, but I get the full page. I was trying to limit the result to the words /Health and Safety Classes/ that appear on the page. How do I get there?

Ken
Re: Very Basic Web Scrape
On Fri, 2006-04-07 at 16:36 -0400, [EMAIL PROTECTED] wrote:
> On Fri, 07 Apr 2006 16:02:53 -0400, Oliver Block [EMAIL PROTECTED] wrote:
> > Hi,
> >
> > > I understand regex, but the following fails:
> > >
> > >     open PAGE, 'c://redcross.htm';
> > >     while( my $line = <PAGE> ) {

            # $line =~ /Health and Safety Classes/
            # print "$1\n";
            # The parentheses capture the matched text into $1.
            print "$1\n" if $line =~ /(Health and Safety Classes)/;

> > >     }
> >
> > What fails? You forgot a ';' after the regex, but I guess that's not what you mean!? :)
> >
> > cu, Oliver
>
> Now that was pretty basic. So now the script runs, but I get the full page. I was trying to limit the result to the words /Health and Safety Classes/ that appear on the page. How do I get there?
>
> Ken
Re: Very Basic Web Scrape
On Friday, 7 April 2006 22:36, [EMAIL PROTECTED] wrote:
> I was trying to limit the result to the words /Health and Safety Classes/ that appear on the page. How do I get there?

First you need to understand regex! :)

>     open PAGE, 'c://redcross.htm';
>     while( my $line = <PAGE> ) {
>         $line =~ /Health and Safety Classes/
>         print "$1\n";
>     }

$1 has no value because you did not use groups. The easiest fix, based on your code, is:

    while( <PAGE> ) {
        print if /Health and Safety Classes/;
    }

Have a look at:

    perldoc perlopentut
    perldoc perlretut
    perldoc perlre
    perldoc LWP

cu, Oliver
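
A minimal sketch of the line filter suggested above, written so it runs without network access. The sample markup and the helper name `matching_lines` are assumptions for illustration only; the thread does not show the real page, and an in-memory filehandle stands in for the saved 'c://redcross.htm'.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the saved page; the real markup
# is not shown in the thread.
my $html = <<'END_HTML';
<html><body>
<h1>Montgomery County Service Center</h1>
<a href="classes.php3">Health and Safety Classes</a>
<p>Volunteer opportunities</p>
</body></html>
END_HTML

# Return only the lines of $text that match $pattern.
sub matching_lines {
    my ( $text, $pattern ) = @_;

    # Open the string as an in-memory file so it can be read line by line,
    # exactly as a real file opened with a three-argument open would be.
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

    my @hits;
    while ( my $line = <$fh> ) {
        push @hits, $line if $line =~ $pattern;
    }
    close $fh;
    return @hits;
}

# Prints just the line containing the phrase, not the whole page.
print matching_lines( $html, qr/Health and Safety Classes/ );
```

To use it on the downloaded file instead, open that file with a lexical filehandle and the three-argument form of open, and check the return value with `or die`.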
Re: Very Basic Web Scrape
On 4/7/06, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> On Fri, 07 Apr 2006 16:02:53 -0400, Oliver Block [EMAIL PROTECTED] wrote:
> > > I understand regex, but the following fails:
> > >
> > >     open PAGE, 'c://redcross.htm';
> > >     while( my $line = <PAGE> ) {
> > >         $line =~ /Health and Safety Classes/
> > >         print "$1\n";
> > >     }
> >
> > What fails? You forgot a ';' after the regex, but I guess that's not what you mean!? :)
>
> Now that was pretty basic. So now the script runs, but I get the full page. I was trying to limit the result to the words /Health and Safety Classes/ that appear on the page. How do I get there?

$1 refers to the first parenthesized group in your regular expression. So if you change your code to look like:

    if ($line =~ /(Health and Safety Classes)/) {
        print "found a match: [$1]\n";
    }

then $1 will refer to the literal text inside the parens. Obviously, $1 matches more interesting things once you include some meta-characters.

http://perldoc.perl.org/perlre.html
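
To illustrate the point about meta-characters, here is a small sketch in which two capture groups pull the URL and the link text out of an anchor. The sample line and the helper name `extract_link` are hypothetical; the real page may be marked up quite differently, and a regex like this only copes with simple, single-line anchors.

```perl
use strict;
use warnings;

# Hypothetical line of markup; not taken from the real page.
my $line = '<li><a href="classes.php3">Health and Safety Classes</a></li>';

# Capture the URL ($1) and the link text ($2) of the first anchor on the line.
sub extract_link {
    my ($text) = @_;
    if ( $text =~ /<a\s+href="([^"]*)"[^>]*>([^<]*)<\/a>/i ) {
        return ( $1, $2 );
    }
    return;    # empty list when nothing matches
}

my ( $href, $label ) = extract_link($line);
print "found a match: [$label] -> [$href]\n" if defined $href;
```

Each pair of parentheses fills the next numbered variable, so the first group lands in $1 and the second in $2. Both are only meaningful after a successful match, which is why the sub tests the match before returning them.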
Re: Very Basic Web Scrape
On Friday 07 April 2006 13:15, [EMAIL PROTECTED] wrote:
> I'm trying to learn web scraping and am stuck at the basic step of scraping a portion of a web page. I'm able to scrape a full page and save it as *.xml or *.htm, and I think I understand regex, but the following fails:
>
>     # Prints a portion of a red cross web page to a new htm file.
>     use strict;
>     use warnings;
>     use LWP::Simple;
>     use WWW::Mechanize;
>
>     my $url = 'http://www.redcrossnca.org/ServiceCenters/montgomery.php3';
>     getstore( $url, 'c://redcross.htm' );
>
>     open PAGE, 'c://redcross.htm';
>     while( my $line = <PAGE> ) {
>         $line =~ /Health and Safety Classes/
>         print "$1\n";
>     }
>     close PAGE;
>
> Once I get the syntax straight I'll go after more detailed scrapes.
>
> Ken

Have you looked into HTML::TokeParser? It might help with your web scraping needs. You can read a great article by Chris Ball at:

http://www.perl.com/pub/a/2003/01/22/mechanize.html
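
A rough sketch of what the HTML::TokeParser approach might look like, assuming the module is installed from CPAN (it ships with the HTML-Parser distribution and is not part of core Perl). The sample markup and the helper name `links_matching` are made up for illustration; the structure of the real page is not shown in the thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample standing in for the downloaded page.
my $html = <<'END_HTML';
<html><body>
<ul>
  <li><a href="classes.php3">Health and Safety Classes</a></li>
  <li><a href="volunteer.php3">Volunteer Opportunities</a></li>
</ul>
</body></html>
END_HTML

# Return [ text, href ] pairs for every link whose text matches $pattern.
sub links_matching {
    my ( $markup, $pattern ) = @_;

    # A scalar reference tells the parser to treat the string as the document.
    my $parser = HTML::TokeParser->new( \$markup )
        or die "Cannot create parser";

    my @links;
    while ( my $tag = $parser->get_tag('a') ) {
        my $href = $tag->[1]{href};                   # attribute hash of <a>
        my $text = $parser->get_trimmed_text('/a');   # text up to </a>
        push @links, [ $text, $href ]
            if defined $href && $text =~ $pattern;
    }
    return @links;
}

for my $link ( links_matching( $html, qr/Health and Safety/ ) ) {
    print "$link->[0] => $link->[1]\n";
}
```

Compared with matching raw lines against a regex, a tokenizing parser does not care how the markup is broken across lines, which tends to matter once the scrapes get more detailed.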