On Sun, 2011-01-16 at 04:48 -0800, Carl Wells wrote:
> Hi,
>
> I hope you don't mind my newbie question. I'm new to web programming (and
> indeed somewhat rusty with programming in general). I'm out of work and
> trying to teach myself C++, Perl, SQL and other skills, and to do this I've
> set myself a project. As part of this project I need to access data from
> this URL:
>
> http://www.reuters.com/finance/stocks/incomeStatement/detail?perType=ANN&symbol=BATS.L
>
> The problem I'm having is that this redirects to the reuters.com login page.
> I've tried both using existing cookie files from Internet Explorer (I had
> to rename these because the name of the cookie involved my user name, which
> incorporates a space and an @, e.g. fred bumble...@honeypot.org, and Perl
> didn't seem to like that / my syntax was wrong) and setting up Perl to
> receive a new cookie from the site. Neither has worked for me. I've spent
> the past three days trying to glue bits of code together from various Google
> results and the CPAN module descriptions for LWP and Mechanize.
> An example of code that's not working for me is below:
>
>     #!/usr/local/bin/perl -w
>     use strict;
>     use Crypt::SSLeay;
>     use LWP::UserAgent;
>     use LWP::Simple;
>     use HTTP::Request::Common qw(POST);
>     use HTTP::Cookies;
>
>     my $ua = LWP::UserAgent->new;
>     my $cookie_jar = HTTP::Cookies->new(
>         file     => "lwpcookies2.txt",
>         autosave => 1,
>     );
>     $ua->cookie_jar($cookie_jar);
>     $ua->agent('Mozilla/5.0');
>
>     my $url = 'https://commerce.us.reuters.com/login/pages/login/login.do';
>     my $req = POST $url, ['login' => 'Fredbumblebee', 'password' => 'BzzZZZ!'];
>     my $res = $ua->request($req);
>     $cookie_jar->extract_cookies($res);
>
>     if ($res->is_success) {
>         # print out result to look at headers
>         print $res->as_string;
>
>         # access page with cookie secured after logging in
>         my $req = HTTP::Request->new(GET =>
>             'http://www.reuters.com/finance/stocks/incomeStatement/detail?perType=ANN&symbol=BATS.L');
>         $cookie_jar->add_cookie_header($req);
>         $res = $ua->request($req);
>         #print $res->as_string;
>     } else {
>         print "Failed: ", $res->status_line, "\n";
>     }
>
> The cookie file only contains #LWP-Cookies-1.0. I'm currently trying to use
> the Live HTTP Headers add-on in Firefox to figure out what is being passed
> to and from the web server, but I am a bit out of my depth :(.
>
> Once I've done this for BATS I'm planning to get a few more pages for other
> stocks, so I'm guessing I'll want to create a session rather than create a
> new cookie / log in again for each page request! I also don't want to
> hammer their site; I gather one can use a 'sleep' command. Do you have any
> advice on this?
>
> I've managed to use HTML::TableExtract to get the tables I want from other
> reuters.com pages that didn't require the free logon, but no joy here! I
> started with C++/cURL/TidyLib/TinyXML but moved to Perl as it's so much
> easier to use!
> Once I have done this I'll want to call Perl from C++ so that I can pass my
> data into C++ objects. I've already looked into this and am finding it
> tricky (running a simple Perl script from C++ is fine, but calling Perl
> with modules such as LWP has not worked for me yet; I've read the docs but
> not managed to get the XS thing to run; Perl was saying it couldn't run
> dynamic code in this way). Does anyone know a good, easy-to-use Perl
> wrapper for C++? There are several, but they all seem to be from 2003 and
> I'm not sure they will still work.
>
> If some kind soul would help me out, or even suggest what I might need to
> read to find my solution, that would be very much appreciated!!
>
> Thanks,
>
> Carl

Hi Carl, if I read your post correctly, you're trying to scrape some data
from a website using the Perl LWP methods. That is a common task for Perl,
so may I suggest that you do some research on scraping and Perl. You will
find that there are several approaches to navigating the target site: your
user agent should be able to respond to the login request from the target
site, proceed to the next page the site presents, make selections from
drop-down boxes, fill in text entry fields, and press the submit buttons.
Check perl.com for some tutorials. WWW::Mechanize may be the module you're
looking for, or a combination of LWP::UserAgent and Expect.pm could be
hacked together.
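To make the WWW::Mechanize suggestion concrete, here is a minimal sketch
built on some loud assumptions: the login URL and the 'login'/'password'
field names are copied from Carl's POST above (the live form may use
different names, so check the page source first), the extra tickers are
made-up examples, and the two-second sleep is just a polite guess, not a
documented limit. Mechanize keeps its own cookie jar, so one login serves
the whole session and you don't need to log in again for each symbol:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Mechanize carries its own cookie jar, so logging in once is
# enough for every later request in the same session.
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->agent_alias('Windows Mozilla');

# Field names 'login' and 'password' are taken from Carl's POST;
# verify them against the real form with view-source.
$mech->get('https://commerce.us.reuters.com/login/pages/login/login.do');
$mech->submit_form(
    with_fields => {
        login    => 'Fredbumblebee',
        password => 'BzzZZZ!',
    },
);

# Fetch several symbols through the same logged-in session,
# sleeping between requests so the site is not hammered.
my @symbols = qw(BATS.L VOD.L BP.L);    # example tickers
for my $i ( 0 .. $#symbols ) {
    sleep 2 if $i > 0;    # no pause before the first request
    $mech->get( 'http://www.reuters.com/finance/stocks/incomeStatement/'
              . "detail?perType=ANN&symbol=$symbols[$i]" );
    # hand $mech->content to HTML::TableExtract here
}
```

If the site sets its session cookie on the redirect rather than on the
final page, Mechanize still picks it up, which is the usual reason a
hand-rolled extract_cookies() approach ends up with an empty jar.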
I have found, when I last wrote a scraping script, that it helped to
manually walk through each and every step: look at the source of each page,
record the form widgets' names and what they were supposed to contain, then
reproduce the same experience programmatically in the script.

Hope this helps,

Greg

--
To unsubscribe, e-mail: beginners-cgi-unsubscr...@perl.org
For additional commands, e-mail: beginners-cgi-h...@perl.org
http://learn.perl.org/