[Tutor] Using urllib to retrieve info
Kent,

Sorry, I should have put my code. This is what I wrote:

import urllib
import urllib2
f = urllib.urlopen("http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897").read()
newfile = open("newfile.html", 'w')
newfile.write(f)
newfile.close()
print 'finished'

It runs fine, but the file saved to disk is the information at 'http://support.mywork.co.uk', not 'http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897'. (By the way, the real URL is not http://support.mywork.co.uk, as I don't want my boss to know I am doing this, in case I can't get it to work, in which case he will say it was a waste of time. Of course, if I do get it to work, it will not be a waste of time.)

The bizarre thing is that for an address like 'http://www.linuxquestions.org/questions/showthread.php?s=&threadid=316298' it works okay. Does that make sense?

Thanks everyone for the help so far.

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
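The script above is Python 2 (urllib.urlopen; the urllib2 import is unused). For reference, a sketch of the Python 3 equivalent, using the placeholder URL from the thread:

```python
# Python 3 sketch of the script above. In Python 3, urllib.urlopen
# became urllib.request.urlopen, and the response body is bytes.
# The URL is the thread's placeholder, not a live address.
from urllib.request import urlopen

URL = ("http://support.mywork.co.uk/index.php"
       "?node=2371&pagetree=&fromid=20397&objectid=21897")

def save_page(url, filename):
    """Fetch url and write the raw response body to filename."""
    data = urlopen(url).read()       # bytes in Python 3
    with open(filename, "wb") as f:  # binary mode to match bytes
        f.write(data)

# Usage (performs a real HTTP request):
# save_page(URL, "newfile.html")
# print('finished')
```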
Re: [Tutor] Using urllib to retrieve info
David Holland wrote:
> Kent,
>
> Sorry, I should have put my code. This is what I wrote:
>
> import urllib
> import urllib2
> f = urllib.urlopen("http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897").read()
> newfile = open("newfile.html", 'w')
> newfile.write(f)
> newfile.close()
> print 'finished'
>
> It runs fine, but the file saved to disk is the
> information at 'http://support.mywork.co.uk',
> not
> 'http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897'

There is something strange with this site that has nothing to do with Python. Just playing around with the two URLs in Firefox, I get different results by reloading the page. If I try loading the two URLs with curl, I get the same thing for both. Possibly there is something going on with cookies; you might take a look at urllib2 and its support for cookies to see if you can get it working the way you want.

Kent
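Kent's cookie suggestion can be sketched like this. In Python 2 this used urllib2 plus cookielib; the Python 3 equivalents shown here are urllib.request and http.cookiejar:

```python
# Sketch: an opener with cookie support. Cookies set by one response
# are automatically sent back on later requests through this opener.
import http.cookiejar
import urllib.request

jar = http.cookiejar.CookieJar()  # in-memory cookie store
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar))

# Usage (performs real HTTP requests; any cookie set by the first
# response is replayed on the second):
# opener.open("http://support.mywork.co.uk/").read()
# page = opener.open("http://support.mywork.co.uk/index.php"
#                    "?node=2371&pagetree=&fromid=20397&objectid=21897").read()
```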
Re: [Tutor] Using urllib to retrieve info
> It runs fine, but the file saved to disk is the
> information at 'http://support.mywork.co.uk',
> not
> 'http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897'

Could there be cookies involved?

Just a thought,

Alan G.
Re: [Tutor] Using urllib to retrieve info
Hi all,

David, are you able to send us a screenshot of what you're trying to get? From your desired link I just get a bunch of ads, with another ad telling me it's for sale. When I open http://support.mywork.co.uk it looks exactly the same as http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897. So yeah, are there cookies? It looks to me like the site lost its domain.

You may also find that urllib's User-Agent setting causes some sites to flat out spit the dummy. Try google.com with urllib without changing the user agent; Google gets all huffy when rival bots attempt to spider it. I believe you can change the User-Agent using urllib, so you can easily masquerade as Internet Explorer or a Mozilla browser. From http://docs.python.org/lib/module-urllib.html:

"""
By default, the URLopener class sends a User-Agent: header of "urllib/VVV", where VVV is the urllib version number. Applications can define their own User-Agent: header by subclassing URLopener or FancyURLopener and setting the class attribute version to an appropriate string value in the subclass definition.
"""

That's another caveat when using urllib. But yeah, digression aside, the short answer is: I think your specified resource is dead, and a URL-camper has taken all URLs for that site and redirected them to a pop-up fest.

Regards,

Liam Clarke-Hutchinson

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Alan G
Sent: Tuesday, 9 August 2005 6:33 a.m.
To: David Holland; tutor python
Subject: Re: [Tutor] Using urllib to retrieve info

> It runs fine, but the file saved to disk is the
> information at 'http://support.mywork.co.uk',
> not
> 'http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897'

Could there be cookies involved?

Just a thought,

Alan G.
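The quoted docs describe the Python 2 approach of subclassing URLopener to change the version attribute. In Python 3, a Request with an explicit User-Agent header achieves the same effect; a sketch, where the UA string is just an illustrative example:

```python
# Sketch: sending a browser-like User-Agent instead of urllib's
# default "Python-urllib/x.y". The UA string below is only an
# example value, not a recommendation.
import urllib.request

req = urllib.request.Request(
    "http://www.google.com/",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})

# Usage (performs a real HTTP request):
# html = urllib.request.urlopen(req).read()
```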
Re: [Tutor] Using urllib to retrieve info
Alan,

Sorry, of course that is the problem. These pages are password protected. Is it possible to download password-protected pages? (I know the password, but I don't know how to get the program to use it.)

David

--- Alan G <[EMAIL PROTECTED]> wrote:
> > It runs fine, but the file saved to disk is the
> > information at 'http://support.mywork.co.uk',
> > not
> > 'http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897'
>
> Could there be cookies involved?
>
> Just a thought,
>
> Alan G.
Re: [Tutor] Using urllib to retrieve info
> Sorry, of course that is the problem. These pages are
> password protected.
> Is it possible to download password-protected pages? (I
> know the password, but I don't know how to get the program
> to use it.)

That will depend on how the protection is implemented. If your server is a J2EE box with full-blown Kerberos security, it may be almost impossible. If it's a more conventional server like Apache, which is protecting a file or folder, then you may only have to submit the username and password - i.e. log in! - or build that into a cookie.

If the protection is done by the application itself, then it could be anywhere in between those two extremes.

HTH,

Alan G.
Re: [Tutor] Using urllib to retrieve info
David Holland wrote:
> Alan,
>
> Sorry, of course that is the problem. These pages are
> password protected.
> Is it possible to download password-protected pages? (I
> know the password, but I don't know how to get the program
> to use it.)

urllib2 supports basic and digest authentication. There are examples using basic auth here:

http://docs.python.org/lib/urllib2-examples.html
http://www.voidspace.org.uk/python/recipebook.shtml#auth

If the server uses form-based auth, I think you will have to post to the form yourself, and maybe set up urllib2 to handle cookies as well. Look at the login form and you may be able to figure out how to post the results directly. The second link above also has a cookie example.

Kent
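Posting the login form yourself with cookie handling, as Kent describes, might be sketched like this in Python 3. The field names "username"/"password" and the login URL are assumptions; read the real names out of the site's login form:

```python
# Sketch of form-based login: POST the credentials, keep the
# session cookie, then fetch the protected page with the same
# opener. Field names and URLs below are hypothetical.
import http.cookiejar
import urllib.parse
import urllib.request

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar))

form = urllib.parse.urlencode({
    "username": "me",      # hypothetical field name from the form
    "password": "secret",  # hypothetical field name from the form
}).encode("ascii")         # POST bodies must be bytes in Python 3

# Usage (real HTTP requests; the first sets the session cookie):
# opener.open("http://support.mywork.co.uk/login.php", form)
# page = opener.open("http://support.mywork.co.uk/index.php"
#                    "?node=2371&pagetree=&fromid=20397&objectid=21897").read()
```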
Re: [Tutor] Using urllib to retrieve info
I had a look at urllib2 and I found this example:

import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/login.html')

One question: what does realm refer to, and also what does it mean by host?

Thanks in advance.

--- Alan G <[EMAIL PROTECTED]> wrote:
> > Sorry, of course that is the problem. These pages are
> > password protected.
> > Is it possible to download password-protected pages? (I
> > know the password, but I don't know how to get the program
> > to use it.)
>
> That will depend on how the protection is implemented.
> If your server is a J2EE box with full-blown Kerberos
> security, it may be almost impossible. If it's a more
> conventional server like Apache, which is protecting
> a file or folder, then you may only have to submit
> the username and password - i.e. log in! - or build
> that into a cookie.
>
> If the protection is done by the application itself, then
> it could be anywhere in between those two extremes.
>
> HTH,
>
> Alan G.
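On the two puzzling arguments: "realm" is the protection-space name the server announces in its "WWW-Authenticate: Basic realm=..." header on a 401 response, and "host" is the server (or URI prefix) the credentials apply to; the password manager uses both to decide when to send the credentials. A Python 3 sketch (urllib.request replaces urllib2), with illustrative values:

```python
# Sketch: basic auth in Python 3. "Secret Files" and example.com
# are placeholder values; the real realm string comes from the
# server's 401 WWW-Authenticate header.
import urllib.request

auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(
    realm="Secret Files",            # must match the server's realm
    uri="http://www.example.com/",   # host/URI the password is for
    user="username",
    passwd="password")
opener = urllib.request.build_opener(auth_handler)

# Usage (real HTTP request):
# urllib.request.install_opener(opener)
# urllib.request.urlopen("http://www.example.com/login.html")
```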