[Tutor] Using urllib to retrieve info

2005-08-08 Thread David Holland
Kent,

Sorry, I should have posted my code.
This is what I wrote:
import urllib
import urllib2
# Fetch the page and keep the returned HTML as a string
f = urllib.urlopen("http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897").read()
# Save the HTML to disk
newfile = open("newfile.html",'w')
newfile.write(f)
newfile.close()
print 'finished'

It runs fine but the file saved to disk is the
information at 'http://support.mywork.co.uk',
not
'http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897'
(By the way, the real URL is different from
http://support.mywork.co.uk,
as I don't want my boss to know I am doing this in
case I can't get it to work, in which case he will
say it was a waste of time.  Of course, if I do get it to work
it will not be a waste of time.)


The bizarre thing is that for an address like
'http://www.linuxquestions.org/questions/showthread.php?s=&threadid=316298'
it works okay.

Does that make sense?
Thanks everyone for the help so far.





___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Using urllib to retrieve info

2005-08-08 Thread Kent Johnson
David Holland wrote:
> Kent,
> 
> Sorry, I should have posted my code.
> This is what I wrote:
> import urllib
> import urllib2
> f = urllib.urlopen("http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897").read()
> newfile = open("newfile.html",'w')
> newfile.write(f)
> newfile.close()
> print 'finished'
> 
> It runs fine but the file saved to disk is the
> information at 'http://support.mywork.co.uk',
> not
> 'http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897'

There is something strange with this site that has nothing to do with Python. 
Just playing around with the two URLs in Firefox I get different results by 
reloading the page. If I try loading the two URLs with curl, I get the same 
thing for both.

Possibly there is something going on with cookies; you might take a look at 
urllib2 and its support for cookies to see if you can get it working the way 
you want.
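
As a rough illustration of the cookie suggestion, here is a minimal sketch written for Python 3, where the functionality of urllib2 and cookielib lives in urllib.request and http.cookiejar (the code in this thread is Python 2). The helper name build_cookie_opener is made up for this example, and the network call is left commented out because it is not known whether the site actually depends on cookies.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar). The opener stores cookies in jar and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = build_cookie_opener()

# Any Set-Cookie headers in a response fetched with this opener are kept in
# `jar` and sent back on later requests made with the same opener.
# html = opener.open("http://support.mywork.co.uk/").read()  # network call, not run here
```

Fetching the front page first and then the index.php URL with the same opener would show whether a session cookie changes what the server returns.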

Kent



Re: [Tutor] Using urllib to retrieve info

2005-08-08 Thread Alan G
> It runs fine but the file saved to disk is the
> information at 'http://support.mywork.co.uk',
> not
> 'http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897'

Could there be cookies involved?

Just a thought,

Alan G. 



Re: [Tutor] Using urllib to retrieve info

2005-08-08 Thread Liam Clarke-Hutchinson
Hi all, 

David, are you able to send us a screenshot of what you're trying to get? 
From your desired link I just get a bunch of ads with another ad telling me
it's for sale. 
When I open http://support.mywork.co.uk it looks exactly the same as 
http://support.mywork.co.uk/index.php?node=2371&pagetree=&fromid=20397&objectid=21897

So yeah, are there cookies? It looks to me like the site lost its domain.

You may also find that urllib's User-Agent setting causes some sites to flat
out spit the dummy.
Try google.com with urllib without changing the user agent; Google gets all
huffy when rival bots attempt to spider it. 

I believe you can change the User-Agent using urllib, 
so you can easily masquerade as Internet Explorer or a Mozilla browser. 

From http://docs.python.org/lib/module-urllib.html

"""
By default, the URLopener class sends a User-Agent: header of "urllib/VVV",
where VVV is the urllib version number. 
Applications can define their own User-Agent: header by subclassing
URLopener or FancyURLopener and 
setting the class attribute version to an appropriate string value in the
subclass definition.
"""

That's another caveat when using urllib. 
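
The documentation quoted above describes subclassing URLopener or FancyURLopener. The sketch below takes a different route and sets the header on a Request object, which is how it is usually done with urllib.request in Python 3 (the thread itself uses Python 2). The User-Agent string and the helper name make_request are invented for illustration.

```python
import urllib.request

# Hypothetical browser-like string, for illustration only.
USER_AGENT = "Mozilla/5.0 (compatible; ExampleFetcher/0.1)"


def make_request(url, user_agent=USER_AGENT):
    """Build a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = make_request("http://www.example.com/")

# Request stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))

# html = urllib.request.urlopen(req).read()  # network call, not run here
```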

But yeah, digression aside, short answer is, I think your specified resource
is dead, and a url-camper has
taken all urls for that site and redirected them to a pop-up fest. 

Regards, 


Liam Clarke-Hutchinson



Re: [Tutor] Using urllib to retrieve info

2005-08-09 Thread David Holland
Alan,

Sorry, of course that is the problem.  These pages are
password protected.
Is it possible to download password-protected pages? (I
know the password but I don't know how to get the program
to use it.)

David


Re: [Tutor] Using urllib to retrieve info

2005-08-09 Thread Alan G
> Sorry, of course that is the problem.  These pages are
> password protected.
> Is it possible to download password-protected pages? (I
> know the password but I don't know how to get the program
> to use it.)

That will depend on how the protection is implemented.
If your server is a J2EE box with full-blown Kerberos 
security it may be almost impossible. If it's a more 
conventional server like Apache which is protecting 
a file or folder then you may only have to submit 
the username and password (i.e. log in) or build 
that into a cookie.

If the protection is done by the application itself then 
it could be anywhere in between those two extremes.
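
One rough way to tell these cases apart is to look at how the server answers an unauthenticated request. The sketch below (Python 3 syntax) only interprets a status code and a WWW-Authenticate header value that you have already obtained. The function name describe_protection and its wording are made up for this example, and the redirect case is a guess, not a rule.

```python
def describe_protection(status, www_authenticate=None):
    """Guess how a page is protected from an HTTP status code and the
    value of the WWW-Authenticate response header (None if absent)."""
    if status == 401 and www_authenticate:
        # The first word of the header names the scheme: Basic, Digest,
        # Negotiate (used for Kerberos), and so on.
        scheme = www_authenticate.split(None, 1)[0]
        return "HTTP %s authentication" % scheme
    if status in (301, 302, 303, 307):
        return "redirect, possibly to an application login form"
    return "no HTTP-level challenge; any protection is in the application"


print(describe_protection(401, 'Basic realm="Support"'))
print(describe_protection(302))
```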

HTH,

Alan G.


Re: [Tutor] Using urllib to retrieve info

2005-08-09 Thread Kent Johnson
David Holland wrote:
> Alan,
> 
> Sorry, of course that is the problem.  These pages are
> password protected.
> Is it possible to download password-protected pages? (I
> know the password but I don't know how to get the program
> to use it.)

urllib2 supports basic and digest authentication. There are examples using 
basic auth here:
http://docs.python.org/lib/urllib2-examples.html
http://www.voidspace.org.uk/python/recipebook.shtml#auth

If the server uses form-based auth I think you will have to post to the form 
yourself and maybe set up urllib2 to handle cookies as well. Look at the login 
form and you may be able to figure out how to post the results directly. The 
second link above also has a cookie example.
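
For the form-based case, a minimal sketch might look like the following. It is written for Python 3 (urllib.request, urllib.parse and http.cookiejar in place of urllib2, urllib and cookielib). The login URL and the field names "username" and "password" are hypothetical; the real ones have to be read from the login form's HTML. The network calls are left commented out.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(login_url, fields):
    """Encode the form fields as a POST body for the login form."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    # Supplying data makes urllib send a POST instead of a GET.
    return urllib.request.Request(login_url, data=body)


# Hypothetical URL and field names, for illustration only.
login_request = build_login_request(
    "http://www.example.com/login.php",
    {"username": "david", "password": "secret"},
)

# One opener with a cookie jar, so that a session cookie set by the login
# response is sent along with the later request for the protected page.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar())
)

# opener.open(login_request)                  # log in (network call, not run here)
# page = opener.open(protected_url).read()    # protected_url is a placeholder
```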

Kent



Re: [Tutor] Using urllib to retrieve info

2005-08-09 Thread David Holland
I had a look at urllib2 and I found this example:

import urllib2
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
urllib2.urlopen('http://www.example.com/login.html')



One question: what does realm refer to, and what
does it mean by host?

Thanks in advance.
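
For reference on the two arguments: with HTTP Basic authentication the server's 401 response carries a header such as WWW-Authenticate: Basic realm="some name", and the realm is that quoted name. The host argument is the host name or base URL that the credentials apply to. The sketch below is written for Python 3 (urllib.request in place of urllib2) and sidesteps the realm by using HTTPPasswordMgrWithDefaultRealm, which accepts None to mean "any realm". The helper name build_basic_auth_opener and the example URL and credentials are placeholders.

```python
import urllib.request


def build_basic_auth_opener(base_url, username, password):
    """Return (opener, password_mgr) that answer Basic auth challenges for base_url."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # None as the realm means: use these credentials whatever realm
    # the server names in its challenge.
    password_mgr.add_password(None, base_url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler), password_mgr


opener, password_mgr = build_basic_auth_opener(
    "http://www.example.com/", "username", "password"
)

# The manager returns the stored pair for any URL under the base URL.
print(password_mgr.find_user_password("Any Realm", "http://www.example.com/login.html"))

# page = opener.open("http://www.example.com/login.html").read()  # network call, not run here
```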