Re: [Tutor] urllib ... lost novice's question

2017-05-10 Thread Alan Gauld via Tutor
On 10/05/17 17:06, Rafael Knuth wrote:
>>> Then, there is another package, along with a dozen other
>>> urllib-related packages (such as aiourllib).
>>
>> Again, where are you finding these? They are not in
>> the standard library. Have you been installing other
>> packages that may have their own versions maybe?
> 
> they are all available via PyCharm EDU

It looks like PyCharm may be adding extra packages to
the standard library. That's OK; both ActiveState and
Anaconda (and others) do the same, but it does mean
you need to check on python.org to see what is and
what isn't "approved".

If it's not official content then you need to ask on
a PyCharm forum about the preferred choices. The fact
they are included suggests that somebody has tested
them and found them useful in some way, but you would
need to ask them why they chose those packages and
when they would be more suitable than the standard
versions.

These bonus packages are often seen as a valuable
extra, but they do carry a burden of responsibility
for the user to identify which is best for them,
and that's not always easy to assess, especially
for a beginner.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] urllib ... lost novice's question

2017-05-10 Thread Rafael Knuth
>> Then, there is another package, along with a dozen other
>> urllib-related packages (such as aiourllib).
>
> Again, where are you finding these? They are not in
> the standard library. Have you been installing other
> packages that may have their own versions maybe?

they are all available via PyCharm EDU


Re: [Tutor] urllib ... lost novice's question

2017-05-09 Thread Mats Wichmann
This is one of those things where, if what you want is simple, they're all 
usable and easy. If not, some are frankly horrid.

requests is the current hot module; go ahead and try it. (Note that
urllib.request is not from requests, it's part of the standard library's urllib.)
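As a minimal sketch of that distinction, here is the stdlib pattern with the requests equivalent in comments. The data: URL is used only so the example runs without network access; requests is third party and must be installed separately.

```python
# Sketch: fetching with the standard library's urllib.request.
# With a real http(s) URL the pattern is identical.
from urllib.request import urlopen

def fetch(url):
    # urlopen returns a file-like response object; .read() yields
    # bytes, which usually need decoding before use as text.
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")

print(fetch("data:,hello"))  # hello

# The requests equivalent (pip install requests), where the
# decoding step is handled for you:
#   import requests
#   text = requests.get(url).text
```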

On May 8, 2017 9:23:15 AM MDT, Rafael Knuth  wrote:
>Which package should I use to fetch and open an URL?
>I am using Python 3.5 and there are presently 4 versions:
>
>urllib2
>urllib3
>urllib4
>urllib5
>
>Common sense is telling me to use the latest version.
>Not sure if my common sense is fooling me here though ;-)
>
>Then, there is another package, along with a dozen other
>urllib-related packages (such as aiourllib). I thought this one is
>doing what I need:
>
>urllib.request
>
>The latter I found on http://docs.python-requests.org along with these
>encouraging words:
>
>"Warning: Recreational use of the Python standard library for HTTP may
>result in dangerous side-effects, including: security vulnerabilities,
>verbose code, reinventing the wheel, constantly reading documentation,
>depression, headaches, or even death."
>
>How do I know where to find the right package - on python.org or
>elsewhere?
>I found some code samples that show how to use urllib.request, now I
>am trying to understand why I should use urllib.request.
>Would it be also doable to do requests using urllib5 or any other
>version? Like 2 or 3? Just trying to understand.
>
>I am lost here. Feeback appreciated. Thank you!
>
>BTW, here's some (working) exemplary code I have been using for
>educational purposes:
>
>import urllib.request
>from bs4 import BeautifulSoup
>
>theurl = "https://twitter.com/rafaelknuth;
>thepage = urllib.request.urlopen(theurl)
>soup = BeautifulSoup(thepage, "html.parser")
>
>print(soup.title.text)
>
>i = 1
>for tweets in soup.findAll("div", {"class": "content"}):
>    print(i)
>    print(tweets.find("p").text)
>    i = i + 1
>
>I am assuming there are different solutions for fetching and open URLs?
>Or is the above the only viable solution?

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [Tutor] urllib ... lost novice's question

2017-05-09 Thread Abdur-Rahmaan Janhangeer
As a side note, find a tutorial on urllib and requests and try them both at
the same time.

Make sure it is for Python 3.x (3.4 or 3.6, say).

Also note the data type returned by the different combinations, and when you
should use .read() etc.

Also handle the encoding, e.g. UTF-8/Unicode via .decode("utf8").
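That data-type point in two lines, as a sketch; a data: URL is used only so it runs offline, and the same two steps apply to a page fetched over HTTP:

```python
# What urlopen hands back: bytes first, text only after .decode().
from urllib.request import urlopen

resp = urlopen("data:,hello%20world")
raw = resp.read()            # bytes: b'hello world'
text = raw.decode("utf-8")   # str: 'hello world'
print(type(raw).__name__, type(text).__name__)  # bytes str
```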

Play around and mess with it freely now, so that when you do serious
work you won't need to experiment to find out what should be done and
what breaks.

Summary: learn it well from the beginning.

Finding the right package:

either, as part of your beginner learning path, you learn the popular
third-party modules,

or

you find out how people around the net did what you are doing, and see
how they did it and which modules they used,

or

you google for the module by topic,

or

you browse PyPI,

or, long term,

you never stop reading about Python, so you'll constantly discover new
things and reduce the probability of not knowing how to do something.

Hope it helps,

Abdur-Rahmaan Janhangeer
Vacoas,
Mauritius
https://abdurrahmaanjanhangeer.wordpress.com/


Re: [Tutor] urllib ... lost novice's question

2017-05-08 Thread Alan Gauld via Tutor
On 08/05/17 16:23, Rafael Knuth wrote:
> Which package should I use to fetch and open an URL?
> I am using Python 3.5 and there are presently 4 versions:
> 
> urllib2
> urllib3
> urllib4
> urllib5

I don't know where you are getting those from but the
standard install of Python v3.6 only has urllib. This
is a package with various modules inside.

ISTR there was a urllib2 in Python 2 for a while but
I've never heard of any 3, 4, or 5.

> Then, there is another package, along with a dozen other
> urllib-related packages (such as aiourllib). 

Again, where are you finding these? They are not in
the standard library. Have you been installing other
packages that may have their own versions maybe?

> urllib.request
> 
> The latter I found on http://docs.python-requests.org along with these
> encouraging words:
> 
> "Warning: Recreational use of the Python standard library for HTTP may
> result in dangerous side-effects, including: security vulnerabilities,
> verbose code, reinventing the wheel, constantly reading documentation,
> depression, headaches, or even death."

That's true of almost any package used badly.

Remember that this is "marketing" propaganda from an
alternative package maintainer. And while most folks
(including me) seem to agree that Requests is easier
to use than the standard library, the standard library
version works just fine if you take sensible care.

> How do I know where to find the right package

There is no "right" package, just the one you find most effective.
Most folks would say that Requests is easier to use than the
standard library; if you are doing anything non-trivial I'd
second that opinion.

> I found some code samples that show how to use urllib.request, now I
> am trying to understand why I should use urllib.request.

Because, as part of the standard library, you can be sure
it will be there, whereas Requests is a third-party module
that needs to be downloaded/installed and therefore may
not be present (or even allowed by the server admins).

Or maybe because you found some old code written before
Requests became popular and you need to integrate with
it or reuse it.

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




[Tutor] urllib ... lost novice's question

2017-05-08 Thread Rafael Knuth
Which package should I use to fetch and open an URL?
I am using Python 3.5 and there are presently 4 versions:

urllib2
urllib3
urllib4
urllib5

Common sense is telling me to use the latest version.
Not sure if my common sense is fooling me here though ;-)

Then, there is another package, along with a dozen other
urllib-related packages (such as aiourllib). I thought this one is
doing what I need:

urllib.request

The latter I found on http://docs.python-requests.org along with these
encouraging words:

"Warning: Recreational use of the Python standard library for HTTP may
result in dangerous side-effects, including: security vulnerabilities,
verbose code, reinventing the wheel, constantly reading documentation,
depression, headaches, or even death."

How do I know where to find the right package - on python.org or elsewhere?
I found some code samples that show how to use urllib.request, now I
am trying to understand why I should use urllib.request.
Would it be also doable to do requests using urllib5 or any other
version? Like 2 or 3? Just trying to understand.

I am lost here. Feeback appreciated. Thank you!

BTW, here's some (working) exemplary code I have been using for
educational purposes:

import urllib.request
from bs4 import BeautifulSoup

theurl = "https://twitter.com/rafaelknuth;
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")

print(soup.title.text)

i = 1
for tweets in soup.findAll("div", {"class": "content"}):
    print(i)
    print(tweets.find("p").text)
    i = i + 1

I am assuming there are different solutions for fetching and open URLs?
Or is the above the only viable solution?
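For the closing question about other solutions: BeautifulSoup is convenient, but for something as small as extracting a page title the standard library's html.parser is enough. A sketch; the class and sample HTML below are invented for illustration:

```python
# Stdlib-only title extraction with html.parser, as an alternative
# to BeautifulSoup for very simple scraping tasks.
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only collect text that appears inside <title>...</title>.
        if self.in_title:
            self.title += data

p = TitleParser()
p.feed("<html><head><title>Example</title></head></html>")
print(p.title)  # Example
```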


Re: [Tutor] urllib confusion

2014-11-23 Thread Cameron Simpson

On 21Nov2014 15:57, Clayton Kirkwood c...@godblessthe.us wrote:

Got a general problem with url work. I’ve struggled through a lot of
code which uses urllib.[parse,request]* and urllib2. First q: I read
someplace in urllib documentation which makes it sound like either
urllib or urllib2 modules are being deprecated in 3.5. Don’t know if

it’s only part or whole.

The names of the modules changed I believe in v3.x.


I don't think so because I've seen both lib and lib2 in both new and old code, 
and current 4.3 documentation talks only of urllib.


You mean 3.4 I would hope.

It is clear from this:

 https://docs.python.org/3/py-modindex.html#cap-u

that there is no urllib2 in Python 3, just urllib.

I recommend you read this:

 https://docs.python.org/3/whatsnew/3.0.html

which is a very useful overview of the main changes which came with Python 3, 
and covers almost all the structural changes (such as module renames); the 
3.0 release was the Big Change.



But you can save yourself a lot of trouble by using the excellent 3rd
party package called requests:
http://docs.python-requests.org/en/latest/


I've seen nothing of this.


You have now. It is very popular and widely liked.

Cheers,
Cameron Simpson c...@zip.com.au

'Supposing a tree fell down, Pooh, when we were underneath it?'
'Supposing it didn't,' said Pooh after careful thought.


Re: [Tutor] urllib confusion

2014-11-22 Thread Steven D'Aprano
On Fri, Nov 21, 2014 at 01:37:45PM -0800, Clayton Kirkwood wrote:

 Got a general problem with url work. I've struggled through a lot of code
 which uses urllib.[parse,request]* and urllib2. First q: I read someplace in
 urllib documentation which makes it sound like either urllib or urllib2
 modules are being deprecated in 3.5. Don't know if it's only part or whole.

Can you point us to this place? I would be shocked and rather dismayed 
to hear that urllib(2) was being deprecated, but it is possible that one 
small component is being renamed/moved/deprecated.

 I've read through a lot that says that urllib..urlopen needs urlencode,
 and/or encode('utf-8') for byte conversion, but I've seen plenty of examples
 where nothing is being encoded either way. I also have a sneeking suspicious
 that urllib2 code does all of the encoding. I've read that if things aren't
 encoded that I will get TypeError, yet I've seen plenty of examples where
 there is no error and no encoding.

It's hard to comment on things you've read when we don't know what they 
are or precisely what they say. "I read that..." is the equivalent of "a 
man down the pub told me...".

If the examples are all ASCII, then no charset encoding is 
needed, although urlencode will still perform percent-encoding:

py> from urllib.parse import urlencode
py> urlencode({'key': '<value>'})
'key=%3Cvalue%3E'

The characters '<' and '>' are not legal inside URLs, so they have to be 
encoded as '%3C' and '%3E'. Because all the characters are ASCII, the 
result remains untouched.

Non-ASCII characters, on the other hand, are encoded into UTF-8 by 
default, although you can pick another encoding and/or error handler:

py> urlencode({'key': '© 2014'})
'key=%C2%A9+2014'

The copyright symbol © encoded into UTF-8 is the two bytes 
\xC2\xA9 which are then percent encoded into %C2%A9.
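Those examples as a runnable sketch; it also shows that urlencode spells a space as '+' while urllib.parse.quote uses '%20':

```python
# Percent-encoding in practice: reserved ASCII characters such as
# '<' and '>' become %3C / %3E, and non-ASCII text is UTF-8 encoded
# first, then percent-encoded.
from urllib.parse import urlencode, quote

print(urlencode({"key": "<value>"}))  # key=%3Cvalue%3E
print(urlencode({"key": "© 2014"}))   # key=%C2%A9+2014 (space becomes '+')
print(quote("© 2014"))                # %C2%A9%202014 (space becomes '%20')
```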


 Why do so many examples seem to not encode? And not get TypeError? And yes,
 for those of you who are about to suggest it, I have tried a lot of things
 and read for many hours.

One actual example is worth about a thousand vague descriptions.

But in general, I would expect that the urllib functions default to 
using UTF-8 as the encoding, so you don't have to manually specify an 
encoding, it just works.


-- 
Steven


[Tutor] urllib confusion

2014-11-21 Thread Clayton Kirkwood
Hi all.

 

Got a general problem with url work. I've struggled through a lot of code
which uses urllib.[parse,request]* and urllib2. First q: I read someplace in
urllib documentation which makes it sound like either urllib or urllib2
modules are being deprecated in 3.5. Don't know if it's only part or whole.

I've read through a lot that says that urllib..urlopen needs urlencode,
and/or encode('utf-8') for byte conversion, but I've seen plenty of examples
where nothing is being encoded either way. I also have a sneeking suspicious
that urllib2 code does all of the encoding. I've read that if things aren't
encoded that I will get TypeError, yet I've seen plenty of examples where
there is no error and no encoding.

 

Why do so many examples seem to not encode? And not get TypeError? And yes,
for those of you who are about to suggest it, I have tried a lot of things
and read for many hours.

 

Thanks,

 

Clayton

 

 

 

You can tell the caliber of a man by his gun--c. kirkwood

 



Re: [Tutor] urllib confusion

2014-11-21 Thread Joel Goldstick
On Fri, Nov 21, 2014 at 4:37 PM, Clayton Kirkwood c...@godblessthe.us wrote:
 Hi all.



 Got a general problem with url work. I’ve struggled through a lot of code
 which uses urllib.[parse,request]* and urllib2. First q: I read someplace in
 urllib documentation which makes it sound like either urllib or urllib2
 modules are being deprecated in 3.5. Don’t know if it’s only part or whole.

The names of the modules changed I believe in v3.x.

But you can save yourself a lot of trouble by using the excellent 3rd
party package called requests:
http://docs.python-requests.org/en/latest/

Also, please use plain text for your questions.  That way everyone can
read them, and the indentation won't get mangled.

 I’ve read through a lot that says that urllib..urlopen needs urlencode,
 and/or encode(‘utf-8’) for byte conversion, but I’ve seen plenty of examples
 where nothing is being encoded either way. I also have a sneeking suspicious
 that urllib2 code does all of the encoding. I’ve read that if things aren’t
 encoded that I will get TypeError, yet I’ve seen plenty of examples where
 there is no error and no encoding.



 Why do so many examples seem to not encode? And not get TypeError? And yes,
 for those of you who are about to suggest it, I have tried a lot of things
 and read for many hours.



 Thanks,



 Clayton







 You can tell the caliber of a man by his gun--c. kirkwood








-- 
Joel Goldstick
http://joelgoldstick.com


Re: [Tutor] urllib confusion

2014-11-21 Thread Alan Gauld

On 21/11/14 21:37, Clayton Kirkwood wrote:


urllib or urllib2 modules are being deprecated in 3.5. Don’t know if
it’s only part or whole.


urllib2 doesn't exist in Python 3; there is only the urllib package.

As to urllib being deprecated, that's the first I've heard of
it but it may be the case - I don't follow the new releases closely 
since I'm usually at least 2 releases behind. I only upgraded to 3.4 
because I was writing the new book and needed it to be as current as 
possible.


But the What's New document for the 3.5 alpha says:

A new urllib.request.HTTPBasicPriorAuthHandler allows HTTP Basic 
Authentication credentials to be sent unconditionally with the first 
HTTP request, rather than waiting for a HTTP 401 Unauthorized response 
from the server. (Contributed by Matej Cepl in issue 19494.)


And the NEWS file adds:

urllib.request.urlopen will accept a context object
 (SSLContext) as an argument which will then be used
for HTTPS connections.  Patch by Alex Gaynor.

Which suggests urllib is alive and kicking...


I’ve read through a lot that says that urllib..urlopen needs urlencode,
and/or encode(‘utf-8’) for byte conversion, but I’ve seen plenty of
examples where nothing is being encoded either way.


Might those be v2 examples?
encoding got a whole lot more specific in Python v3.

But I'm not sure what you mean by the double dot.
urllib.urlopen is discontinued in Python3. You
should be using urllib.request.urlopen instead.
(But maybe thats what you meant by the ..?)


Why do so many examples seem to not encode? And not get TypeError?


Without specific examples it's hard to know.


--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: [Tutor] urllib confusion

2014-11-21 Thread Clayton Kirkwood


-Original Message-
From: Joel Goldstick [mailto:joel.goldst...@gmail.com]
Sent: Friday, November 21, 2014 2:39 PM
To: Clayton Kirkwood
Cc: tutor@python.org
Subject: Re: [Tutor] urllib confusion

On Fri, Nov 21, 2014 at 4:37 PM, Clayton Kirkwood c...@godblessthe.us
wrote:
 Hi all.



 Got a general problem with url work. I’ve struggled through a lot of
 code which uses urllib.[parse,request]* and urllib2. First q: I read
 someplace in urllib documentation which makes it sound like either
 urllib or urllib2 modules are being deprecated in 3.5. Don’t know if
it’s only part or whole.

The names of the modules changed I believe in v3.x.

I don't think so because I've seen both lib and lib2 in both new and old code, 
and current 4.3 documentation talks only of urllib.


But you can save yourself a lot of trouble by using the excellent 3rd
party package called requests:
http://docs.python-requests.org/en/latest/

I've seen nothing of this.


Also, please use plaintext for your questions.  That way everyone can
read them, and the indentation won't get mangled

 I’ve read through a lot that says that urllib..urlopen needs
 urlencode, and/or encode(‘utf-8’) for byte conversion, but I’ve seen
 plenty of examples where nothing is being encoded either way. I also
 have a sneeking suspicious that urllib2 code does all of the encoding.
 I’ve read that if things aren’t encoded that I will get TypeError, yet
 I’ve seen plenty of examples where there is no error and no encoding.



 Why do so many examples seem to not encode? And not get TypeError? And
 yes, for those of you who are about to suggest it, I have tried a lot
 of things and read for many hours.



 Thanks,



 Clayton







 You can tell the caliber of a man by his gun--c. kirkwood








--
Joel Goldstick
http://joelgoldstick.com





[Tutor] Urllib Problem

2011-07-29 Thread George Anonymous
I am trying to make a simple program with Python 3 that tries to open
different pages from a wordlist and prints which are alive. Here is the code:
from urllib import request
fob=open('c:/passwords/pass.txt','r')
x = fob.readlines()
for i in x:
urllib.request.openurl('www.google.gr/' + i)

But it doesn't work. What's the problem?


Re: [Tutor] Urllib Problem

2011-07-29 Thread Karim

On 07/29/2011 11:52 AM, George Anonymous wrote:
I am trying to make a simple programm with Python 3,that tries to open 
differnet pages from a wordlist and prints which are alive.Here is the 
code:

from urllib import request
fob=open('c:/passwords/pass.txt','r')
x = fob.readlines()
for i in x:
urllib.request.openurl('www.google.gr/' + i)

But it doesent work.Whats the problem?



Please give us the exception/error you get.
Also, the HTTP header should contain the
error status code, which gives you the
failure answer from the server.

Cheers
Karim
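A sketch of what the original poster seems to want, with the error handling made explicit. The check() helper is invented for this example; the wordlist path and base URL in the comment are the poster's own:

```python
# Probe a URL and report its HTTP status; None means no answer at all.
from urllib import request, error

def check(url):
    try:
        with request.urlopen(url, timeout=5) as resp:
            return resp.getcode()   # e.g. 200 when the page answers
    except error.HTTPError as e:
        return e.code               # server replied with an error status
    except error.URLError:
        return None                 # DNS failure, refused connection, ...

# The poster's loop would then become something like:
#   for word in open('c:/passwords/pass.txt'):
#       print(word.strip(), check('http://www.google.gr/' + word.strip()))
```

Note also that the function is urlopen; there is no openurl in urllib.request.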





Re: [Tutor] Urllib Problem

2011-07-29 Thread Alexander
On Fri, Jul 29, 2011 at 5:58 AM, Karim karim.liat...@free.fr wrote:

 On 07/29/2011 11:52 AM, George Anonymous wrote:

 I am trying to make a simple programm with Python 3,that tries to open
 differnet pages from a wordlist and prints which are alive.Here is the code:
 from urllib import request
 fob=open('c:/passwords/pass.txt','r')
 x = fob.readlines()
 for i in x:
 urllib.request.openurl('www.google.gr/' + i)

 But it doesent work.Whats the problem?


 Please give the exception error you get?!
 And you should have in the html header
 the html code error number which gives
 you the fail answer from the server.

 Cheers
 Karim

 As Karim noted, you'll want to mention any exceptions you are getting. I'm
not sure what it is you are trying to do with your code. If you'd like to
process each line and handle any exception that occurs, the
code may read something similar to:

fob = open('C:/passwords/pass.txt', 'r')
fob_rlines = fob.readlines()
for line in fob_rlines:
    try:
        pass  # whatever it is you would like to do with each line
    except Exception:  # the code didn't work and an exception occurred
        pass  # whatever you would like to do when that Exception occurs
Hope that helps,
Alexander






Re: [Tutor] Urllib Problem

2011-07-29 Thread Steven D'Aprano

George Anonymous wrote:

I am trying to make a simple programm with Python 3,that tries to open
differnet pages from a wordlist and prints which are alive.Here is the code:
from urllib import request
fob=open('c:/passwords/pass.txt','r')
x = fob.readlines()
for i in x:
urllib.request.openurl('www.google.gr/' + i)

But it doesent work.Whats the problem?



A guessing game! I LOVE guessing games!!! :)

Let's see... let me guess what you mean by "doesn't work":

- the computer locks up and sits there until you hit the restart switch
- the computer gives a Blue Screen Of Death
- Python raises an exception
- Python downloads the Yahoo website instead of Google
- something else


My guess is... you're getting a NameError exception, like this one:


>>> from urllib import request
>>> x = urllib.request.openurl('www.google.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'urllib' is not defined


Am I close?


You need to use request.urlopen, not urllib.request.openurl.

That's your *first* problem. There are more. Come back if you need help 
with the others, and next time, don't make us play guessing games. Show 
us the code you use -- copy and paste it, don't retype it from memory -- 
what you expect should happen, and what actually happens instead.





--
Steven



[Tutor] urllib problem

2010-10-12 Thread Roelof Wobben


Hi, 
 
I have this program:
 
import urllib
import re
f = urllib.urlopen("http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=6")
inhoud = f.read()
f.close()
nummer = re.search('[0-9]', inhoud)
volgende = int(nummer.group())
teller = 1
while teller <= 3 :
  url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=" + str(volgende)
  f = urllib.urlopen(url)
  inhoud = f.read()
  f.close()
  nummer = re.search('[0-9]', inhoud)
  print "nummer is", nummer.group()
  volgende = int(nummer.group())
  print volgende
  teller = teller + 1
 
but now the url changes but volgende does not.
 
What have I done wrong?
 
Roelof 
  


Re: [Tutor] urllib problem

2010-10-12 Thread Evert Rol
 I have this program :
 
 import urllib
 import re
 f = urllib.urlopen("http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=6")
 inhoud = f.read()
 f.close()
 nummer = re.search('[0-9]', inhoud)
 volgende = int(nummer.group())
 teller = 1
 while teller <= 3 :
   url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=" + str(volgende)
   f = urllib.urlopen(url)
   inhoud = f.read()
   f.close()
   nummer = re.search('[0-9]', inhoud)
   print "nummer is", nummer.group()
   volgende = int(nummer.group())
   print volgende
   teller = teller + 1
 
 but now the url changes but volgende not.

I think number will change; *unless* you happen to retrieve the same number 
every time, even when you access a different url.
What is the result when you run this program, ie, the output of your print 
statements (then, also, print url)?
And, how can url change, but volgende not? Since url depends on volgende.

Btw, it may be better to use parentheses in your regular expression to 
explicitly group whatever you want to match, though the above will work (since 
it groups the whole match). But Python has this "explicit is better than 
implicit" thing.

Cheers,

  Evert



Re: [Tutor] urllib problem

2010-10-12 Thread Steven D'Aprano
On Tue, 12 Oct 2010 11:40:17 pm Roelof Wobben wrote:
 Hoi,

 I have this programm :

 import urllib
 import re
 f = urllib.urlopen("http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=6")
 inhoud = f.read()
 f.close()
 nummer = re.search('[0-9]', inhoud)
 volgende = int(nummer.group())
 teller = 1
 while teller <= 3 :
   url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=" + str(volgende)
   f = urllib.urlopen(url)
   inhoud = f.read()
   f.close()
   nummer = re.search('[0-9]', inhoud)
   print "nummer is", nummer.group()
   volgende = int(nummer.group())
   print volgende
   teller = teller + 1

 but now the url changes but volgende not.
 What do I have done wrong ?

Each time through the loop, you set volgende to the same result:

nummer = re.search('[0-9]', inhoud)
volgende = int(nummer.group())

Since inhoud never changes, and the search never changes, the search 
result never changes, and volgende never changes.



-- 
Steven D'Aprano


Re: [Tutor] urllib problem

2010-10-12 Thread Steven D'Aprano
On Tue, 12 Oct 2010 11:58:03 pm Steven D'Aprano wrote:
 On Tue, 12 Oct 2010 11:40:17 pm Roelof Wobben wrote:
  Hoi,
 
  I have this programm :
 
  import urllib
  import re
  f = urllib.urlopen("http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=6")
  inhoud = f.read()
  f.close()
  nummer = re.search('[0-9]', inhoud)
  volgende = int(nummer.group())
  teller = 1
  while teller <= 3 :
    url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=" + str(volgende)
    f = urllib.urlopen(url)
    inhoud = f.read()
    f.close()
    nummer = re.search('[0-9]', inhoud)
    print "nummer is", nummer.group()
    volgende = int(nummer.group())
    print volgende
    teller = teller + 1
 
  but now the url changes but volgende not.
  What do I have done wrong ?

 Each time through the loop, you set volgende to the same result:

 nummer = re.search('[0-9]', inhoud)
 volgende = int(nummer.group())

 Since inhoud never changes, and the search never changes, the search
 result never changes, and volgende never changes.

Wait, sorry, inhoud should change... I missed the line inhoud = f.read()

My mistake, sorry about that. However, I can now see what is going 
wrong. Your regular expression only looks for a single digit:

re.search('[0-9]', inhoud)

If you want any number of digits, you need '[0-9]+' instead.


Starting from the first URL:

>>> f = urllib.urlopen(
... "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=6")
>>> inhoud = f.read()
>>> f.close()
>>> print inhoud
and the next nothing is 87599


but:

>>> nummer = re.search('[0-9]', inhoud)
>>> nummer.group()
'8'

See, you only get the first digit. Then looking up the page with 
nothing=8 gives a first digit starting with 5, and then you get stuck 
on 5 forever:

>>> urllib.urlopen(
... "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=8").read()
'and the next nothing is 59212'
>>> urllib.urlopen(
... "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=5").read()
'and the next nothing is 51716'


You need to add a + to the regular expression, which means "one or more 
digits" instead of "a single digit".
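The difference as a quick runnable check, using the string from the session above:

```python
# '[0-9]' matches a single digit; '[0-9]+' matches the whole run.
import re

inhoud = "and the next nothing is 87599"
print(re.search('[0-9]', inhoud).group())   # 8
print(re.search('[0-9]+', inhoud).group())  # 87599
```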



-- 
Steven D'Aprano


Re: [Tutor] urllib problem

2010-10-12 Thread Roelof Wobben




 From: st...@pearwood.info
 To: tutor@python.org
 Date: Tue, 12 Oct 2010 23:58:03 +1100
 Subject: Re: [Tutor] urllib problem

 On Tue, 12 Oct 2010 11:40:17 pm Roelof Wobben wrote:
 Hoi,

 I have this programm :

 import urllib
 import re
 f = urllib.urlopen("http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=6")
 inhoud = f.read()
 f.close()
 nummer = re.search('[0-9]', inhoud)
 volgende = int(nummer.group())
 teller = 1
 while teller <= 3 :
   url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=" + str(volgende)
   f = urllib.urlopen(url)
   inhoud = f.read()
   f.close()
   nummer = re.search('[0-9]', inhoud)
   print "nummer is", nummer.group()
   volgende = int(nummer.group())
   print volgende
   teller = teller + 1

 but now the url changes but volgende not.
 What do I have done wrong ?

 Each time through the loop, you set volgende to the same result:

 nummer = re.search('[0-9]', inhoud)
 volgende = int(nummer.group())

 Since inhoud never changes, and the search never changes, the search
 result never changes, and volgende never changes.



 --
 Steven D'Aprano
 ___
 Tutor maillist - Tutor@python.org
 To unsubscribe or change subscription options:
 http://mail.python.org/mailman/listinfo/tutor
 
 
Hello, 
 
Here is the output when I print every step in the beginning :
 
inhoud : and the next nothing is 87599
nummer is 8
volgende is  8
 
and here is the output in the loop :
 
 
url is: http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=8
inhoud is and the next nothing is 59212
nummer is 5

 
2ste run:
url is: http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=5
inhoud is and the next nothing is 51716
nummer is 5

3ste run:
url is: http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=5
inhoud is and the next nothing is 51716
nummer is 5

4ste run:

I see the problem. It only takes the first number of the nothing.
So I have to look how to solve that.
 
Roelof

  
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] urllib problem

2010-10-12 Thread Roelof Wobben




 From: st...@pearwood.info
 To: tutor@python.org
 Date: Wed, 13 Oct 2010 01:51:16 +1100
 Subject: Re: [Tutor] urllib problem

 On Tue, 12 Oct 2010 11:58:03 pm Steven D'Aprano wrote:
  On Tue, 12 Oct 2010 11:40:17 pm Roelof Wobben wrote:
   Hoi,
  
   I have this programm :
  
   import urllib
   import re
   f = urllib.urlopen(
   "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=6")
   inhoud = f.read()
   f.close()
   nummer = re.search('[0-9]', inhoud)
   volgende = int(nummer.group())
   teller = 1
   while teller <= 3 :
   url = "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=" +
   str(volgende)
   f = urllib.urlopen(url)
   inhoud = f.read()
   f.close()
   nummer = re.search('[0-9]', inhoud)
   print "nummer is", nummer.group()
   volgende = int(nummer.group())
   print volgende
   teller = teller + 1
  
   but now the url changes but volgende not.
   What do I have done wrong ?
 
  Each time through the loop, you set volgende to the same result:
 
  nummer = re.search('[0-9]', inhoud)
  volgende = int(nummer.group())
 
  Since inhoud never changes, and the search never changes, the search
  result never changes, and volgende never changes.

 Wait, sorry, inhoud should change... I missed the line inhoud = f.read()

 My mistake, sorry about that. However, I can now see what is going
 wrong. Your regular expression only looks for a single digit:

 re.search('[0-9]', inhoud)

 If you want any number of digits, you need '[0-9]+' instead.


 Starting from the first URL:

  >>> f = urllib.urlopen(
  ... "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=6")
  >>> inhoud = f.read()
  >>> f.close()
  >>> print inhoud
  and the next nothing is 87599


 but:

  >>> nummer = re.search('[0-9]', inhoud)
  >>> nummer.group()
  '8'

 See, you only get the first digit. Then looking up the page with
 nothing=8 gives a first digit starting with 5, and then you get stuck
 on 5 forever:

  >>> urllib.urlopen(
  ... "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=8").read()
  'and the next nothing is 59212'
  >>> urllib.urlopen(
  ... "http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=5").read()
  'and the next nothing is 51716'


 You need to add a + to the regular expression, which means one or more
 digits instead of a single digit.



 --
 Steven D'Aprano
 ___
 Tutor maillist - Tutor@python.org
 To unsubscribe or change subscription options:
 http://mail.python.org/mailman/listinfo/tutor

 
Hoi Steven, 
 
Finally solved this puzzle.
Now the next one of the 33 puzzles.
 
Roelof
  
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] urllib problem

2010-10-12 Thread Alan Gauld


Roelof Wobben rwob...@hotmail.com wrote


Finally solved this puzzle.
Now the next one of the 33 puzzles.


Don't be surprised if you get stuck. Python Challenge is quite tricky
and is deliberately designed to make you explore parts of the
standard library you might not otherwise find. Expect to do a lot
of reading in the documentation.

It's really targeted at intermediate rather than novice
programmers IMHO.

--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/


___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] urllib

2009-12-07 Thread Jojo Mwebaze
thanks, Senthil

On Mon, Dec 7, 2009 at 11:10 AM, Senthil Kumaran orsent...@gmail.com wrote:

 On Mon, Dec 07, 2009 at 08:38:24AM +0100, Jojo Mwebaze wrote:
  I need help on something very small...
 
  i am using urllib to write a query and what i want returned is
 'FHI=128%2C128&FLO=1%2C1'
 

 The way to use urllib.urlencode is like this:

  >>> urllib.urlencode({'key': 'value'})
 'key=value'
  >>> urllib.urlencode({'key': 'value', 'key2': 'value2'})
 'key2=value2&key=value'

 For your purposes, you need to construct the dict this way:

  >>> urllib.urlencode({'FHI': '128,128', 'FHO': '1,1'})
 'FHO=1%2C1&FHI=128%2C128'
 


 And if you are to use variables, one way to do it would be:

  >>> x1,y1,x2,y2 = 1,1,128,128
  >>> fhi = str(x2) + ',' + str(y2)
  >>> fho = str(x1) + ',' + str(y1)
  >>> urllib.urlencode({'FHI': fhi, 'FHO': fho})
 'FHO=1%2C1&FHI=128%2C128'
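For reference, the same call lives in urllib.parse in Python 3; a small sketch of the behaviour described above:

```python
from urllib.parse import urlencode

# Commas are still percent-encoded as %2C and pairs joined with '&'.
# Dicts keep insertion order in Python 3.7+:
print(urlencode({'FHI': '128,128', 'FLO': '1,1'}))
# FHI=128%2C128&FLO=1%2C1

# A list of tuples pins down the pair order explicitly:
print(urlencode([('FHI', '128,128'), ('FLO', '1,1')]))
```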

 --
 Senthil

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-07 Thread Kent Johnson
On Mon, Jul 6, 2009 at 5:54 PM, David Kimdavidki...@gmail.com wrote:
 Hello all,

 I have two questions I'm hoping someone will have the patience to
 answer as an act of mercy.

 I. How to get past a Terms of Service page?

 I've just started learning python (have never done any programming
 prior) and am trying to figure out how to open or download a website
 to scrape data. The only problem is, whenever I try to open the link
 (via urllib2, for example) I'm after, I end up getting the HTML to a
 Terms of Service Page (where one has to click an I Agree button)
 rather than the actual target page.

 I've seen examples on the web on providing data for forms (typically
 by finding the name of the form and providing some sort of dictionary
 to fill in the form fields), but this simple act of getting past I
 Agree is stumping me. Can anyone save my sanity? As a workaround,
 I've been using os.popen('curl ' + url + ' > ' + filename) to save the html
 in a txt file for later processing. I have no idea why curl works and
 urllib2, for example, doesn't (I use OS X).

curl works because it ignores the redirect to the ToS page, and the
site is (astoundingly) dumb enough to serve the content with the
redirect. You could make urllib2 behave the same way by defining a 302
handler that does nothing.

 I even tried to use Yahoo
 Pipes to try and sidestep coding anything altogether, but ended up
 looking at the same Terms of Service page anyway.

 Here's the code (tho it's probably not that illuminating since it's
 basically just opening a url):

 import urllib2
 url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'
 #the first of 23 tables
 html = urllib2.urlopen(url).read()

Generally you have to post to the same url as the form, giving the
same data the form does. You can inspect the source of the form to
figure this out. In this case the form is
<form method="post" action="/products/consent.php">
<input type="hidden" value="tiwd/products/derivserv/data_table_i.php"
name="urltarget"/>
<input type="hidden" value="1" name="check_one"/>
<input type="hidden" value="tiwdata" name="tag"/>
<input type="submit" value="I Agree" name="acknowledgement"/>
<input type="submit" value="Decline" name="acknowledgement"/>
</form>

You generally need to enable cookie support in urllib2 as well,
because the site will use a cookie to flag that you saw the consent
form. This tutorial shows how to enable cookies and submit form data:
http://personalpages.tds.net/~kent37/kk/00010.html
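A sketch of both steps together, using the Python 3 names (urllib2 and cookielib became urllib.request and http.cookiejar). The field names come from the form above; the final open is left commented out because it needs network access:

```python
import urllib.request
import urllib.parse
from http.cookiejar import CookieJar

# Opener that stores and resends cookies, so the site remembers the consent.
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# POST body with the same fields the consent form submits.
form = urllib.parse.urlencode({
    'urltarget': 'tiwd/products/derivserv/data_table_i.php',
    'check_one': '1',
    'tag': 'tiwdata',
    'acknowledgement': 'I Agree',
}).encode()

# response = opener.open('http://www.dtcc.com/products/consent.php', form)
# html = response.read()
```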

Kent
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-07 Thread Sander Sweers
2009/7/7 David Kim davidki...@gmail.com:
 opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor)
 urllib2.install_opener(opener)

 response = 
 urllib2.urlopen("http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1")
 print response.read()
 

 I suspect I am not understanding something basic about how urllib2
 deals with this redirect issue since it seems everything I try gives
 me the same ToS page.

Indeed, you create the opener but then you do not use it. Try the
below and it should work.
  response = 
opener.open("http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1")
  data = response.read()

Greets
Sander
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-07 Thread Kent Johnson
On Tue, Jul 7, 2009 at 1:20 PM, David Kimdavidki...@gmail.com wrote:
 On Tue, Jul 7, 2009 at 7:26 AM, Kent Johnsonken...@tds.net wrote:

 curl works because it ignores the redirect to the ToS page, and the
 site is (astoundingly) dumb enough to serve the content with the
 redirect. You could make urllib2 behave the same way by defining a 302
 handler that does nothing.

 Many thanks for the redirect pointer! I also found
 http://diveintopython.org/http_web_services/redirects.html. Is the
 handler class on this page what you mean by a handler that does
 nothing? (It looks like it exposes the error code but still follows
 the redirect).

No, all of those examples are handling the redirect. The
SmartRedirectHandler just captures additional status. I think you need
something like this:
class IgnoreRedirectHandler(urllib2.HTTPRedirectHandler):
def http_error_301(self, req, fp, code, msg, headers):
return None

def http_error_302(self, req, fp, code, msg, headers):
return None
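In current Python the same idea reads as follows (urllib2 became urllib.request in Python 3); returning None from the redirect hooks makes the opener raise HTTPError on a 301/302 instead of silently following it:

```python
import urllib.request

class IgnoreRedirectHandler(urllib.request.HTTPRedirectHandler):
    # Returning None refuses the redirect, so a 301/302 surfaces as an
    # HTTPError instead of silently fetching the redirect target.
    def http_error_301(self, req, fp, code, msg, headers):
        return None
    http_error_302 = http_error_301

opener = urllib.request.build_opener(IgnoreRedirectHandler)
```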

 I guess i'm still a little confused since, if the
 handler does nothing, won't I still go to the ToS page?

No, it is the action of the handler, responding to the redirect
request, that causes the ToS page to be fetched.

 For example, I ran the following code (found at
 http://stackoverflow.com/questions/554446/how-do-i-prevent-pythons-urllib2-from-following-a-redirect)

That is pretty similar to the DiP code...

 I suspect I am not understanding something basic about how urllib2
 deals with this redirect issue since it seems everything I try gives
 me the same ToS page.

Maybe you don't understand how redirect works in general...

 Generally you have to post to the same url as the form, giving the
 same data the form does. You can inspect the source of the form to
 figure this out. In this case the form is

 <form method="post" action="/products/consent.php">
 <input type="hidden" value="tiwd/products/derivserv/data_table_i.php"
 name="urltarget"/>
 <input type="hidden" value="1" name="check_one"/>
 <input type="hidden" value="tiwdata" name="tag"/>
 <input type="submit" value="I Agree" name="acknowledgement"/>
 <input type="submit" value="Decline" name="acknowledgement"/>
 </form>

 You generally need to enable cookie support in urllib2 as well,
 because the site will use a cookie to flag that you saw the consent
 form. This tutorial shows how to enable cookies and submit form data:
 http://personalpages.tds.net/~kent37/kk/00010.html

 I have seen the login examples where one provides values for the
 fields username and password (thanks Kent). Given the form above,
 however, it's unclear to me how one POSTs the form data when you
 aren't actually passing any parameters. Perhaps this is less of a
 Python question and more an http question (which unfortunately I know
 nothing about either).

Yes, the parameters are listed in the form.

If you don't have at least a basic understanding of HTTP and HTML you
are going to have trouble with this project...

Kent
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-07 Thread David Kim
Thanks Kent, perhaps I'll cool the Python jets and move on to HTTP and
HTML. I was hoping it would be something I could just pick up along
the way, looks like I was wrong.

dk

On Tue, Jul 7, 2009 at 1:56 PM, Kent Johnsonken...@tds.net wrote:
 On Tue, Jul 7, 2009 at 1:20 PM, David Kimdavidki...@gmail.com wrote:
 On Tue, Jul 7, 2009 at 7:26 AM, Kent Johnsonken...@tds.net wrote:

 curl works because it ignores the redirect to the ToS page, and the
 site is (astoundingly) dumb enough to serve the content with the
 redirect. You could make urllib2 behave the same way by defining a 302
 handler that does nothing.

 Many thanks for the redirect pointer! I also found
 http://diveintopython.org/http_web_services/redirects.html. Is the
 handler class on this page what you mean by a handler that does
 nothing? (It looks like it exposes the error code but still follows
 the redirect).

 No, all of those examples are handling the redirect. The
 SmartRedirectHandler just captures additional status. I think you need
 something like this:
 class IgnoreRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        return None

    def http_error_302(self, req, fp, code, msg, headers):
        return None

 I guess i'm still a little confused since, if the
 handler does nothing, won't I still go to the ToS page?

 No, it is the action of the handler, responding to the redirect
 request, that causes the ToS page to be fetched.

 For example, I ran the following code (found at
 http://stackoverflow.com/questions/554446/how-do-i-prevent-pythons-urllib2-from-following-a-redirect)

 That is pretty similar to the DiP code...

 I suspect I am not understanding something basic about how urllib2
 deals with this redirect issue since it seems everything I try gives
 me the same ToS page.

 Maybe you don't understand how redirect works in general...

 Generally you have to post to the same url as the form, giving the
 same data the form does. You can inspect the source of the form to
 figure this out. In this case the form is

  <form method="post" action="/products/consent.php">
  <input type="hidden" value="tiwd/products/derivserv/data_table_i.php"
  name="urltarget"/>
  <input type="hidden" value="1" name="check_one"/>
  <input type="hidden" value="tiwdata" name="tag"/>
  <input type="submit" value="I Agree" name="acknowledgement"/>
  <input type="submit" value="Decline" name="acknowledgement"/>
  </form>

 You generally need to enable cookie support in urllib2 as well,
 because the site will use a cookie to flag that you saw the consent
 form. This tutorial shows how to enable cookies and submit form data:
 http://personalpages.tds.net/~kent37/kk/00010.html

 I have seen the login examples where one provides values for the
 fields username and password (thanks Kent). Given the form above,
 however, it's unclear to me how one POSTs the form data when you
 aren't actually passing any parameters. Perhaps this is less of a
 Python question and more an http question (which unfortunately I know
 nothing about either).

 Yes, the parameters are listed in the form.

 If you don't have at least a basic understanding of HTTP and HTML you
 are going to have trouble with this project...

 Kent




-- 
morenotestoself.wordpress.com
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


[Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-06 Thread David Kim
Hello all,

I have two questions I'm hoping someone will have the patience to
answer as an act of mercy.

I. How to get past a Terms of Service page?

I've just started learning python (have never done any programming
prior) and am trying to figure out how to open or download a website
to scrape data. The only problem is, whenever I try to open the link
(via urllib2, for example) I'm after, I end up getting the HTML to a
Terms of Service Page (where one has to click an I Agree button)
rather than the actual target page.

I've seen examples on the web on providing data for forms (typically
by finding the name of the form and providing some sort of dictionary
to fill in the form fields), but this simple act of getting past I
Agree is stumping me. Can anyone save my sanity? As a workaround,
I've been using os.popen('curl ' + url + ' > ' + filename) to save the html
in a txt file for later processing. I have no idea why curl works and
urllib2, for example, doesn't (I use OS X). I even tried to use Yahoo
Pipes to try and sidestep coding anything altogether, but ended up
looking at the same Terms of Service page anyway.

Here's the code (tho it's probably not that illuminating since it's
basically just opening a url):

import urllib2
url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'
#the first of 23 tables
html = urllib2.urlopen(url).read()

II. How to parse html tables with lxml, beautifulsoup? (for dummies)

Assuming i get past the Terms of Service, I'm a bit overwhelmed by the
need to know XPath, CSS, XML, DOM, etc. to scrape data from the web.
I've tried looking at the documentation included with different python
libraries, but just got more confused.

The basic tutorials show something like the following:

from lxml import html
doc = html.parse("/path/to/test.txt") #the file i downloaded via curl
root = doc.getroot() #what is this root business?
tables = root.cssselect('table')

I understand that selecting all the table tags will somehow target
however many tables on the page. The problem is the table has multiple
headers, empty cells, etc. Most of the examples on the web have to do
with scraping the web for search results or something that don't
really depend on the table format for anything other than layout. Are
there any resources out there that are appropriate for web/python
illiterati like myself that deal with structured data as in the url
above?

FYI, the data in the url above goes up in smoke every week, so I'm
trying to capture it automatically on a weekly basis. Getting all of
it into a CSV or database would be a personal cause for celebration as
it would be the first really useful thing I've done with python since
starting to learn it a few months ago.

For anyone who is interested, here is the code that uses curl to
pull the webpages. It basically just builds the url string for the
different table-pages and saves down the file with a timestamped
filename:

import os
from time import strftime

BASE_URL = 'http://www.dtcc.com/products/derivserv/data_table_'
SECTIONS = {'section1':{'select':'i.php?id=table', 'id':range(1,9)},
'section2':{'select':'ii.php?id=table', 'id':range(9,17)},
'section3':{'select':'iii.php?id=table', 'id':range(17,24)}
}

def get_pages():

    filenames = []
    path = '~/Dev/Data/DTCC_DerivServ/'
    #os.popen('cd ' + path)

    for section in SECTIONS:
        for id in SECTIONS[section]['id']:
            #urlList.append(BASE_URL + SECTIONS[section]['select']+str(id))
            url = BASE_URL + SECTIONS[section]['select'] + str(id)
            timestamp = strftime('%Y%m%d_')
            #sectionName = BASE_URL.split('/')[-1]
            sectionNumber = SECTIONS[section]['select'].split('.')[0]
            tableNumber = str(id) + '_'
            filename = timestamp + tableNumber + sectionNumber + '.txt'
            os.popen('curl ' + url + ' > ' + path + filename)
            filenames.append(filename)

    return filenames

if (__name__ == '__main__'):
    get_pages()


--
morenotestoself.wordpress.com
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-06 Thread Stefan Behnel
Hi,

David Kim wrote:
 I have two questions I'm hoping someone will have the patience to
 answer as an act of mercy.
 
 I. How to get past a Terms of Service page?
 
 I've just started learning python (have never done any programming
 prior) and am trying to figure out how to open or download a website
 to scrape data. The only problem is, whenever I try to open the link
 (via urllib2, for example) I'm after, I end up getting the HTML to a
 Terms of Service Page (where one has to click an I Agree button)
 rather than the actual target page.

One comment to make here is that you should first read that page and check
if the provider of the service actually allows you to automatically
download content, or to use the service in the way you want. This is
totally up to them, and if their terms of service state that you must not
do that, well, then you must not do that.

Once you know that it's permitted, you can read the ToS page and search for
the form that the Agree button triggers. The URL given there is the one
you have to read next, but augmented with the parameter (?xyz=...) that
the button sends.


 I've seen examples on the web on providing data for forms (typically
 by finding the name of the form and providing some sort of dictionary
 to fill in the form fields), but this simple act of getting past I
 Agree is stumping me. Can anyone save my sanity? As a workaround,
 I've been using os.popen('curl ' + url + ' > ' + filename) to save the html
 in a txt file for later processing. I have no idea why curl works and
 urllib2, for example, doesn't (I use OS X).

There may be different reasons for that. One is that web servers often
present different content based on the client identifier. So if you see one
page with one client, and another page with a different client, that may be
the reason.


 Here's the code (tho it's probably not that illuminating since it's
 basically just opening a url):
 
 import urllib2
 url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'
 #the first of 23 tables
 html = urllib2.urlopen(url).read()

Hmmm, if what you want is to read a stock ticker or something like that,
you should *really* read their ToS first and make sure they do not disallow
automated access. Because it's actually quite likely that they do.


 II. How to parse html tables with lxml, beautifulsoup? (for dummies)
 
 Assuming i get past the Terms of Service, I'm a bit overwhelmed by the
 need to know XPath, CSS, XML, DOM, etc. to scrape data from the web.

Using CSS selectors (lxml.cssselect) is not at all hard. You basically
express the page structure in a *very* short and straight forward way.

Searching the web for a CSS selectors tutorial should give you a few hits.


 The basic tutorials show something like the following:
 
 from lxml import html
 doc = html.parse("/path/to/test.txt") #the file i downloaded via curl

... or read from the standard output pipe of curl. Note that there is a
stdlib module called subprocess, which may make running curl easier.

Once you've determined the final URL to parse, you can also push it right
into lxml's parse() function, instead of going through urllib2 or an
external tool. Example:

url = "http://pypi.python.org/pypi?%3Aaction=search&term=lxml"
doc = html.parse(url)


 root = doc.getroot() #what is this root business?

The root (or top-most) node of the document you just parsed. Usually the
<html> tag in HTML pages.


 tables = root.cssselect('table')

Simple, isn't it? :)

BTW, did you look at this?

http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/


 I understand that selecting all the table tags will somehow target
 however many tables on the page. The problem is the table has multiple
 headers, empty cells, etc. Most of the examples on the web have to do
 with scraping the web for search results or something that don't
 really depend on the table format for anything other than layout.

That's because in cases like yours, you have to do most of the work
yourself anyway. No page is like the other, so you have to find your way
through the structure and figure out fixed points that allow you to get to
the data.
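As a stdlib-only illustration of that row-walking idea (lxml's cssselect does the selecting for you, but the cell-by-cell cleanup looks much the same; the table markup below is made up):

```python
from html.parser import HTMLParser

# Collect each table row as a list of stripped cell strings.
class TableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False
    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th'):
            self._in_cell, self._cell = True, []
    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._in_cell:
            self._row.append(''.join(self._cell).strip())
            self._in_cell = False
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

p = TableParser()
p.feed("<table><tr><th>Gross</th><th>Net</th></tr>"
       "<tr><td>128</td><td>1</td></tr></table>")
print(p.rows)  # [['Gross', 'Net'], ['128', '1']]
```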

Stefan

___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] urllib unquote

2009-02-17 Thread Sander Sweers
On Tue, Feb 17, 2009 at 08:54, Norman Khine nor...@khine.net wrote:
 Thank you, but is it possible to get the original string from this?

You mean something like this?

 >>> urllib.quote('hL/FGNS40fjoTnp2zIqq73reK60=\n')
'hL/FGNS40fjoTnp2zIqq73reK60%3D%0A'

Greets
Sander
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] urllib unquote

2009-02-17 Thread Senthil Kumaran
On Tue, Feb 17, 2009 at 1:24 PM, Norman Khine nor...@khine.net wrote:
 Thank you, but is it possible to get the original string from this?

What do you mean by the original string Norman?
Look at these definitions:

Quoted String:

In the different parts of the URL, there are set of characters, for
e.g. space character in path, that must be quoted, which means
converted to a different form so that url is understood by the
program.
So ' ' is quoted to %20.

Unquoted String:

When %20 comes in the URL, humans need it unquoted so that we can understand it.
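A quick sketch of the two directions Senthil describes (Python 3 names; quote and unquote moved to urllib.parse):

```python
from urllib.parse import quote, unquote

# quote converts reserved characters to %XX escapes; unquote reverses it.
print(quote('next page.html'))      # next%20page.html
print(unquote('next%20page.html'))  # next page.html
```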


What do you mean by original string?
Why are you doing base64 encoding?
And what are you trying to achieve?

Perhaps these can help us to help you better?



-- 
-- 
Senthil
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] urllib unquote

2009-02-17 Thread Kent Johnson
On Mon, Feb 16, 2009 at 8:12 AM, Norman Khine nor...@khine.net wrote:
 Hello,
 Can someone point me in the right direction. I would like to return the
 string for the following:

 Type "help", "copyright", "credits" or "license" for more information.
 >>> import base64, urllib
 >>> data = 'hL/FGNS40fjoTnp2zIqq73reK60%3D%0A'
 >>> data = urllib.unquote(data)
 >>> print base64.decodestring(data)
 ???Ը???Nzv̊??z?+?


 What am I missing?

How is data created? Since it doesn't decode as you expect, either it
isn't base64 or there is some other processing needed. Do you have an
example of a data string where you know the desired decoded value?

Kent
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] urllib unquote

2009-02-17 Thread Norman Khine
it is my error, the data is a sha string and it is not possible to get 
the string back, unless you use rainbowtables or something of the sort.


Kent Johnson wrote:

On Mon, Feb 16, 2009 at 8:12 AM, Norman Khine nor...@khine.net wrote:

Hello,
Can someone point me in the right direction. I would like to return the
string for the following:

Type "help", "copyright", "credits" or "license" for more information.

>>> import base64, urllib
>>> data = 'hL/FGNS40fjoTnp2zIqq73reK60%3D%0A'
>>> data = urllib.unquote(data)
>>> print base64.decodestring(data)

???Ը???Nzv̊??z?+?
What am I missing?


How is data created? Since it doesn't decode as you expect, either it
isn't base64 or there is some other processing needed. Do you have an
example of a data string where you know the desired decoded value?

Kent


___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


[Tutor] urllib unquote

2009-02-16 Thread Norman Khine

Hello,
Can someone point me in the right direction. I would like to return the 
string for the following:


Type "help", "copyright", "credits" or "license" for more information.
>>> import base64, urllib
>>> data = 'hL/FGNS40fjoTnp2zIqq73reK60%3D%0A'
>>> data = urllib.unquote(data)
>>> print base64.decodestring(data)
???Ը???Nzv̊??z?+?


What am I missing?

Cheers
Norman


___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] urllib unquote

2009-02-16 Thread Sander Sweers
On Mon, Feb 16, 2009 at 14:12, Norman Khine nor...@khine.net wrote:
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import base64, urllib
 >>> data = 'hL/FGNS40fjoTnp2zIqq73reK60%3D%0A'
 >>> data = urllib.unquote(data)
 >>> print base64.decodestring(data)
 ???Ը???Nzv̊??z?+?


 What am I missing?

Not an expert here but I think you can skip the last step...

 >>> urllib.unquote('hL/FGNS40fjoTnp2zIqq73reK60%3D%0A')
'hL/FGNS40fjoTnp2zIqq73reK60=\n'
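The two functions are inverses, which is easy to check (Python 3 sketch; quote and unquote now live in urllib.parse):

```python
from urllib.parse import quote, unquote

# Percent-decoding recovers the base64 text, and quoting it again
# reproduces the URL-safe form exactly.
raw = unquote('hL/FGNS40fjoTnp2zIqq73reK60%3D%0A')
print(repr(raw))   # 'hL/FGNS40fjoTnp2zIqq73reK60=\n'
print(quote(raw))  # hL/FGNS40fjoTnp2zIqq73reK60%3D%0A
```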


Greets
Sander
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] urllib unquote

2009-02-16 Thread Norman Khine

Thank you, but is it possible to get the original string from this?

Sander Sweers wrote:

On Mon, Feb 16, 2009 at 14:12, Norman Khine nor...@khine.net wrote:

Type "help", "copyright", "credits" or "license" for more information.

>>> import base64, urllib
>>> data = 'hL/FGNS40fjoTnp2zIqq73reK60%3D%0A'
>>> data = urllib.unquote(data)
>>> print base64.decodestring(data)

???Ը???Nzv̊??z?+?
What am I missing?


Not an expert here but I think you can skip the last step...


urllib.unquote('hL/FGNS40fjoTnp2zIqq73reK60%3D%0A')

'hL/FGNS40fjoTnp2zIqq73reK60=\n'


Greets
Sander


___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


[Tutor] URLLIB / GLOB

2007-10-22 Thread John
Hello,

I would like to write a program which looks in a web directory for, say
*.gif files. Then processes those files in some manner. What I need is
something like glob which will return a directory listing of all the files
matching the search pattern (or just a simply a certain extension).

Is there a way to do this with urllib? Any other suggestions?

Thanks!
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] URLLIB / GLOB

2007-10-22 Thread Kent Johnson
John wrote:
 Hello,
  
 I would like to write a program which looks in a web directory for, say 
 *.gif files. Then processes those files in some manner. What I need is 
 something like glob which will return a directory listing of all the 
 files matching the search pattern (or just a simply a certain extension).
  
 Is there a way to do this with urllib? Any other suggestions?

If the directory is only available as a web page you will have to fetch 
the web directory listing itself with urllib or urllib2 and parse the 
HTML returned to get the list of files. You might want to use 
BeautifulSoup to parse the HTML.
http://www.crummy.com/software/BeautifulSoup/
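BeautifulSoup is the comfortable route; as a stdlib-only sketch of the same idea, a tiny parser can pull the .gif hrefs out of a directory-listing page (the listing string below is made up):

```python
from html.parser import HTMLParser

# Keep the href of every <a> tag that ends in .gif.
class GifLinks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.gifs = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if href.lower().endswith('.gif'):
                self.gifs.append(href)

listing = '<a href="logo.gif">logo.gif</a> <a href="notes.txt">notes.txt</a>'
p = GifLinks()
p.feed(listing)
print(p.gifs)  # ['logo.gif']
```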

Kent
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] urllib

2006-09-17 Thread Patricia
Hi again,

I was able to use urllib2_file, which is a wrapper to urllib2.urlopen(). It
seems to work fine, and I'm able to retrieve the contents of the file using:
 
afile = req.form.list[1].file.read()

Now I have to store this text file (which is about 500k) and an id number into a
mysql database in a web server. I have a table that has two columns user id
(int) and mediumblob. The problem I have now is I don't know how to store them
into the database. I've been looking for examples without any luck. I tried
using "load data infile", but it seems that I would need to have this client-side
file stored on the server. I used "load data local infile", and got some errors.
I also thought about storing them like this:

afile = req.form.list[1].file.read()
cursor.execute("insert into p_report (sales_order, file_cont) values (%s, %s)",
               (1, afile))
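The second approach (binding the file contents as a query parameter) is generally the right one. An illustration of the pattern with the stdlib sqlite3 module, whose placeholder is ? where MySQLdb uses %s:

```python
import sqlite3

# The blob is bound as a parameter, never pasted into the SQL string.
conn = sqlite3.connect(':memory:')
conn.execute('create table p_report (sales_order integer, file_cont blob)')

afile = b'x' * 500  # stands in for the gzipped report file
conn.execute('insert into p_report (sales_order, file_cont) values (?, ?)',
             (1, afile))

row = conn.execute(
    'select sales_order, length(file_cont) from p_report').fetchone()
print(row)  # (1, 500)
```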

I really don't know which is the best way to do it. Which is the right approach?
I'm really hoping someone can give me an idea how to do it because I'm finding
this a frustrating.

Thanks,
Patricia




___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] urllib

2006-09-12 Thread Kent Johnson
Patricia wrote:
 Hi,
 
 I have used urllib and urllib2 to post data like the following:
 
 dict = {}
 dict['data'] = info
 dict['system'] = aname
 
 data = urllib.urlencode(dict)
 req = urllib2.Request(url)
 
 And to get the data, I emulated a web page with a submit button:
 s = "<html><body>"
 s += "<form action='a_method' method='POST'>"
 s += "<textarea cols='80' rows='200' name='data'></textarea>"
 s += "<input type='text' name='system'>"
 s += "<input type='submit' value='Submit'>"
 s += "</form></body></html>"
 
 
 I would like to know how to send a file. It's a text file that will be 
 gzipped before being posted. I'm using python version 2.2.3.

There are some old examples here:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/146306

I think the modern way uses email.MIMEMultipart but I don't have an 
example handy.
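For reference, the core of those recipes is just assembling the multipart/form-data body by hand. A self-contained sketch in modern Python 3 (the field names, filename, and bytes are made up for illustration; a real upload would pass the returned content type and body to an HTTP POST):

```python
import uuid

def encode_multipart(fields, files):
    """Build a multipart/form-data body by hand.

    fields: dict of name -> str value
    files:  dict of name -> (filename, bytes content)
    Returns (content_type, body_bytes) suitable for an HTTP POST.
    """
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [
            "--" + boundary,
            'Content-Disposition: form-data; name="%s"' % name,
            "",
            value,
        ]
    for name, (filename, content) in files.items():
        lines += [
            "--" + boundary,
            'Content-Disposition: form-data; name="%s"; filename="%s"'
            % (name, filename),
            "Content-Type: application/octet-stream",
            "",
            content.decode("latin-1"),  # latin-1 round-trips arbitrary bytes
        ]
    lines += ["--" + boundary + "--", ""]
    body = "\r\n".join(lines).encode("latin-1")
    return "multipart/form-data; boundary=" + boundary, body

ctype, body = encode_multipart(
    {"system": "aname"},
    {"data": ("report.txt.gz", b"\x1f\x8b fake gzip bytes")},
)
print(ctype)
```

Today a third-party library such as requests does this for you, but the hand-rolled version shows what actually goes over the wire.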

Kent



[Tutor] urllib

2006-09-11 Thread Patricia
Hi,

I have used urllib and urllib2 to post data like the following:

dict = {}
dict['data'] = info
dict['system'] = aname

data = urllib.urlencode(dict)
req = urllib2.Request(url)

And to get the data, I emulated a web page with a submit button:
s = "<html><body>"
s += "<form action='a_method' method='POST'>"
s += "<textarea cols='80' rows='200' name='data'></textarea>"
s += "<input type='text' name='system'>"
s += "<input type='submit' value='Submit'>"
s += "</form></body></html>"


I would like to know how to send a file. It's a text file that will be 
gzipped before being posted. I'm using python version 2.2.3.
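For readers following along today: the urllib.urlencode/urllib2.Request pair above maps directly onto urllib.parse and urllib.request in Python 3, with the extra wrinkle that POST bodies must be bytes. A minimal sketch (the URL is a made-up placeholder; no request is actually sent here):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; in Python 2 this was urllib.urlencode / urllib2.Request.
url = "http://example.com/a_method"

payload = {"data": "info", "system": "aname"}
data = urlencode(payload).encode("ascii")  # POST bodies must be bytes in Python 3

req = Request(url, data=data)  # supplying data makes this a POST request
print(req.get_method(), req.data)
```

Calling urllib.request.urlopen(req) would then perform the POST and return a file-like response object.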


Thanks,
Patricia




[Tutor] URLLIB

2005-05-13 Thread Servando Garcia
Hello list
I am on challenge 5. I think I need to somehow download a file. I have been trying like so:

X=urllib.URLopener(name,proxies={'http':'URL').distutils.copy_file('SomeFileName')

but with no luck.
Servando Garcia
John 3:16
For GOD so loved the world...


Re: [Tutor] URLLIB

2005-05-13 Thread Kent Johnson
Servando Garcia wrote:
 Hello list
 I am on challenge 5. I think I need to somehow download a file. I have 
 been trying like so
 
 X=urllib.URLopener(name,proxies={'http':'URL').distutils.copy_file('SomeFileName')
  

urllib.urlopen() returns a file-like object - something that behaves like an open 
file. Try
x = urllib.urlopen(name)
data = x.read()
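The same pattern in modern Python 3 lives in urllib.request. A runnable sketch using a file:// URL so it works without network access (the temp file and its contents are made up for illustration; for the challenge you would pass an http:// URL instead):

```python
import os
import tempfile
from urllib.request import urlopen, pathname2url

# Write a small temp file and fetch it back through urlopen(); a file:// URL
# keeps the sketch runnable offline.
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as f:
    f.write("hello from urlopen")

url = "file:" + pathname2url(path)
with urlopen(url) as response:   # file-like object, just as with http://
    data = response.read()       # read() returns bytes in Python 3

os.remove(path)
print(data)
```

Once you have the bytes, saving them is just an ordinary file write; there is no need to involve distutils.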

Kent
