Re: [Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-07 Thread David Kim
Thanks Kent, perhaps I'll cool the Python jets and move on to HTTP and
HTML. I was hoping it would be something I could just pick up along
the way, looks like I was wrong.

dk

On Tue, Jul 7, 2009 at 1:56 PM, Kent Johnson wrote:
> On Tue, Jul 7, 2009 at 1:20 PM, David Kim wrote:
>> On Tue, Jul 7, 2009 at 7:26 AM, Kent Johnson wrote:
>>>
>>> curl works because it ignores the redirect to the ToS page, and the
>>> site is (astoundingly) dumb enough to serve the content with the
>>> redirect. You could make urllib2 behave the same way by defining a 302
>>> handler that does nothing.
>>
>> Many thanks for the redirect pointer! I also found
>> http://diveintopython.org/http_web_services/redirects.html. Is the
>> handler class on this page what you mean by a handler that does
>> nothing? (It looks like it exposes the error code but still follows
>> the redirect).
>
> No, all of those examples are handling the redirect. The
> SmartRedirectHandler just captures additional status. I think you need
> something like this:
> class IgnoreRedirectHandler(urllib2.HTTPRedirectHandler):
>    def http_error_301(self, req, fp, code, msg, headers):
>        return None
>
>    def http_error_302(self, req, fp, code, msg, headers):
>        return None
>
>> I guess I'm still a little confused since, if the
>> handler does nothing, won't I still go to the ToS page?
>
> No, it is the action of the handler, responding to the redirect
> request, that causes the ToS page to be fetched.
>
>> For example, I ran the following code (found at
>> http://stackoverflow.com/questions/554446/how-do-i-prevent-pythons-urllib2-from-following-a-redirect)
>
> That is pretty similar to the DiP code...
>
>> I suspect I am not understanding something basic about how urllib2
>> deals with this redirect issue since it seems everything I try gives
>> me the same ToS page.
>
> Maybe you don't understand how redirects work in general...
>
>>> Generally you have to post to the same url as the form, giving the
>>> same data the form does. You can inspect the source of the form to
>>> figure this out. In this case the form is
>>>
>>> [form HTML stripped by the list archive; it included an input
>>> named "urltarget" and the "I agree" submit button]
>>>
>>> You generally need to enable cookie support in urllib2 as well,
>>> because the site will use a cookie to flag that you saw the consent
>>> form. This tutorial shows how to enable cookies and submit form data:
>>> http://personalpages.tds.net/~kent37/kk/00010.html
>>
>> I have seen the login examples where one provides values for the
>> fields username and password (thanks Kent). Given the form above,
>> however, it's unclear to me how one POSTs the form data when you
>> aren't actually passing any parameters. Perhaps this is less of a
>> Python question and more an http question (which unfortunately I know
>> nothing about either).
>
> Yes, the parameters are listed in the form.
>
> If you don't have at least a basic understanding of HTTP and HTML you
> are going to have trouble with this project...
>
> Kent
>



-- 
morenotestoself.wordpress.com


Re: [Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-07 Thread Kent Johnson
On Tue, Jul 7, 2009 at 1:20 PM, David Kim wrote:
> On Tue, Jul 7, 2009 at 7:26 AM, Kent Johnson wrote:
>>
>> curl works because it ignores the redirect to the ToS page, and the
>> site is (astoundingly) dumb enough to serve the content with the
>> redirect. You could make urllib2 behave the same way by defining a 302
>> handler that does nothing.
>
> Many thanks for the redirect pointer! I also found
> http://diveintopython.org/http_web_services/redirects.html. Is the
> handler class on this page what you mean by a handler that does
> nothing? (It looks like it exposes the error code but still follows
> the redirect).

No, all of those examples are handling the redirect. The
SmartRedirectHandler just captures additional status. I think you need
something like this:
class IgnoreRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        return None

    def http_error_302(self, req, fp, code, msg, headers):
        return None
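
An untested sketch of wiring that in: since nothing handles the
redirect, urllib2 raises HTTPError for the 302, but the HTTPError
object is itself file-like, so you can read the content served along
with the redirect straight from it:

import urllib2

opener = urllib2.build_opener(IgnoreRedirectHandler)
url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'
try:
    response = opener.open(url)
except urllib2.HTTPError, e:
    if e.code in (301, 302):
        response = e  # the body served alongside the redirect
    else:
        raise
print response.read()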

> I guess I'm still a little confused since, if the
> handler does nothing, won't I still go to the ToS page?

No, it is the action of the handler, responding to the redirect
request, that causes the ToS page to be fetched.

> For example, I ran the following code (found at
> http://stackoverflow.com/questions/554446/how-do-i-prevent-pythons-urllib2-from-following-a-redirect)

That is pretty similar to the DiP code...

> I suspect I am not understanding something basic about how urllib2
> deals with this redirect issue since it seems everything I try gives
> me the same ToS page.

Maybe you don't understand how redirects work in general...

>> Generally you have to post to the same url as the form, giving the
>> same data the form does. You can inspect the source of the form to
>> figure this out. In this case the form is
>>
>> [form HTML stripped by the list archive; it included an input
>> named "urltarget" and the "I agree" submit button]
>>
>> You generally need to enable cookie support in urllib2 as well,
>> because the site will use a cookie to flag that you saw the consent
>> form. This tutorial shows how to enable cookies and submit form data:
>> http://personalpages.tds.net/~kent37/kk/00010.html
>
> I have seen the login examples where one provides values for the
> fields username and password (thanks Kent). Given the form above,
> however, it's unclear to me how one POSTs the form data when you
> aren't actually passing any parameters. Perhaps this is less of a
> Python question and more an http question (which unfortunately I know
> nothing about either).

Yes, the parameters are listed in the form.

If you don't have at least a basic understanding of HTTP and HTML you
are going to have trouble with this project...

Kent


Re: [Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-07 Thread Sander Sweers
2009/7/7 David Kim :
> opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor)
> urllib2.install_opener(opener)
>
> response = urllib2.urlopen("http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1")
> print response.read()
> 
>
> I suspect I am not understanding something basic about how urllib2
> deals with this redirect issue since it seems everything I try gives
> me the same ToS page.

Indeed, you create the opener but then you do not use it. Try the
below and it should work.
  response = opener.open("http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1")
  data = response.read()

Greets
Sander


Re: [Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-07 Thread David Kim
On Tue, Jul 7, 2009 at 7:26 AM, Kent Johnson wrote:
>
> curl works because it ignores the redirect to the ToS page, and the
> site is (astoundingly) dumb enough to serve the content with the
> redirect. You could make urllib2 behave the same way by defining a 302
> handler that does nothing.

Many thanks for the redirect pointer! I also found
http://diveintopython.org/http_web_services/redirects.html. Is the
handler class on this page what you mean by a handler that does
nothing? (It looks like it exposes the error code but still follows
the redirect). I guess I'm still a little confused since, if the
handler does nothing, won't I still go to the ToS page?

For example, I ran the following code (found at
http://stackoverflow.com/questions/554446/how-do-i-prevent-pythons-urllib2-from-following-a-redirect)
and ended-up pulling the same ToS page anyway.


import urllib2

redirect_handler = urllib2.HTTPRedirectHandler()

class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        return urllib2.HTTPRedirectHandler.http_error_302(
            self, req, fp, code, msg, headers)

    http_error_301 = http_error_303 = http_error_307 = http_error_302

cookieprocessor = urllib2.HTTPCookieProcessor()

opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor)
urllib2.install_opener(opener)

response = urllib2.urlopen("http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1")
print response.read()


I suspect I am not understanding something basic about how urllib2
deals with this redirect issue since it seems everything I try gives
me the same ToS page.

> Generally you have to post to the same url as the form, giving the
> same data the form does. You can inspect the source of the form to
> figure this out. In this case the form is
>
> [form HTML stripped by the list archive; it included an input
> named "urltarget" and the "I agree" submit button]
>
> You generally need to enable cookie support in urllib2 as well,
> because the site will use a cookie to flag that you saw the consent
> form. This tutorial shows how to enable cookies and submit form data:
> http://personalpages.tds.net/~kent37/kk/00010.html

I have seen the login examples where one provides values for the
fields username and password (thanks Kent). Given the form above,
however, it's unclear to me how one POSTs the form data when you
aren't actually passing any parameters. Perhaps this is less of a
Python question and more an http question (which unfortunately I know
nothing about either).

Thanks so much again for the help!

DK



--
morenotestoself.wordpress.com


Re: [Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-07 Thread Kent Johnson
On Mon, Jul 6, 2009 at 5:54 PM, David Kim wrote:
> Hello all,
>
> I have two questions I'm hoping someone will have the patience to
> answer as an act of mercy.
>
> I. How to get past a Terms of Service page?
>
> I've just started learning python (have never done any programming
> prior) and am trying to figure out how to open or download a website
> to scrape data. The only problem is, whenever I try to open the link
> (via urllib2, for example) I'm after, I end up getting the HTML to a
> Terms of Service Page (where one has to click an "I Agree" button)
> rather than the actual target page.
>
> I've seen examples on the web on providing data for forms (typically
> by finding the name of the form and providing some sort of dictionary
> to fill in the form fields), but this simple act of getting past "I
> Agree" is stumping me. Can anyone save my sanity? As a workaround,
> I've been using os.popen('curl ' + url + ' > ' + filename) to save the html
> in a txt file for later processing. I have no idea why curl works and
> urllib2, for example, doesn't (I use OS X).

curl works because it ignores the redirect to the ToS page, and the
site is (astoundingly) dumb enough to serve the content with the
redirect. You could make urllib2 behave the same way by defining a 302
handler that does nothing.

> I even tried to use Yahoo
> Pipes to try and sidestep coding anything altogether, but ended up
> looking at the same Terms of Service page anyway.
>
> Here's the code (tho it's probably not that illuminating since it's
> basically just opening a url):
>
> import urllib2
> url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'
> #the first of 23 tables
> html = urllib2.urlopen(url).read()

Generally you have to post to the same url as the form, giving the
same data the form does. You can inspect the source of the form to
figure this out. In this case the form is

[form HTML stripped by the list archive; it included an input named
"urltarget" and the "I agree" submit button]

You generally need to enable cookie support in urllib2 as well,
because the site will use a cookie to flag that you saw the consent
form. This tutorial shows how to enable cookies and submit form data:
http://personalpages.tds.net/~kent37/kk/00010.html
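
Untested sketch of that pattern; the form's action URL and most field
names below are made up, so read the real ones off the ToS page source:

import urllib, urllib2, cookielib

cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# Hypothetical field names -- take the real ones from the form.
form_data = urllib.urlencode({
    'urltarget': 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1',
    'acceptance': 'I agree',
})

# POST to the form's action URL (also hypothetical here). Passing a
# data argument makes urllib2 use POST instead of GET.
response = opener.open('http://www.dtcc.com/consent_form_action.php', form_data)

# The consent cookie now lives in cookie_jar, so this request carries it.
html = opener.open('http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1').read()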

Kent


Re: [Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-06 Thread Stefan Behnel
Hi,

David Kim wrote:
> I have two questions I'm hoping someone will have the patience to
> answer as an act of mercy.
> 
> I. How to get past a Terms of Service page?
> 
> I've just started learning python (have never done any programming
> prior) and am trying to figure out how to open or download a website
> to scrape data. The only problem is, whenever I try to open the link
> (via urllib2, for example) I'm after, I end up getting the HTML to a
> Terms of Service Page (where one has to click an "I Agree" button)
> rather than the actual target page.

One comment to make here is that you should first read that page and check
if the provider of the service actually allows you to automatically
download content, or to use the service in the way you want. This is
totally up to them, and if their terms of service state that you must not
do that, well, then you must not do that.

Once you know that it's permitted, you can read the ToS page and search for
the form that the "Agree" button triggers. The URL given there is the one
you have to read next, but augmented with the parameter ("?xyz=...") that
the button sends.
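
A tiny illustration, with a made-up action URL and parameter name --
the real ones come from the form's HTML:

import urllib

form_action = 'http://www.example.com/consent.php'  # the form's action attribute
query = urllib.urlencode({'agree': 'yes'})          # the fields the button submits
url_to_fetch = form_action + '?' + query            # .../consent.php?agree=yes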


> I've seen examples on the web on providing data for forms (typically
> by finding the name of the form and providing some sort of dictionary
> to fill in the form fields), but this simple act of getting past "I
> Agree" is stumping me. Can anyone save my sanity? As a workaround,
> I've been using os.popen('curl ' + url + ' > ' + filename) to save the html
> in a txt file for later processing. I have no idea why curl works and
> urllib2, for example, doesn't (I use OS X).

There may be different reasons for that. One is that web servers often
present different content based on the client identifier (the
User-Agent header). So if you see one page with one client and another
page with a different client, that may be the reason.


> Here's the code (tho it's probably not that illuminating since it's
> basically just opening a url):
> 
> import urllib2
> url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'
> #the first of 23 tables
> html = urllib2.urlopen(url).read()

Hmmm, if what you want is to read a stock ticker or something like that,
you should *really* read their ToS first and make sure they do not disallow
automated access. Because it's actually quite likely that they do.


> II. How to parse html tables with lxml, beautifulsoup? (for dummies)
> 
> Assuming I get past the Terms of Service, I'm a bit overwhelmed by the
> need to know XPath, CSS, XML, DOM, etc. to scrape data from the web.

Using CSS selectors (lxml.cssselect) is not at all hard. You basically
express the page structure in a *very* short and straightforward way.

Searching the web for a CSS selectors tutorial should give you a few hits.


> The basic tutorials show something like the following:
> 
> from lxml import html
> doc = html.parse("/path/to/test.txt") #the file i downloaded via curl

... or read from the standard output pipe of curl. Note that there is a
stdlib module called "subprocess", which may make running curl easier.
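
For instance, an untested sketch that parses curl's stdout without a
temp file:

import subprocess
from lxml import html

url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'
# -s silences curl's progress output; lxml parses the pipe directly.
proc = subprocess.Popen(['curl', '-s', url], stdout=subprocess.PIPE)
doc = html.parse(proc.stdout)
proc.wait()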

Once you've determined the final URL to parse, you can also push it right
into lxml's parse() function, instead of going through urllib2 or an
external tool. Example:

url = "http://pypi.python.org/pypi?%3Aaction=search&term=lxml"
doc = html.parse(url)


> root = doc.getroot() #what is this root business?

The root (or top-most) node of the document you just parsed. Usually an
"html" tag in HTML pages.


> tables = root.cssselect('table')

Simple, isn't it? :)

BTW, did you look at this?

http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/


> I understand that selecting all the table tags will somehow target
> however many tables on the page. The problem is the table has multiple
> headers, empty cells, etc. Most of the examples on the web have to do
> with scraping the web for search results or something that doesn't
> really depend on the table format for anything other than layout.

That's because in cases like yours, you have to do most of the work
yourself anyway. No page is like the other, so you have to find your way
through the structure and figure out fixed points that allow you to get to
the data.
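
As a starting point, here is an untested sketch that flattens every
table row into a CSV line; real tables with multi-row headers or
merged cells will need page-specific cleanup, and the input filename
is just whatever your curl step saved:

import csv
from lxml import html

doc = html.parse("/path/to/test.txt")   # the file downloaded via curl
rows = []
for table in doc.getroot().cssselect('table'):
    for tr in table.cssselect('tr'):
        # text_content() flattens any markup nested inside a cell
        rows.append([cell.text_content().strip()
                     for cell in tr.cssselect('td, th')])

with open('table1.csv', 'wb') as f:
    csv.writer(f).writerows(rows)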

Stefan



[Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-06 Thread David Kim
Hello all,

I have two questions I'm hoping someone will have the patience to
answer as an act of mercy.

I. How to get past a Terms of Service page?

I've just started learning python (have never done any programming
prior) and am trying to figure out how to open or download a website
to scrape data. The only problem is, whenever I try to open the link
(via urllib2, for example) I'm after, I end up getting the HTML to a
Terms of Service Page (where one has to click an "I Agree" button)
rather than the actual target page.

I've seen examples on the web on providing data for forms (typically
by finding the name of the form and providing some sort of dictionary
to fill in the form fields), but this simple act of getting past "I
Agree" is stumping me. Can anyone save my sanity? As a workaround,
I've been using os.popen('curl ' + url + ' > ' + filename) to save the html
in a txt file for later processing. I have no idea why curl works and
urllib2, for example, doesn't (I use OS X). I even tried to use Yahoo
Pipes to try and sidestep coding anything altogether, but ended up
looking at the same Terms of Service page anyway.

Here's the code (tho it's probably not that illuminating since it's
basically just opening a url):

import urllib2
url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'
#the first of 23 tables
html = urllib2.urlopen(url).read()

II. How to parse html tables with lxml, beautifulsoup? (for dummies)

Assuming I get past the Terms of Service, I'm a bit overwhelmed by the
need to know XPath, CSS, XML, DOM, etc. to scrape data from the web.
I've tried looking at the documentation included with different python
libraries, but just got more confused.

The basic tutorials show something like the following:

from lxml import html
doc = html.parse("/path/to/test.txt") #the file i downloaded via curl
root = doc.getroot() #what is this root business?
tables = root.cssselect('table')

I understand that selecting all the table tags will somehow target
however many tables on the page. The problem is the table has multiple
headers, empty cells, etc. Most of the examples on the web have to do
with scraping the web for search results or something that doesn't
really depend on the table format for anything other than layout. Are
there any resources out there that are appropriate for web/python
illiterati like myself that deal with structured data as in the url
above?

FYI, the data in the url above goes up in smoke every week, so I'm
trying to capture it automatically on a weekly basis. Getting all of
it into a CSV or database would be a personal cause for celebration as
it would be the first really useful thing I've done with python since
starting to learn it a few months ago.

For anyone who is interested, here is the code that uses "curl" to
pull the webpages. It basically just builds the url string for the
different table-pages and saves down the file with a timestamped
filename:

import os
from time import strftime

BASE_URL = 'http://www.dtcc.com/products/derivserv/data_table_'
SECTIONS = {'section1':{'select':'i.php?id=table', 'id':range(1,9)},
            'section2':{'select':'ii.php?id=table', 'id':range(9,17)},
            'section3':{'select':'iii.php?id=table', 'id':range(17,24)}
            }

def get_pages():

    filenames = []
    path = '~/Dev/Data/DTCC_DerivServ/'
    #os.popen('cd ' + path)

    for section in SECTIONS:
        for id in SECTIONS[section]['id']:
            #urlList.append(BASE_URL + SECTIONS[section]['select']+str(id))
            url = BASE_URL + SECTIONS[section]['select'] + str(id)
            timestamp = strftime('%Y%m%d_')
            #sectionName = BASE_URL.split('/')[-1]
            sectionNumber = SECTIONS[section]['select'].split('.')[0]
            tableNumber = str(id) + '_'
            filename = timestamp + tableNumber + sectionNumber + '.txt'
            os.popen('curl ' + url + ' > ' + path + filename)
            filenames.append(filename)

    return filenames

if (__name__ == '__main__'):
    get_pages()


--
morenotestoself.wordpress.com