Re: Web page data and urllib2.urlopen

2009-08-07 Thread Piet van Oostrum
> Dave Angel  (DA) wrote:

>DA> Piet van Oostrum wrote:
 
 
>>> 
>DA> But the raw page didn't have any javascript.  So what about that original
>DA> raw page triggered additional stuff to be loaded?
>DA> Is it "user agent", as someone else brought out?  And is there somewhere I
>DA> can read more about that aspect of things?  I've mostly built very static
>DA> html pages, where the server yields the same page to everybody.  And some
>DA> form stuff, where the user clicks on a "submit" button to trigger a script
>DA> that's not shown on the URL line.
 
>>> 
>>> Yes, if you specify a 'normal' web browser as user agent you do get the
>>> Javascript:
>>> 
>>> import urllib2
>>> 
>>> request = urllib2.Request('http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27')
>>> request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')
>>> 
>>> opener = urllib2.build_opener()
>>> page = opener.open(request).read()
>>> print page
>>> 
>>> 
>DA> Thanks much.  That's a key I didn't understand.

You can even specify the headers in the Request constructor:


url = 'http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27'
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13'}
request = urllib2.Request(url=url, headers=hdr)
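For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the two ways of setting the header can be compared side by side. This is only a minimal sketch and not part of the original exchange: the variable names are illustrative, and it builds the request objects without sending anything over the network.

```python
import urllib.request

# URL and User-Agent string copied from the messages in this thread.
url = ('http://www.marketwatch.com/story/'
       'mondays-biggest-gaining-and-declining-stocks-2009-07-27')
user_agent = ('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; '
              'rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')

# Variant 1: pass the headers to the constructor.
via_constructor = urllib.request.Request(url, headers={'User-Agent': user_agent})

# Variant 2: add the header after construction.
via_add_header = urllib.request.Request(url)
via_add_header.add_header('User-Agent', user_agent)

# Request normalizes header names with str.capitalize(), so the stored key
# is 'User-agent' in both cases and the two requests carry the same headers.
same = via_constructor.header_items() == via_add_header.header_items()
print(same)
```

Either request could then be passed to urllib.request.urlopen() or to an opener's open() method.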

-- 
Piet van Oostrum 
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Web page data and urllib2.urlopen

2009-08-07 Thread Dave Angel


Piet van Oostrum wrote:






DA> But the raw page didn't have any javascript.  So what about that original
DA> raw page triggered additional stuff to be loaded?
DA> Is it "user agent", as someone else brought out?  And is there somewhere I
DA> can read more about that aspect of things?  I've mostly built very static
DA> html pages, where the server yields the same page to everybody.  And some
DA> form stuff, where the user clicks on a "submit" button to trigger a script
DA> that's not shown on the URL line.



Yes, if you specify a 'normal' web browser as user agent you do get the
Javascript:

import urllib2

request = urllib2.Request('http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27')
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')

opener = urllib2.build_opener() 
page = opener.open(request).read()

print page

  

Thanks much.  That's a key I didn't understand.

DaveA


Re: Web page data and urllib2.urlopen

2009-08-07 Thread Piet van Oostrum
> Dave Angel  (DA) wrote:

>DA> Piet van Oostrum wrote:
>>> 
>DA> If Mozilla had seen a page with this line in an appropriate place, it'd
>DA> immediately begin loading the other page, at "someotherurl".  But there's no
>DA> such line.
 
>>> 
>>> 
>DA> Next, I looked for javascript.  The Mozilla page contains lots of
>DA> javascript, but there's none in the raw page.  So I can't explain Mozilla's
>DA> differences that way.
 
>>> 
>>> 
>DA> I did notice the link to /m/Content/mobile2.css, but I don't know any way
>DA> a CSS file could cause the content to change, just the display.
 
>>> 
>>> 
>DA> All I can guess is that it has something to do with "browser type" or
>DA> cookies.  And that would make lots of sense if this was a cgi page.  But
>DA> the URL doesn't look like that, as it doesn't end in pl, py, asp, or any of
>DA> another dozen special suffixes.
 
>>> 
>>> 
>DA> Any hints, anybody???
 
>>> 
>>> If you look into the HTML that Firefox gets, there is a lot of
>>> javascript in it.
>>> 

>DA> But the raw page didn't have any javascript.  So what about that original
>DA> raw page triggered additional stuff to be loaded?
>DA> Is it "user agent", as someone else brought out?  And is there somewhere I
>DA> can read more about that aspect of things?  I've mostly built very static
>DA> html pages, where the server yields the same page to everybody.  And some
>DA> form stuff, where the user clicks on a "submit" button to trigger a script
>DA> that's not shown on the URL line.

Yes, if you specify a 'normal' web browser as user agent you do get the
Javascript:

import urllib2

request = urllib2.Request('http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27')
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')

opener = urllib2.build_opener() 
page = opener.open(request).read()
print page

-- 
Piet van Oostrum 
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org


Re: Re: Web page data and urllib2.urlopen

2009-08-06 Thread Kushal Kumaran
On Fri, Aug 7, 2009 at 3:47 AM, Dave Angel wrote:
>
>
> Piet van Oostrum wrote:
>>
>> 
>>>
>>> DA> All I can guess is that it has something to do with "browser type" or
>>> DA> cookies.  And that would make lots of sense if this was a cgi page.  But
>>> DA> the URL doesn't look like that, as it doesn't end in pl, py, asp, or any of
>>> DA> another dozen special suffixes.
>>>

Note that the URL does not have to have any special suffix for it to
be dynamically generated.  See any page at Wikipedia, for example.
MediaWiki, the software running the site, is a PHP application.
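To make that concrete, here is a rough sketch of a server-side handler that answers a suffix-free URL and varies its response with the User-Agent header. It is written as a tiny WSGI application for Python 3; the handler, the path and the page bodies are all made up for illustration, and nothing here is known about how the site in question actually works.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """Hypothetical handler: same suffix-free URL, different body per client."""
    user_agent = environ.get('HTTP_USER_AGENT', '')
    if 'Mozilla' in user_agent:
        body = b'<html><script>/* full page with scripts */</script></html>'
    else:
        body = b'<html><!-- plain page for other clients --></html>'
    start_response('200 OK', [('Content-Type', 'text/html'),
                              ('Content-Length', str(len(body)))])
    return [body]


def render(user_agent):
    """Call the app directly with a fake request; no server is started."""
    environ = {}
    setup_testing_defaults(environ)
    environ['PATH_INFO'] = '/story/some-article'  # no .py, .pl or .asp suffix
    environ['HTTP_USER_AGENT'] = user_agent
    return b''.join(app(environ, lambda status, headers: None))


browser_page = render('Mozilla/5.0 (example)')
script_page = render('Python-urllib/2.6')
print(browser_page != script_page)
```

The URL is identical in both calls; only the request header differs, which is enough for the generated page to differ.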

>>
>>
>>>
>>> DA> Any hints, anybody???
>>>
>>
>> If you look into the HTML that Firefox gets, there is a lot of
>> javascript in it.
>>
>
> But the raw page didn't have any javascript.  So what about that original
> raw page triggered additional stuff to be loaded?

FWIW, I'm getting a ton of javascript in the page downloaded using
your code fragment.

> Is it "user agent", as someone else brought out?  And is there somewhere I
> can read more about that aspect of things?  I've mostly built very static
> html pages, where the server yields the same page to everybody.  And some
> form stuff, where the user clicks on a "submit" button to trigger a script
> that's not shown on the URL line.
>

-- 
kushal


Re: Re: Web page data and urllib2.urlopen

2009-08-06 Thread Dave Angel



Piet van Oostrum wrote:



DA> If Mozilla had seen a page with this line in an appropriate place, it'd
DA> immediately begin loading the other page, at "someotherurl".  But there's no
DA> such line.



  

DA> Next, I looked for javascript.  The Mozilla page contains lots of
DA> javascript, but there's none in the raw page.  So I can't explain Mozilla's
DA> differences that way.



  

DA> I did notice the link to /m/Content/mobile2.css, but I don't know any way
DA> a CSS file could cause the content to change, just the display.



  

DA> All I can guess is that it has something to do with "browser type" or
DA> cookies.  And that would make lots of sense if this was a cgi page.  But
DA> the URL doesn't look like that, as it doesn't end in pl, py, asp, or any of
DA> another dozen special suffixes.



  

DA> Any hints, anybody???



If you look into the HTML that Firefox gets, there is a lot of
javascript in it.
  


But the raw page didn't have any javascript.  So what about that 
original raw page triggered additional stuff to be loaded?
Is it "user agent", as someone else brought out?  And is there somewhere 
I can read more about that aspect of things?  I've mostly built very 
static html pages, where the server yields the same page to everybody.  
And some form stuff, where the user clicks on a "submit" button to 
trigger a script that's not shown on the URL line.






Re: Web page data and urllib2.urlopen

2009-08-06 Thread Piet van Oostrum
> Dave Angel  (DA) wrote:

>DA> Massi wrote:
>>> Hi everyone, I'm using the urllib2 library to get the html source code
>>> of web pages. In general it works great, but I'm having trouble with a
>>> financial web site which does not provide the source code I expect. As
>>> a matter of fact, if you try:
>>> 
>>> import urllib2
>>> res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
>>> page = res.read()
>>> print page
>>> 
>>> you will see that the printed code is very different from the one
>>> given, for example, by mozilla. Since I have really little knowledge
>>> in html I can't even understand if this is a python or html problem.
>>> Can anyone give me some help?
>>> Thanks in advance.
>>> 
>>> 
>DA> I don't think this is a Python issue, but a "raw read" versus an
>DA> interactive interpretation of a page.  The browser does lots more than a
>DA> single roundtrip defined by urlopen/read.

>DA> I also would love to see some explanation of what happens here, or a
>DA> pointer to a reference that would help me understand it.

>DA> I took the output of the read(), and formatted it, roughly, as html.  I
>DA> expected to find a refresh, which is the simplest way that one page can
>DA> cause a very different one to be loaded.
>DA>  

>DA> If Mozilla had seen a page with this line in an appropriate place, it'd
>DA> immediately begin loading the other page, at "someotherurl".  But there's no
>DA> such line.

>DA> Next, I looked for javascript.  The Mozilla page contains lots of
>DA> javascript, but there's none in the raw page.  So I can't explain Mozilla's
>DA> differences that way.

>DA> I did notice the link to /m/Content/mobile2.css, but I don't know any way
>DA> a CSS file could cause the content to change, just the display.

>DA> All I can guess is that it has something to do with "browser type" or
>DA> cookies.  And that would make lots of sense if this was a cgi page.  But
>DA> the URL doesn't look like that, as it doesn't end in pl, py, asp, or any of
>DA> another dozen special suffixes.

>DA> Any hints, anybody???

If you look into the HTML that Firefox gets, there is a lot of
javascript in it.
-- 
Piet van Oostrum 
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org


Re: Web page data and urllib2.urlopen

2009-08-06 Thread ryles
On Aug 5, 4:30 pm, Massi  wrote:
> Hi everyone, I'm using the urllib2 library to get the html source code
> of web pages. In general it works great, but I'm having trouble with a
> financial web site which does not provide the source code I expect. As
> a matter of fact, if you try:
>
> import urllib2
> res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
> page = res.read()
> print page
>
> you will see that the printed code is very different from the one
> given, for example, by mozilla. Since I have really little knowledge
> in html I can't even understand if this is a python or html problem.
> Can anyone give me some help?
> Thanks in advance.

Check if setting your user agent to Mozilla results in a different
page:

http://diveintopython.org/http_web_services/user_agent.html
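As background on why the user agent can matter: unless told otherwise, the urllib machinery announces itself as Python-urllib, which a server can easily tell apart from a browser. The sketch below assumes Python 3, where urllib2's functionality lives in urllib.request (the thread itself uses Python 2). It prints the default header and shows one way an opener could override it. The replacement string is only a placeholder, and nothing is fetched.

```python
import urllib.request

opener = urllib.request.build_opener()

# Headers the opener adds to every request unless the request overrides them.
default_headers = dict(opener.addheaders)
print(default_headers['User-agent'])  # a string beginning with "Python-urllib/"

# Replace the default so that every request made through this opener
# presents a browser-like identity. The value is a shortened placeholder.
opener.addheaders = [('User-agent', 'Mozilla/5.0 (placeholder)')]
browser_like_headers = dict(opener.addheaders)
print(browser_like_headers['User-agent'])
```

Fetching the same URL once with each identity and comparing the two results is then a quick way to check whether the server treats them differently.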


Re: Web page data and urllib2.urlopen

2009-08-05 Thread Dave Angel

Massi wrote:

Hi everyone, I'm using the urllib2 library to get the html source code
of web pages. In general it works great, but I'm having trouble with a
financial web site which does not provide the source code I expect. As
a matter of fact, if you try:

import urllib2
res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
page = res.read()
print page

you will see that the printed code is very different from the one
given, for example, by mozilla. Since I have really little knowledge
in html I can't even understand if this is a python or html problem.
Can anyone give me some help?
Thanks in advance.

  
I don't think this is a Python issue, but a "raw read" versus an 
interactive interpretation of a page.  The browser does lots more than a 
single roundtrip defined by urlopen/read.
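One way to see the extra round trips is to list the resources that a fetched page refers to, since a browser would request each of them in turn. The following is a rough Python 3 sketch using the standard html.parser module. The sample markup is made up (only the stylesheet path comes from this message), and the set of tags checked is deliberately small.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect URLs a browser would request in addition to the page itself."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ('script', 'img') and attrs.get('src'):
            self.resources.append(attrs['src'])
        elif tag == 'link' and attrs.get('href'):
            self.resources.append(attrs['href'])


# Made-up page; the script and image paths are purely illustrative.
sample = '''<html><head>
<link rel="stylesheet" href="/m/Content/mobile2.css">
<script src="/js/example.js"></script>
</head><body><img src="/img/example.png"></body></html>'''

collector = ResourceCollector()
collector.feed(sample)
print(collector.resources)
```

A single urlopen()/read() call retrieves only the first document; none of the collected URLs are fetched, and no script is ever run.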


I also would love to see some explanation of what happens here, or a 
pointer to a reference that would help me understand it.


I took the output of the read(), and formatted it, roughly, as html.  I 
expected to find a refresh, which is the simplest way that one page can 
cause a very different one to be loaded.

 

If Mozilla had seen a page with this line in an appropriate place, it'd 
immediately begin loading the other page, at "someotherurl".  But there's 
no such line.
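Checking for a refresh can also be done programmatically. The sketch below (Python 3, standard html.parser) looks for a meta tag whose http-equiv attribute is "refresh" and reports its content attribute. The sample tag is the conventional form of a meta refresh, written here for illustration rather than copied from any real page; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RefreshFinder(HTMLParser):
    """Record the content attribute of any <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refresh = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'meta' and (attrs.get('http-equiv') or '').lower() == 'refresh':
            self.refresh = attrs.get('content')


def find_refresh(html):
    """Return the refresh instruction found in html, or None if there is none."""
    finder = RefreshFinder()
    finder.feed(html)
    return finder.refresh


with_refresh = ('<html><head><meta http-equiv="refresh" '
                'content="0; url=someotherurl"></head></html>')
without_refresh = '<html><head><title>no redirect here</title></head></html>'

print(find_refresh(with_refresh))
print(find_refresh(without_refresh))
```

Feeding the output of read() to find_refresh() would confirm whether the raw page asks the browser to load a different URL.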


Next, I looked for javascript.  The Mozilla page contains lots of 
javascript, but there's none in the raw page.  So I can't explain 
Mozilla's differences that way.


I did notice the link to /m/Content/mobile2.css, but I don't know any 
way a CSS file could cause the content to change, just the display.


All I can guess is that it has something to do with "browser type" or 
cookies.  And that would make lots of sense if this was a cgi page.  But 
the URL doesn't look like that, as it doesn't end in pl, py, asp, or any 
of another dozen special suffixes.


Any hints, anybody???

DaveA



Web page data and urllib2.urlopen

2009-08-05 Thread Massi
Hi everyone, I'm using the urllib2 library to get the html source code
of web pages. In general it works great, but I'm having trouble with a
financial web site which does not provide the source code I expect. As
a matter of fact, if you try:

import urllib2
res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
page = res.read()
print page

you will see that the printed code is very different from the one
given, for example, by mozilla. Since I have really little knowledge
in html I can't even understand if this is a python or html problem.
Can anyone give me some help?
Thanks in advance.