Re: Web page data and urllib2.urlopen

Dave Angel Wed, 05 Aug 2009 22:27:31 -0700

Massi wrote:

Hi everyone, I'm using the urllib2 library to get the HTML source code
of web pages. In general it works great, but I'm dealing with a
financial web site which does not provide the source code I expect. As
a matter of fact, if you try:


import urllib2

# Fetch the raw HTML with a single HTTP request (Python 2).
res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
page = res.read()
print page

you will see that the printed code is very different from the one
shown, for example, by Mozilla. Since I have very little knowledge
of HTML, I can't even tell whether this is a Python or an HTML problem.
Can anyone give me some help?
Thanks in advance.

I don't think this is a Python issue, but a matter of a "raw read" versus an interactive interpretation of a page. The browser does a lot more than the single round trip performed by urlopen/read.

I would also love to see some explanation of what happens here, or a pointer to a reference that would help me understand it.

I took the output of the read() and formatted it, roughly, as HTML. I expected to find a refresh, which is the simplest way that one page can cause a very different one to be loaded.

     <meta http-equiv="refresh" content="1;url=someotherurl" />

If Mozilla had seen a page with this line in an appropriate place, it would immediately begin loading the other page, at "someotherurl". But there's no such line.
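For anyone who wants to repeat that check without reading the markup by eye, here is a minimal sketch that searches the raw text for a refresh tag. The names META_REFRESH and find_meta_refresh are made up for illustration, and the regular expression assumes that http-equiv comes before content inside the tag and that the content value is quoted. A real HTML parser would be more robust.

```python
import re

# Hypothetical helper pattern: matches <meta http-equiv="refresh" content="...">
# with the attributes in that order and a quoted content value.
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv\s*=\s*["\']?refresh["\']?[^>]*'
    r'content\s*=\s*["\']([^"\']*)["\']',
    re.IGNORECASE)


def find_meta_refresh(html):
    """Return (delay, url) for the first meta refresh found, or None.

    delay is returned as the string found in the tag; url is None when the
    tag only asks for a reload of the same page.
    """
    match = META_REFRESH.search(html)
    if match is None:
        return None
    # The content attribute looks like "1;url=someotherurl".
    delay, _, rest = match.group(1).partition(';')
    rest = rest.strip()
    if rest.lower().startswith('url='):
        rest = rest[4:]
    return delay.strip(), (rest.strip() or None)
```

Calling find_meta_refresh(page) on the text returned by read() gives None when the page has no such tag. On Python 3, read() returns bytes, so the data would need decoding first.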

Next, I looked for JavaScript. The Mozilla page contains lots of JavaScript, but there's none in the raw page. So I can't explain Mozilla's differences that way.

I did notice the link to /m/Content/mobile2.css, but I don't know of any way a CSS file could cause the content to change, just the display.

All I can guess is that it has something to do with "browser type" or cookies. And that would make lots of sense if this were a CGI page. But the URL doesn't look like that, as it doesn't end in pl, py, asp, or any of another dozen special suffixes.
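One way to test the "browser type" and cookie guess would be to send a browser-like User-Agent header and keep cookies between requests, then compare the result with the plain urlopen output. The sketch below is only that, a sketch: the helper names and the User-Agent string are illustrative assumptions, and nothing here is confirmed to change what this particular site returns. The try/except import lets the same code run on Python 2 (urllib2, cookielib) and Python 3 (urllib.request, http.cookiejar).

```python
try:
    import urllib2                       # Python 2, as in the original post
    import cookielib
except ImportError:
    import urllib.request as urllib2     # Python 3 equivalents
    import http.cookiejar as cookielib

# Illustrative value only; any browser-like string could be substituted.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_browser_request(url, user_agent=BROWSER_UA):
    """Build a Request that announces itself with a browser-like User-Agent."""
    return urllib2.Request(url, headers={"User-Agent": user_agent})


def build_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_as_browser(url):
    """Fetch url with the browser-like header and cookie handling enabled."""
    opener, _jar = build_cookie_opener()
    response = opener.open(build_browser_request(url))
    try:
        return response.read()
    finally:
        response.close()
```

If fetch_as_browser(url) returns markup closer to what Mozilla shows, the server is probably choosing its response from the request headers or cookies. If the output is unchanged, the difference lies elsewhere.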


Any hints, anybody???

DaveA

--
http://mail.python.org/mailman/listinfo/python-list
