Massi wrote:
Hi everyone, I'm using the urllib2 library to get the HTML source code
of web pages. In general it works great, but I'm dealing with a
financial web site which does not provide the source code I expect. As
a matter of fact, if you try:
import urllib2
res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
page = res.read()
print page
you will see that the printed code is very different from the one
given, for example, by Mozilla. Since I have very little knowledge
of HTML, I can't even tell whether this is a Python or an HTML problem.
Can anyone give me some help?
Thanks in advance.
I don't think this is a Python issue, but rather the difference between
a "raw read" and a browser's interactive interpretation of a page. The
browser does a lot more than the single round trip performed by
urlopen/read.
I also would love to see some explanation of what happens here, or a
pointer to a reference that would help me understand it.
I took the output of the read(), and formatted it, roughly, as html. I
expected to find a refresh, which is the simplest way that one page can
cause a very different one to be loaded.
<meta http-equiv="refresh" content="1;url=someotherurl" />
If Mozilla had seen a page with this line in an appropriate place, it'd
immediately begin loading the other page, at "someotherurl". But there's
no such line.
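
For anyone who wants to repeat that check without reading the markup by
eye, here is a minimal sketch that scans a page for meta refresh tags.
The names RefreshFinder and find_refresh_targets are made up for this
example, and the parsing assumes the simple "delay;url=target" form
shown above. On Python 3, the bytes returned by read() would need to be
decoded to text before being passed in.

```python
try:
    from HTMLParser import HTMLParser   # Python 2, as used in this thread
except ImportError:
    from html.parser import HTMLParser  # Python 3


class RefreshFinder(HTMLParser):
    """Collects the target URLs of <meta http-equiv="refresh"> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.targets = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # Assumed form of the content attribute: "1;url=someotherurl"
        content = attrs.get("content") or ""
        _delay, _sep, rest = content.partition(";")
        rest = rest.strip()
        if rest.lower().startswith("url="):
            self.targets.append(rest[4:].strip())


def find_refresh_targets(html):
    """Return a list of refresh target URLs found in the given HTML text."""
    finder = RefreshFinder()
    finder.feed(html)
    finder.close()
    return finder.targets
```

Calling find_refresh_targets(page) on the output of read() gives an
empty list when the page contains no refresh of this form.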
Next, I looked for javascript. The Mozilla page contains lots of
javascript, but there's none in the raw page. So I can't explain
Mozilla's differences that way.
I did notice the link to /m/Content/mobile2.css, but I don't know of any
way a CSS file could change the content, only the display.
All I can guess is that it has something to do with "browser type" or
cookies. And that would make lots of sense if this were a CGI page. But
the URL doesn't look like that, as it doesn't end in pl, py, asp, or any
of a dozen other special suffixes.
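
One way to test the "browser type" guess is to send a browser-like
User-Agent header with the request and compare the result with the
plain urlopen output. The sketch below only builds such a request. The
User-Agent string and the name build_browser_request are placeholders
chosen for this example, and whether this particular site actually
varies its output by User-Agent is an assumption, not something
confirmed in this thread.

```python
try:
    import urllib2                     # Python 2, as used in this thread
except ImportError:
    import urllib.request as urllib2   # Python 3 home of the same classes

# Placeholder value: any string a real browser would send can be used here.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:1.9) Gecko Firefox"


def build_browser_request(url, user_agent=BROWSER_UA):
    """Return a Request for url that identifies itself like a browser."""
    return urllib2.Request(url, headers={"User-Agent": user_agent})
```

Passing the result to urllib2.urlopen(...) and calling read() on the
response, as in the original snippet, then shows whether the server
answers differently. If the two outputs match, cookies would be the
next thing to look at.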
Any hints, anybody???
DaveA