Massi wrote:
Hi everyone, I'm using the urllib2 library to get the HTML source code
of web pages. In general it works great, but I'm dealing with a
financial web site which does not provide the source code I expect. As
a matter of fact, if you try:

import urllib2
res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
page = res.read()
print page

you will see that the printed code is very different from the one
given, for example, by Mozilla. Since I know very little about HTML,
I can't even tell whether this is a Python or an HTML problem.
Can anyone give me some help?
Thanks in advance.

I don't think this is a Python issue, but rather a "raw read" versus a browser's interactive interpretation of a page. The browser does a lot more than the single round trip performed by urlopen/read.

I also would love to see some explanation of what happens here, or a pointer to a reference that would help me understand it.

I took the output of the read(), and formatted it, roughly, as HTML. I expected to find a meta refresh, which is the simplest way that one page can cause a very different one to be loaded.
     <meta http-equiv="refresh" content="1;url=someotherurl" />

If Mozilla had seen a page with this line in an appropriate place, it would immediately begin loading the other page, at "someotherurl". But there's no such line.
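
For anyone who wants to repeat that check without reading the markup by eye, here is a rough sketch using the standard library's HTML parser. The names RefreshFinder and find_refresh are made up for illustration, and the import fallback is only there so the same code runs on Python 2 and Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class RefreshFinder(HTMLParser):
    """Collect the content attribute of every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.targets = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from the parser.
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() == "refresh":
            self.targets.append(attrs.get("content"))


def find_refresh(html):
    """Return the content values of all meta refresh tags found in html."""
    finder = RefreshFinder()
    finder.feed(html)
    finder.close()
    return finder.targets
```

An empty list from find_refresh(page) would match what I saw by hand: no refresh in the raw page.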

Next, I looked for javascript. The Mozilla page contains lots of javascript, but there's none in the raw page. So I can't explain Mozilla's differences that way.
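
The same kind of inspection can be scripted. This is only a sketch, with hypothetical names (ScriptCounter, summarize_scripts), that separates external scripts from inline ones in a string of HTML:

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class ScriptCounter(HTMLParser):
    """Record external script URLs and count inline <script> blocks."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.external = []  # src values of <script src="..."> tags
        self.inline = 0     # number of <script> tags without a src

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.external.append(src)
            else:
                self.inline += 1


def summarize_scripts(html):
    """Return (list of external script URLs, count of inline scripts)."""
    counter = ScriptCounter()
    counter.feed(html)
    counter.close()
    return counter.external, counter.inline
```

A result of ([], 0) for the raw page would confirm that it carries no scripts at all.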

I did notice the link to /m/Content/mobile2.css, but I don't know of any way a CSS file could cause the content to change, just the display.

All I can guess is that it has something to do with "browser type" (the User-Agent header the client sends) or cookies. And that would make lots of sense if this were a CGI page. But the URL doesn't look like one, as it doesn't end in pl, py, asp, or any of a dozen other special suffixes.
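
One way to test the "browser type" guess is to send a browser-like User-Agent header instead of the default one, which identifies the client as Python-urllib. The sketch below is untested against this site; whether the server actually varies its output on that header is an assumption. BROWSER_UA is a made-up example value, and build_request and fetch are hypothetical helper names.

```python
try:
    import urllib2 as request_lib  # Python 2
except ImportError:
    import urllib.request as request_lib  # Python 3 equivalent of urllib2

# Example browser-like identification string (an assumption; any real
# browser's User-Agent value could be substituted).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:1.9.1) Gecko/20090715 Firefox/3.5"


def build_request(url, user_agent=BROWSER_UA):
    """Build a Request that carries the given User-Agent header."""
    return request_lib.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA):
    """Fetch url with the given User-Agent and return the raw body."""
    response = request_lib.urlopen(build_request(url, user_agent))
    try:
        return response.read()
    finally:
        response.close()
```

Comparing fetch(url) with the output of a plain urlopen(url).read() would show whether the header alone accounts for the difference. If it does not, cookies would be the next thing to try.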

Any hints, anybody???

DaveA

--
http://mail.python.org/mailman/listinfo/python-list
