>>>>> Dave Angel <da...@ieee.org> (DA) wrote:

>DA> Massi wrote:
>>> Hi everyone, I'm using the urllib2 library to get the html source code
>>> of web pages. In general it works great, but I'm dealing with a
>>> financial web site which does not provide the source code I expect. As
>>> a matter of fact if you try:
>>>
>>> import urllib2
>>> res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
>>> page = res.read()
>>> print page
>>>
>>> you will see that the printed code is very different from the one
>>> given, for example, by mozilla. Since I have really little knowledge
>>> of html I can't even tell whether this is a python or an html problem.
>>> Can anyone give me some help?
>>> Thanks in advance.
>>>
>>>
>DA> I don't think this is a Python issue, but a "raw read" versus an
>DA> interactive interpretation of a page. The browser does lots more than a
>DA> single roundtrip defined by urlopen/read.
>DA> I also would love to see some explanation of what happens here, or a
>DA> pointer to a reference that would help me understand it.

>DA> I took the output of the read(), and formatted it, roughly, as html. I
>DA> expected to find a refresh, which is the simplest way that one page can
>DA> cause a very different one to be loaded.
>DA>     <meta http-equiv="refresh" content="1;url=someotherurl" />
>DA> If Mozilla had seen a page with this line in an appropriate place, it'd
>DA> immediately begin loading the other page, at "someotherurl". But there's
>DA> no such line.

>DA> Next, I looked for javascript. The Mozilla page contains lots of
>DA> javascript, but there's none in the raw page. So I can't explain Mozilla's
>DA> differences that way.

>DA> I did notice the link to /m/Content/mobile2.css, but I don't know any way
>DA> a CSS file could cause the content to change, just the display.

>DA> All I can guess is that it has something to do with "browser type" or
>DA> cookies. And that would make lots of sense if this was a cgi page. But
>DA> the URL doesn't look like that, as it doesn't end in pl, py, asp, or any of
>DA> another dozen special suffixes.

>DA> Any hints, anybody???

If you look into the HTML that Firefox gets, there is a lot of javascript
in it.
-- 
Piet van Oostrum <p...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
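
One way to test the "browser type" guess quoted above is to send a
browser-like User-Agent header and compare the result with the plain
urlopen() output. This is only a sketch: nothing in the thread confirms
that the site varies its response by User-Agent, and BROWSER_UA,
build_browser_request and fetch are names made up here for illustration.
The import fallback lets it run on Python 2 (urllib2, as in the original
post) or on Python 3 (urllib.request).

```python
try:
    import urllib2 as urlrequest          # Python 2, as used in the thread
except ImportError:
    import urllib.request as urlrequest   # Python 3 equivalent

# Hypothetical browser-like User-Agent string, chosen for illustration only.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:3.5) Gecko/20090624 Firefox/3.5"


def build_browser_request(url, user_agent=BROWSER_UA):
    """Return a Request object that announces the given User-Agent."""
    return urlrequest.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA):
    """Fetch url with the given User-Agent and return the raw response body."""
    response = urlrequest.urlopen(build_browser_request(url, user_agent))
    try:
        return response.read()
    finally:
        response.close()
```

Calling fetch() with the story URL and comparing its result against
urlopen(url).read() would show whether the header makes any difference.
Anything Firefox builds by running javascript would still be missing either
way, since neither call executes scripts.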