Re: Web page data and urllib2.urlopen
> Dave Angel (DA) wrote:

>DA> Piet van Oostrum wrote:
>>>
>DA> But the raw page didn't have any javascript. So what about that original
>DA> raw page triggered additional stuff to be loaded?

>DA> Is it "user agent", as someone else brought out? And is there somewhere I
>DA> can read more about that aspect of things? I've mostly built very static
>DA> html pages, where the server yields the same page to everybody. And some
>DA> form stuff, where the user clicks on a 'submit' button to trigger a script
>DA> that's not shown on the URL line.
>>>
>>> Yes, if you specify a 'normal' web browser as user agent you do get the
>>> Javascript:
>>>
>>>     import urllib2
>>>
>>>     request = urllib2.Request('http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27')
>>>     request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')
>>>
>>>     opener = urllib2.build_opener()
>>>     page = opener.open(request).read()
>>>     print page
>>>

>DA> Thanks much. That's a key I didn't understand.

You can even specify the headers in the Request constructor:

    url = 'http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27'
    hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13'}
    request = urllib2.Request(url=url, headers=hdr)

--
Piet van Oostrum
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
--
http://mail.python.org/mailman/listinfo/python-list
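For readers on Python 3, where urllib2 no longer exists and its
functionality lives in urllib.request, the two ways of setting the header
shown in this message can be sketched as below. This is only an
illustration: the names URL, USER_AGENT, req_ctor and req_added are made
up here, and nothing is actually fetched, so it says nothing about what
the site returns today.

```python
# Python 3 sketch: the thread's urllib2 became urllib.request.
import urllib.request

URL = ('http://www.marketwatch.com/story/'
       'mondays-biggest-gaining-and-declining-stocks-2009-07-27')
USER_AGENT = ('Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; '
              'rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')

# Variant 1: pass the headers to the Request constructor.
req_ctor = urllib.request.Request(URL, headers={'User-Agent': USER_AGENT})

# Variant 2: add the header after construction.
req_added = urllib.request.Request(URL)
req_added.add_header('User-Agent', USER_AGENT)

# Request capitalizes header names, so the key is stored as 'User-agent'.
same = req_ctor.header_items() == req_added.header_items()
print(same)
print(req_ctor.get_header('User-agent'))
```

Both variants build the same request; passing the object to
urllib.request.urlopen() or to an opener's open() would then send it.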
Re: Web page data and urllib2.urlopen
Piet van Oostrum wrote:
> DA> But the raw page didn't have any javascript. So what about that original
> DA> raw page triggered additional stuff to be loaded?
>
> DA> Is it "user agent", as someone else brought out? And is there somewhere I
> DA> can read more about that aspect of things? I've mostly built very static
> DA> html pages, where the server yields the same page to everybody. And some
> DA> form stuff, where the user clicks on a 'submit' button to trigger a script
> DA> that's not shown on the URL line.
>
> Yes, if you specify a 'normal' web browser as user agent you do get the
> Javascript:
>
>     import urllib2
>
>     request = urllib2.Request('http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27')
>     request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')
>
>     opener = urllib2.build_opener()
>     page = opener.open(request).read()
>     print page

Thanks much. That's a key I didn't understand.

DaveA
Re: Web page data and urllib2.urlopen
> Dave Angel (DA) wrote:

>DA> Piet van Oostrum wrote:
>>>
>DA> If Mozilla had seen a page with this line in an appropriate place, it'd
>DA> immediately begin loading the other page, at "someotherurl" But there's no
>DA> such line.
>>>
>DA> Next, I looked for javascript. The Mozilla page contains lots of
>DA> javascript, but there's none in the raw page. So I can't explain Mozilla's
>DA> differences that way.
>>>
>DA> I did notice the link to /m/Content/mobile2.css, but I don't know any way
>DA> a CSS file could cause the content to change, just the display.
>>>
>DA> All I can guess is that it has something to do with "browser type" or
>DA> cookies. And that would make lots of sense if this was a cgi page. But
>DA> the URL doesn't look like that, as it doesn't end in pl, py, asp, or any of
>DA> another dozen special suffixes.
>>>
>DA> Any hints, anybody???
>>>
>>> If you look into the HTML that Firefox gets, there is a lot of
>>> javascript in it.
>>>

>DA> But the raw page didn't have any javascript. So what about that original
>DA> raw page triggered additional stuff to be loaded?

>DA> Is it "user agent", as someone else brought out? And is there somewhere I
>DA> can read more about that aspect of things? I've mostly built very static
>DA> html pages, where the server yields the same page to everybody. And some
>DA> form stuff, where the user clicks on a 'submit' button to trigger a script
>DA> that's not shown on the URL line.

Yes, if you specify a 'normal' web browser as user agent you do get the
Javascript:

    import urllib2

    request = urllib2.Request('http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27')
    request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13')

    opener = urllib2.build_opener()
    page = opener.open(request).read()
    print page

--
Piet van Oostrum
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
Re: Re: Web page data and urllib2.urlopen
On Fri, Aug 7, 2009 at 3:47 AM, Dave Angel wrote:
> Piet van Oostrum wrote:
>>> DA> All I can guess is that it has something to do with "browser type" or
>>> DA> cookies. And that would make lots of sense if this was a cgi page. But
>>> DA> the URL doesn't look like that, as it doesn't end in pl, py, asp, or
>>> DA> any of another dozen special suffixes.

Note that the URL does not have to have any special suffix for it to be
dynamically generated. See any page at Wikipedia, for example. MediaWiki,
the software running the site, is a PHP application.

>>> DA> Any hints, anybody???
>>
>> If you look into the HTML that Firefox gets, there is a lot of
>> javascript in it.
>
> But the raw page didn't have any javascript. So what about that original
> raw page triggered additional stuff to be loaded?

FWIW, I'm getting a ton of javascript in the page downloaded using your
code fragment.

> Is it "user agent", as someone else brought out? And is there somewhere I
> can read more about that aspect of things? I've mostly built very static
> html pages, where the server yields the same page to everybody. And some
> form stuff, where the user clicks on a 'submit' button to trigger a script
> that's not shown on the URL line.

--
kushal
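To make the point about suffix-free dynamic pages concrete, here is a
small self-contained sketch in Python 3 (urllib.request and http.server,
not the urllib2 of the thread). It starts a throwaway server on the
loopback interface that answers the same URL differently depending on the
User-Agent header. The handler AgentSensitiveHandler, the helper fetch(),
the path /story/some-article and the page bodies are all invented for the
demonstration; how the real site decides what to serve is not known from
this thread.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class AgentSensitiveHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: same URL, different body per User-Agent."""

    def do_GET(self):
        agent = self.headers.get('User-Agent', '')
        if 'Mozilla' in agent:
            body = b'<html><script>/* full page */</script></html>'
        else:
            body = b'<html>plain mobile page</html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(url, user_agent=None):
    """Fetch url, optionally overriding urllib's default User-Agent."""
    headers = {'User-Agent': user_agent} if user_agent else {}
    request = urllib.request.Request(url, headers=headers)
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return response.read()


# Port 0 asks the OS for any free port on the loopback interface.
server = HTTPServer(('127.0.0.1', 0), AgentSensitiveHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/story/some-article' % server.server_port

default_page = fetch(url)
browser_page = fetch(url, 'Mozilla/5.0 (hypothetical test agent)')
server.shutdown()
server.server_close()

print(default_page)
print(browser_page)
```

The URL has no pl, py or asp suffix, yet the two responses differ, which
is the same kind of behaviour the thread observed between urllib2 and
Firefox.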
Re: Re: Web page data and urllib2.urlopen
Piet van Oostrum wrote:
> DA> If Mozilla had seen a page with this line in an appropriate place, it'd
> DA> immediately begin loading the other page, at "someotherurl" But there's no
> DA> such line.
>
> DA> Next, I looked for javascript. The Mozilla page contains lots of
> DA> javascript, but there's none in the raw page. So I can't explain Mozilla's
> DA> differences that way.
>
> DA> I did notice the link to /m/Content/mobile2.css, but I don't know any way
> DA> a CSS file could cause the content to change, just the display.
>
> DA> All I can guess is that it has something to do with "browser type" or
> DA> cookies. And that would make lots of sense if this was a cgi page. But
> DA> the URL doesn't look like that, as it doesn't end in pl, py, asp, or any of
> DA> another dozen special suffixes.
>
> DA> Any hints, anybody???
>
> If you look into the HTML that Firefox gets, there is a lot of
> javascript in it.

But the raw page didn't have any javascript. So what about that original
raw page triggered additional stuff to be loaded?

Is it "user agent", as someone else brought out? And is there somewhere I
can read more about that aspect of things? I've mostly built very static
html pages, where the server yields the same page to everybody. And some
form stuff, where the user clicks on a 'submit' button to trigger a script
that's not shown on the URL line.
Re: Web page data and urllib2.urlopen
> Dave Angel (DA) wrote:

>DA> Massi wrote:
>>> Hi everyone, I'm using the urllib2 library to get the html source code
>>> of web pages. In general it works great, but I'm having to do with a
>>> financial web site which does not provide the source code I expect. As
>>> a matter of fact if you try:
>>>
>>>     import urllib2
>>>     res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
>>>     page = res.read()
>>>     print page
>>>
>>> you will see that the printed code is very different from the one
>>> given, for example, by mozilla. Since I have really little knowledge
>>> in html I can't even understand if this is a python or html problem.
>>> Can anyone give me some help?
>>> Thanks in advance.
>>>

>DA> I don't think this is a Python issue, but a "raw read" versus an
>DA> interactive interpretation of a page. The browser does lots more than a
>DA> single roundtrip defined by urlopen/read.

>DA> I also would love to see some explanation of what happens here, or a
>DA> pointer to a reference that would help me understand it.

>DA> I took the output of the read(), and formatted it, roughly, as html. I
>DA> expected to find a refresh, which is the simplest way that one page can
>DA> cause a very different one to be loaded.
>DA>
>DA> If Mozilla had seen a page with this line in an appropriate place, it'd
>DA> immediately begin loading the other page, at "someotherurl" But there's no
>DA> such line.

>DA> Next, I looked for javascript. The Mozilla page contains lots of
>DA> javascript, but there's none in the raw page. So I can't explain Mozilla's
>DA> differences that way.

>DA> I did notice the link to /m/Content/mobile2.css, but I don't know any way
>DA> a CSS file could cause the content to change, just the display.

>DA> All I can guess is that it has something to do with "browser type" or
>DA> cookies. And that would make lots of sense if this was a cgi page. But
>DA> the URL doesn't look like that, as it doesn't end in pl, py, asp, or any of
>DA> another dozen special suffixes.

>DA> Any hints, anybody???

If you look into the HTML that Firefox gets, there is a lot of
javascript in it.

--
Piet van Oostrum
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
Re: Web page data and urllib2.urlopen
On Aug 5, 4:30 pm, Massi wrote:
> Hi everyone, I'm using the urllib2 library to get the html source code
> of web pages. In general it works great, but I'm having to do with a
> financial web site which does not provide the source code I expect. As
> a matter of fact if you try:
>
>     import urllib2
>     res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
>     page = res.read()
>     print page
>
> you will see that the printed code is very different from the one
> given, for example, by mozilla. Since I have really little knowledge
> in html I can't even understand if this is a python or html problem.
> Can anyone give me some help?
> Thanks in advance.

Check if setting your user agent to Mozilla results in a different page:

http://diveintopython.org/http_web_services/user_agent.html
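As background to this suggestion: unless told otherwise, urllib identifies
itself to the server with its own User-Agent, which a site can tell apart
from a browser. The sketch below uses Python 3's urllib.request (the
successor of the urllib2 used in the thread) to show that default and one
way to replace it for every request made through an opener. The variable
names and the replacement agent string are invented for illustration, and
no request is sent.

```python
import urllib.request

opener = urllib.request.build_opener()

# addheaders is a list of (name, value) pairs sent with every request
# made through this opener; by default it holds only the User-Agent.
default_headers = dict(opener.addheaders)
default_agent = default_headers['User-agent']
print(default_agent)

# Replacing the list changes the User-Agent for everything this opener
# opens. The agent string here is a made-up placeholder.
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (hypothetical test agent)')]
new_agent = dict(opener.addheaders)['User-Agent']
print(new_agent)
```

The first line printed starts with "Python-urllib/" followed by the
library version, which is what the server sees when plain urlopen() is
used.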
Re: Web page data and urllib2.urlopen
Massi wrote:
> Hi everyone, I'm using the urllib2 library to get the html source code
> of web pages. In general it works great, but I'm having to do with a
> financial web site which does not provide the source code I expect. As
> a matter of fact if you try:
>
>     import urllib2
>     res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
>     page = res.read()
>     print page
>
> you will see that the printed code is very different from the one
> given, for example, by mozilla. Since I have really little knowledge
> in html I can't even understand if this is a python or html problem.
> Can anyone give me some help?
> Thanks in advance.

I don't think this is a Python issue, but a "raw read" versus an
interactive interpretation of a page. The browser does lots more than a
single roundtrip defined by urlopen/read.

I also would love to see some explanation of what happens here, or a
pointer to a reference that would help me understand it.

I took the output of the read(), and formatted it, roughly, as html. I
expected to find a refresh, which is the simplest way that one page can
cause a very different one to be loaded.

If Mozilla had seen a page with this line in an appropriate place, it'd
immediately begin loading the other page, at "someotherurl" But there's no
such line.

Next, I looked for javascript. The Mozilla page contains lots of
javascript, but there's none in the raw page. So I can't explain Mozilla's
differences that way.

I did notice the link to /m/Content/mobile2.css, but I don't know any way
a CSS file could cause the content to change, just the display.

All I can guess is that it has something to do with "browser type" or
cookies. And that would make lots of sense if this was a cgi page. But
the URL doesn't look like that, as it doesn't end in pl, py, asp, or any of
another dozen special suffixes.

Any hints, anybody???

DaveA
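The check for a refresh described in this message can also be done
mechanically rather than by eye. A page-level refresh is normally written
as a meta tag whose http-equiv attribute is "refresh". The sketch below,
using Python 3's standard html.parser, collects the content attribute of
any such tag. RefreshFinder, find_refresh and the sample markup are
illustrative names and data, not anything taken from the page under
discussion.

```python
from html.parser import HTMLParser


class RefreshFinder(HTMLParser):
    """Collect the content attribute of every meta refresh tag."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != 'meta':
            return
        attributes = dict(attrs)
        if (attributes.get('http-equiv') or '').lower() == 'refresh':
            self.targets.append(attributes.get('content'))


def find_refresh(html_text):
    """Return the refresh directives found in html_text (possibly empty)."""
    finder = RefreshFinder()
    finder.feed(html_text)
    return finder.targets


# Made-up sample pages: one with a refresh, one without.
sample = ('<html><head>'
          '<meta http-equiv="refresh" content="0; url=someotherurl">'
          '</head><body></body></html>')
print(find_refresh(sample))
print(find_refresh('<html><body>no redirect here</body></html>'))
```

An empty list for the raw page would confirm what was found by hand here:
the difference from the browser's version is not caused by a refresh.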
Web page data and urllib2.urlopen
Hi everyone, I'm using the urllib2 library to get the html source code
of web pages. In general it works great, but I'm having to do with a
financial web site which does not provide the source code I expect. As
a matter of fact if you try:

    import urllib2
    res = urllib2.urlopen("http://www.marketwatch.com/story/mondays-biggest-gaining-and-declining-stocks-2009-07-27")
    page = res.read()
    print page

you will see that the printed code is very different from the one
given, for example, by mozilla. Since I have really little knowledge
in html I can't even understand if this is a python or html problem.
Can anyone give me some help?
Thanks in advance.