Re: using urllib on a more complex site

Dave Angel Sun, 24 Feb 2013 16:33:53 -0800

On 02/24/2013 07:02 PM, Adam W. wrote:

I'm trying to write a simple script to scrape 
http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day


in order to send myself an email every day of the 99c movie of the day.

However, using a simple command like (in Python 3.0):
urllib.request.urlopen('http://www.vudu.com/movies/#tag/99centOfTheDay/99c%20Rental%20of%20the%20day').read()

I don't get all the source I need, it's just the navigation buttons.  Now I assume 
they are using some CSS/javascript witchcraft to load all the useful data later, so my 
question is how do I make urllib "wait" and grab that data as well?

The CSS and the jpegs, and many other aspects of a web "page" are loaded explicitly, by the browser, when parsing the tags of the page you downloaded. There is no sooner or later. The website won't send the other files until you request them.
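
As a rough illustration of that point, the sketch below parses a page's source and lists the other URLs a browser would then have to request one by one. The sample markup, the stylesheet and script paths, and the ResourceCollector name are made up for the example; only the img tag comes from the site. It assumes the tags are present in the source urllib returns, which is not guaranteed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag -> attribute that names a separately fetched resource.
RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}


class ResourceCollector(HTMLParser):
    """Collect the URLs a browser would request after parsing a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attr_name = RESOURCE_ATTRS.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            # Resolve relative references against the page URL.
            self.resources.append(urljoin(self.base_url, value))


# Hypothetical page source, standing in for what urlopen(...).read() returns.
SAMPLE_PAGE = """
<html><head>
<link rel="stylesheet" href="/css/site.css">
<script src="js/app.js"></script>
</head><body>
<img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m" alt="Sex and the City: The Movie (Theatrical)">
</body></html>
"""

collector = ResourceCollector("http://www.vudu.com/movies/")
collector.feed(SAMPLE_PAGE)
for url in collector.resources:
    print(url)
```

Each printed URL is a file the server only sends in response to its own request; fetching the page alone never delivers them.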

For example, that site at the moment has one image (prob. jpeg) highlighted,

<img class="gwt-Image" src="http://images2.vudu.com/poster2/179186-m" alt="Sex and the City: The Movie (Theatrical)">

If you want to look at that jpeg, you need to download the file at the URL specified by the src attribute of that img element.

Or perhaps you can just look at the 'alt' attribute, which is mainly there for browsers that don't happen to do graphics, for example, the ones for the blind.
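
A minimal sketch of both options, using only the standard library: it pulls the src and alt attributes out of img elements like the one quoted above, and includes a small helper for fetching the jpeg with a second request. The PosterParser and download names are invented for the example, and matching on the "gwt-Image" class assumes the site keeps using it.

```python
from html.parser import HTMLParser
import urllib.request


class PosterParser(HTMLParser):
    """Record (src, alt) for every <img> carrying a given class."""

    def __init__(self, css_class="gwt-Image"):
        super().__init__()
        self.css_class = css_class
        self.posters = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.css_class in classes:
            self.posters.append((attributes.get("src"), attributes.get("alt")))


def download(url):
    """Fetch the resource at url with a separate request and return its bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


# The img element quoted above, standing in for the downloaded page source.
SAMPLE = ('<img class="gwt-Image" '
          'src="http://images2.vudu.com/poster2/179186-m" '
          'alt="Sex and the City: The Movie (Theatrical)">')

parser = PosterParser()
parser.feed(SAMPLE)
for src, alt in parser.posters:
    print(alt, src)
    # image_bytes = download(src)  # uncomment to fetch the jpeg itself
```

For a daily email, the alt text alone may be enough, so the download call is left commented out.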

Naturally, there may be dozens of images on the page, and there's no guarantee that the website author is trying to make it easy for you. Why not check if there's a defined API for extracting the information you want? Check the site, or send a message to the webmaster.

No guarantee that tomorrow, the information won't be buried in some javascript fragment. Again, if you want to see that, you might need to write a javascript interpreter. It could use any algorithm at all to build webpage information, and the encoding could change day by day, or hour by hour.


--
DaveA
--
http://mail.python.org/mailman/listinfo/python-list
