## I was looking in my database of movie grosses I regularly copy
## from the Internet Movie Database and noticed I was _only_ 120
## weeks behind in my updates.
##
## Ouch.
##
## Copying a web page, pasting it into a text file, running a perl
## script to convert it into a csv file and manually importing it
## into Access isn't so bad when you only have a couple to do at
## a time. Still, it's a labor-intensive process, and 120 isn't
## anything to look forward to.
##
## But I abandoned perl years ago when I took up Python, so I
## can use Python to completely automate the process now.
##
## Just have to figure out how.
##
## There are three main tasks: capture the web page, parse the web
## page to extract the data, and insert the data into the database.
##
## But I only know how to do the last step, using the odbc tools
## from win32,
import dbi
import odbc
import re

## so I snoop around comp.lang.python to pick up some
## hints and keywords on how to do the other two tasks.
##
## Documentation on urllib2 was a bit vague, but I got the web page
## after only a couple missteps.

import urllib2

## Unfortunately, HTMLParser remained beyond my grasp (is it
## my imagination, or is the quality of the examples in the
## documentation inversely proportional to the subject
## difficulty?)
##
## Luckily, my bag of hints had a reference to Beautiful Soup,
## whose web site proclaims:
##
##     Beautiful Soup is a Python HTML/XML parser
##     designed for quick turnaround projects like
##     screen-scraping.
##
## Looks like just what I need. maybe I can figure it out after all.

from BeautifulSoup import BeautifulSoup

target_dates = [['11','24','2006','November']]

con = odbc.odbc("IMDB")     # connect to MS-Access database
cursor = con.cursor()

for d in target_dates:
    #
    # build url (with CGI parameters) from the list of dates
    # needing updating
    #
    the_year = d[2]
    the_date = '/'.join([d[0],d[1],d[2]])
    print '%10s scraping IMDB:' % (the_date),
    the_url = ''.join([r'http://www.imdb.com/BusinessThisDay?day=',
                       d[1],'&month=',d[3]])
    req = urllib2.Request(url=the_url)
    f = urllib2.urlopen(req)
    www = f.read()
    #
    # ok, page captured. now make a BeautifulSoup object from it
    #
    soup = BeautifulSoup(www)
    #
    # that was easy, much more so than HTMLParser
    #
    # now, _all_ I have to do is figure out how to parse it
    #
    # ouch again. this is a lot harder than it looks in the
    # documentation. I need to get the data from cells of a
    # table nested inside another table, and that's hard to
    # extrapolate from the examples showing how to find all
    # the comments on a web page.
    #
    # but this looks promising. if I grab all the table rows
    # (tr tags), each complete nested table is inside a cell
    # of the outer table (whose table tags are lost, but aren't
    # needed, and whose absence makes extracting the nested
    # tables easier (when you do it the stupid way, but hey,
    # it works, so I'm sticking with it))
    #
    tr = soup.tr        # table rows
    tr.extract()
    #
    # now, I only want the third nested table. how do I get it?
    # can't seem to get past the first one. should I be using
    # nextSibling or something? <scratches head...>
    #
    # but wait...I don't need the first two tables, so I can
    # simply extract and discard them. and since .extract()
    # CUTS the tables, after two extractions the table I want
    # IS the first one.
    #
    the_table = tr.find('table')    # discard
    the_table.extract()
    the_table = tr.find('table')    # discard
    the_table.extract()
    the_table = tr.find('table')    # weekly gross
    the_table.extract()
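    #
    # (an aside, with hindsight: BeautifulSoup 3 also has a findAll
    # method that returns every matching tag in document order,
    # nested tables included. so the extract-and-discard dance could
    # probably be replaced by a single indexed lookup, something like
    #
    ## >>> nested = tr.findAll('table')   # all nested tables, in order
    ## >>> the_table = nested[2]          # third one: the weekly gross
    #
    # findAll is documented BS3 API, but whether the weekly-gross
    # table stays at index [2] across IMDB page revisions is an
    # assumption on my part, so I'm leaving the stupid-but-working
    # version in.)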
    #
    # of course, the data doesn't start in the first row.
    # there's formatting, header rows, etc. looks like it starts
    # in tr number [3]
    #
    ## >>> the_table.contents[3].td
    ## <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
    #
    # and since tags always imply the first one, the above
    # is equivalent to
    #
    ## >>> the_table.contents[3].contents[0]
    ## <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
    #
    # and since the title is the first of three cells, the
    # reporting year is
    #
    ## >>> the_table.contents[3].contents[1]
    ## <td> <a href="/Sections/Years/2001">2001</a> </td>
    #
    # finally, the 3rd cell must contain the gross
    #
    ## >>> the_table.contents[3].contents[2]
    ## <td align="RIGHT"> 259,674,120</td>
    #
    # but the contents of the first two cells are anchor tags.
    # to get the actual title string, I need the contents of the
    # contents. but that's not exactly what I want either:
    # I don't want a list, I need a string. and the string isn't
    # always in the same place in the list.
    #
    # summarizing, what I need is
    #
    ## print the_table.contents[3].contents[0].contents[0].contents,
    ## print the_table.contents[3].contents[1].contents[1].contents,
    ## print the_table.contents[3].contents[2].contents
    #
    # and that almost works. just a couple more tweaks and I can
    # shove it into the database
    #
    parsed = []
    for rec in the_table.contents[3:]:
        the_rec_type = type(rec)    # some recs are NavigableStrings, skip those
        if str(the_rec_type) == "<type 'instance'>":
            #
            # ok, got a real data row
            #
            TITLE_DATE = rec.contents[0].contents[0].contents   # a list inside a tuple
            #
            # and that means we still have to index the contents
            # of the contents of the contents of the contents by
            # adding [0][0] to TITLE_DATE
            #
            YEAR = rec.contents[1].contents[1].contents         # ditto
            #
            # this won't go into the database, it's just used as a
            # filter to grab the records associated with the posting
            # date and discard the others (which should already be
            # in the database)
            #
            GROSS = rec.contents[2].contents                    # just a list
            #
            # one other minor glitch: the film date is part of the
            # title (where it's of no use in the database), so it has
            # to be pulled out and put in a separate field
            #
            temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\)) (.*)',
                                   str(TITLE_DATE[0][0]))
            #
            # which works 99% of the time. unfortunately, IMDB's
            # consistency is somewhat dubious. the date is _supposed_
            # to be at the end of the string, but sometimes it's not.
            # so, usually, there are only 5 groups, but you have to
            # allow for the fact that there may be 6
            #
            try:
                the_title = temp_title.group(1) + temp_title.group(5)
            except:
                the_title = temp_title.group(1)
            the_gross = str(GROSS[0])
            #
            # and for some unexplained reason, dates will occasionally
            # be 2001/I instead of 2001, so we want to discard the
            # trailing crap, if any
            #
            the_film_year = temp_title.group(3)[:4]
            if str(YEAR[0][0])==the_year:
                parsed.append([the_date,the_title,the_film_year,the_gross])
    print '%3d records found ' % (len(parsed)),
    #
    # wow, now just have to insert all the update records directly
    # into the database...into a temporary table, of course. as I
    # said, IMDB's consistency is somewhat dubious (such as changing
    # the spelling of titles), so a QC check will be required inside
    # Access
    #
    if len(parsed)>0:
        print '...inserting into database'
        for p in parsed:
            cursor.execute("""
                INSERT INTO imdweeks2 (
                Date_reported, Title, Film_Date, Gross_to_Date )
                SELECT ?,?,?,?;""",p)
    else:
        print '...aborting, no records found'

cursor.close()
con.close()
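## (about that QC check: before the connection is closed, something
## along these lines could flag titles in the temp table that don't
## match anything already on file -- IMDB respellings and genuinely
## new films both turn up this way. a sketch only: "imdweeks" as the
## name of the permanent table is an assumption, since only the temp
## table imdweeks2 appears above.
##
## >>> cursor.execute("""
## ...     SELECT t.Title FROM imdweeks2 AS t
## ...     LEFT JOIN imdweeks AS m ON t.Title = m.Title
## ...     WHERE m.Title IS NULL;""")
## >>> suspects = cursor.fetchall()   # rows needing a manual look
##
## in practice, though, I do this part inside Access itself.)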
# and just because it works doesn't mean it's right.
# but hey, you get what you pay for. I'm _sure_ if I were
# to pay for a subscription to IMDBPro, I wouldn't see
# these errors ;-)

## 11/26/2006 scraping IMDB:   5 records found  ...inserting into database
##
## strange, only 5 films were in release over Thanksgiving weekend?
##
## of course not, IMDB screwed up (holidays being chronic problems
## for IMDB). for some reason, the others were posted on Friday
## instead of Sunday. at least that can be corrected by scraping
## the page for 11/24.
##
## mis-reported records found on 11/24:
## 11/24/2006 scraping IMDB:  88 records found  ...inserting into database
##
## other data is just plain missing (don't they do QA/QC at IMDB?
## how hard is it to figure out that each week has 60-120 records?)
##
##  3/25/2007 scraping IMDB:   4 records found  ...inserting into database
## records appear to be missing
##   4/8/2007 scraping IMDB:   6 records found  ...inserting into database
## records appear to be missing
##  4/15/2007 scraping IMDB:  17 records found  ...inserting into database
## records appear to be missing
##  5/13/2007 scraping IMDB:   4 records found  ...inserting into database
## records appear to be missing
##  7/15/2007 scraping IMDB:   1 records found  ...inserting into database
## records appear to be missing
## 10/14/2007 scraping IMDB: 128 records found  ...inserting into database
## 10/21/2007 scraping IMDB: 130 records found  ...inserting into database
## 10/28/2007 scraping IMDB:   1 records found  ...inserting into database
## records appear to be missing
##  11/4/2007 scraping IMDB: 105 records found  ...inserting into database
## 11/11/2007 scraping IMDB:   2 records found  ...inserting into database
## records appear to be missing
## 11/18/2007 scraping IMDB: 101 records found  ...inserting into database
##
## other errors are legitimate. last week's grosses simply
## haven't been posted yet, and they are sometimes weeks
## behind. bet that doesn't happen with IMDBPro ;-)
##
## 11/25/2007 scraping IMDB:   0 records found  ...aborting, no records found
##
## so, next update session I start with 11/25.
##
## there, that wasn't so bad. only took about 40 minutes
## to scrape 120 or so web pages. that's one of the things
## I like about Python: even with documentation frustration,
## I've got what I need at the end of the day

--
http://mail.python.org/mailman/listinfo/python-list