On May 16, 11:43 pm, Sanoski <[EMAIL PROTECTED]> wrote:
> I'm pretty new to programming. I've just been studying a few weeks off
> and on. I know a little, and I'm learning as I go. Programming is so
> much fun! I really wish I had gotten into it years ago, but here's my
> question. I have a long-term project in mind, and I want to know if
> it's feasible and how difficult it will be.
>
> There's an XML feed for my school that some other class designed. It's
> just a simple idea that lists all classes, room numbers, and the person
> with the highest GPA. The feed is set up like this; each one of the
> following lines would also be a link to more information about the
> class, etc.
>
> Economics, Room 216, James Faker, 3.4
> Social Studies, Room 231, Brain Fictitious, 3.5
>
> etc, etc
>
> The student also has a picture reference that depicts his GPA based on
> the number. The picture is basically just a graph. I just want to
> write a program that uses the information on this feed.
>
> I want it to reach out to this XML feed, record each instance of the
> above format along with the picture reference of the highest-GPA
> student, download it locally, and then be able to use that information
> in various ways. I figured I'll start by counting each instance. For
> example, the above would be 2 instances.
>
> Eventually, I want it to be able to cross-reference data you've
> already downloaded, and be able to compare GPAs, etc. It would have a
> GUI and everything too, but I am trying to keep it simple right now,
> and just build onto it as I learn.
>
> So let's just say this. How do you grab information from the web,

Depends on the web page.

> in this case a feed,

Haven't tried that, just a simple CGI.
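Though for an XML feed specifically, you may not need an HTML scraper
at all: ElementTree ships with Python 2.5. I haven't seen your feed,
so the URL and the tag names below are pure guesses (adjust them to
whatever your feed actually serves), but the shape of the code would
be something like:

import urllib2
from xml.etree import ElementTree

FEED_URL = 'http://example.edu/gpafeed.xml'     # made-up URL

feed = urllib2.urlopen(FEED_URL)
tree = ElementTree.parse(feed)

records = []
for klass in tree.findall('.//class'):          # guessed tag names
    records.append([klass.findtext('name'),
                    klass.findtext('room'),
                    klass.findtext('student'),
                    float(klass.findtext('gpa'))])

print len(records), 'classes found'             # your "count each instance"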
> and then use that in calculations?

The key is some type of structure, be it database records, a list of
lists, or whatever. Something that you can iterate through, sort, find
the max element of, etc.

> How would you implement such a project?

The example below uses BeautifulSoup. I'm posting it not because it
matches your problem, but to give you an idea of the techniques
involved.

> Would you save the information into a text file?

Possibly, but generally no. Text files aren't very useful except as a
data exchange medium.

> Or would you use something else?

Your application lends itself to a database approach. Note that in my
example the database part of the code is disabled; not everyone has
MS-Access on Windows.

> Should I study up on SQLite?

Yes. The MS-Access code I have can easily be changed to SQLite.
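For instance, an untested sketch (the table layout and file name are
invented; sqlite3 is in the standard library from Python 2.5 on):

import sqlite3

con = sqlite3.connect('school.db')              # made-up file name
cur = con.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS classes
               (name TEXT, room TEXT, student TEXT, gpa REAL)""")
cur.executemany("INSERT INTO classes VALUES (?,?,?,?)",
                [('Economics', 'Room 216', 'James Faker', 3.4),
                 ('Social Studies', 'Room 231', 'Brain Fictitious', 3.5)])
con.commit()

# "use that in calculations": the highest GPA on file
cur.execute("SELECT student, MAX(gpa) FROM classes")
print cur.fetchone()
con.close()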
> Maybe I should study classes.

I don't know, but I've always gotten along without them.

> I'm just not sure. What would be the most effective technique?

Don't know that either, as I've only done it once, as follows:

## I was looking in my database of movie grosses I regularly copy
## from the Internet Movie Database and noticed I was _only_ 120
## weeks behind in my updates.
##
## Ouch.
##
## Copying a web page, pasting into a text file, running a perl
## script to convert it into a csv file and manually importing it
## into Access isn't so bad when you only have a couple to do at
## a time. Still, it's a labor intensive process and 120 isn't
## anything to look forward to.
##
## But I abandoned perl years ago when I took up Python, so I
## can use Python to completely automate the process now.
##
## Just have to figure out how.
##
## There are 3 main tasks: capture the web page, parse the web page
## to extract the data, and insert the data into the database.
##
## But I only know how to do the last step, using the odbc tools
## from win32,

####import dbi
####import odbc

import re

## so I snoop around comp.lang.python to pick up some
## hints and keywords on how to do the other two tasks.
##
## Documentation on urllib2 was a bit vague, but got the web page
## after only a couple mis-steps.

import urllib2

## Unfortunately, HTMLParser remained beyond my grasp (is it
## my imagination or is the quality of the examples in the
## documentation inversely proportional to the subject
## difficulty?)
##
## Luckily, my bag of hints had a reference to Beautiful Soup,
## whose web site proclaims:
##
##     Beautiful Soup is a Python HTML/XML parser
##     designed for quick turnaround projects like
##     screen-scraping.
##
## Looks like just what I need, maybe I can figure it out after all.

from BeautifulSoup import BeautifulSoup

target_dates = [['4','6','2008','April']]

####con = odbc.odbc("IMDB")    # connect to MS-Access database
####cursor = con.cursor()

for d in target_dates:
    #
    # build url (with CGI parameters) from list of dates needing updating
    #
    the_year = d[2]
    the_date = '/'.join([d[0],d[1],d[2]])
    print '%10s scraping IMDB:' % (the_date),
    the_url = ''.join([r'http://www.imdb.com/BusinessThisDay?day=',
                       d[1],'&month=',d[3]])
    req = urllib2.Request(url=the_url)
    f = urllib2.urlopen(req)
    www = f.read()
    #
    # ok, page captured. now make a BeautifulSoup object from it
    #
    soup = BeautifulSoup(www)
    #
    # that was easy, much more so than HTMLParser
    #
    # now, _all_ I have to do is figure out how to parse it
    #
    # ouch again. this is a lot harder than it looks in the
    # documentation. I need to get the data from cells of a
    # table nested inside another table, and that's hard to
    # extrapolate from the examples showing how to find all
    # the comments on a web page.
    #
    # but this looks promising. if I grab all the table rows
    # (tr tags), each complete nested table is inside a cell
    # of the outer table (whose table tags are lost, but aren't
    # needed and whose absence makes extracting the nested
    # tables easier (when you do it the stupid way, but hey,
    # it works, so I'm sticking with it))
    #
    tr = soup.tr     # table rows
    tr.extract()
    #
    # now, I only want the third nested table. how do I get it?
    # can't seem to get past the first one, should I be using
    # NextSibling or something? <scratches head...>
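    #
    # (in hindsight, findAll would probably hand me the third
    # table directly -- something like
    #
    #     the_table = tr.findAll('table')[2]
    #
    # -- but I haven't tested that against this page, so below
    # is the extract-and-discard way I actually used)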
    #
    # but wait...I don't need the first two tables, so I can
    # simply extract and discard them. and since .extract()
    # CUTS the tables, after two extractions the table I want
    # IS the first one.
    #
    the_table = tr.find('table')    # discard
    the_table.extract()
    the_table = tr.find('table')    # discard
    the_table.extract()
    the_table = tr.find('table')    # weekly gross
    the_table.extract()
    #
    # of course, the data doesn't start in the first row,
    # there's formatting, header rows, etc. looks like it starts
    # in tr number [3]
    #
    ## >>> the_table.contents[3].td
    ## <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
    #
    # and since tags always imply the first one, the above
    # is equivalent to
    #
    ## >>> the_table.contents[3].contents[0]
    ## <td><a href="/title/tt0170016/">How the Grinch Stole Christmas (2000)</a> </td>
    #
    # and since the title is the first of three cells, the
    # reporting year is
    #
    ## >>> the_table.contents[3].contents[1]
    ## <td> <a href="/Sections/Years/2001">2001</a> </td>
    #
    # finally, the 3rd cell must contain the gross
    #
    ## >>> the_table.contents[3].contents[2]
    ## <td align="RIGHT"> 259,674,120</td>
    #
    # but the contents of the first two cells are anchor tags.
    # to get the actual title string, I need the contents of the
    # contents. but that's not exactly what I want either,
    # I don't want a list, I need a string. and the string isn't
    # always in the same place in the list
    #
    # summarizing, what I need is
    #
    ## print the_table.contents[3].contents[0].contents[0].contents,
    ## print the_table.contents[3].contents[1].contents[1].contents,
    ## print the_table.contents[3].contents[2].contents
    #
    # and that almost works, just a couple more tweaks and I can
    # shove it into the database
    #
    parsed = []
    for rec in the_table.contents[3:]:
        the_rec_type = type(rec)    # some recs are NavigableStrings, skip
        if str(the_rec_type) == "<type 'instance'>":
            #
            # ok, got a real data row
            #
            TITLE_DATE = rec.contents[0].contents[0].contents   # a list inside a tuple
            #
            # and that means we still have to index the contents
            # of the contents of the contents of the contents by
            # adding [0][0] to TITLE_DATE
            #
            YEAR = rec.contents[1].contents[1].contents         # ditto
            #
            # this won't go into the database, it's just used as a filter
            # to grab the records associated with the posting date and
            # discard the others (which should already be in the database)
            #
            GROSS = rec.contents[2].contents                    # just a list
            #
            # one other minor glitch: the film date is part of the title
            # (where it's of no use in the database), so it has to be
            # pulled out and put in a separate field
            #
            # temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\)) (.*)',str(TITLE_DATE[0][0]))
            temp_title = re.search('(.*?)( \()([0-9]{4}.*)(\)) (.*)',str(TITLE_DATE))
            #
            # which works 99% of the time. unfortunately, IMDB
            # consistency is somewhat dubious. the date is _supposed_
            # to be at the end of the string, but sometimes it's not.
            # so, usually, there are only 5 groups, but you have to
            # allow for the fact that there may be 6
            #
            try:
                the_title = temp_title.group(1) + temp_title.group(5)
            except:
                the_title = temp_title.group(1)
            the_gross = str(GROSS[0])
            #
            # and for some unexplained reason, dates will occasionally
            # be 2001/I instead of 2001, so we want to discard the
            # trailing crap, if any
            #
            the_film_year = temp_title.group(3)[:4]
            # if str(YEAR[0][0])==the_year:
            if str(YEAR[0])==the_year:
                parsed.append([the_date,the_title,the_film_year,the_gross])
    print '%3d records found ' % (len(parsed))
    #
    # wow, now just have to insert all the update records directly
    # into the database...into a temporary table, of course. as I said,
    # IMDB consistency is somewhat dubious (such as changing the spelling
    # of the titles), so a QC check will be required inside Access
    #
####    if len(parsed)>0:
####        print '...inserting into database'
####        for p in parsed:
####            cursor.execute("""
####INSERT INTO imdweeks2 ( Date_reported, Title, Film_Date, Gross_to_Date )
####SELECT ?,?,?,?;""",p)
####    else:
####        print '...aborting, no records found'
####
####cursor.close()
####con.close()

for p in parsed:
    print p

# and just because it works, doesn't mean it's right.
# but hey, you get what you pay for. I'm _sure_ if I were
# to pay for a subscription to IMDBPro, I wouldn't see
# these errors ;-)
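And since you asked about SQLite: the disabled Access block above
translates almost line for line. An untested sketch (the database
file name is made up, SQLite wants VALUES where my Access SQL used
SELECT, and the parsed list comes from the loop above):

import sqlite3

con = sqlite3.connect('imdb.db')                # made-up file name
cursor = con.cursor()
cursor.execute("""CREATE TABLE IF NOT EXISTS imdweeks2
    ( Date_reported TEXT, Title TEXT, Film_Date TEXT, Gross_to_Date TEXT )""")
if len(parsed) > 0:
    print '...inserting into database'
    cursor.executemany("""INSERT INTO imdweeks2
        ( Date_reported, Title, Film_Date, Gross_to_Date )
        VALUES (?,?,?,?);""", parsed)
else:
    print '...aborting, no records found'
con.commit()
cursor.close()
con.close()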
##You should get this:
##
## 4/6/2008 scraping IMDB: 111 records found
##['4/6/2008', "[u'I Am Legend']", '2007', ' 256,386,216']
##['4/6/2008', "[u'National Treasure: Book of Secrets']", '2007', ' 218,701,477']
##['4/6/2008', "[u'Alvin and the Chipmunks']", '2007', ' 216,873,487']
##['4/6/2008', "[u'Juno']", '2007', ' 142,545,706']
##['4/6/2008', "[u'Horton Hears a Who!']", '2008', ' 131,076,768']
##['4/6/2008', "[u'Bucket List, The']", '2007', ' 91,742,612']
##['4/6/2008', "[u'10,000 BC']", '2008', ' 89,349,915']
##['4/6/2008', "[u'Cloverfield']", '2008', ' 80,034,302']
##['4/6/2008', "[u'Jumper']", '2008', ' 78,762,148']
##['4/6/2008', "[u'27 Dresses']", '2008', ' 76,376,607']
##['4/6/2008', "[u'No Country for Old Men']", '2007', ' 74,273,505']
##['4/6/2008', "[u'Vantage Point']", '2008', ' 71,037,105']
##['4/6/2008', "[u'Spiderwick Chronicles, The']", '2008', ' 69,872,230']
##['4/6/2008', '[u"Fool\'s Gold"]', '2008', ' 68,636,484']
##['4/6/2008', "[u'Hannah Montana/Miley Cyrus: Best of Both Worlds Concert Tour']", '2008', ' 65,010,561']
##['4/6/2008', "[u'Step Up 2: The Streets']", '2008', ' 57,389,556']
##['4/6/2008', "[u'Atonement']", '2007', ' 50,921,738']
##['4/6/2008', "[u'21']", '2008', ' 46,770,173']
##['4/6/2008', "[u'College Road Trip']", '2008', ' 40,918,686']
##['4/6/2008', "[u'There Will Be Blood']", '2007', ' 40,133,435']
##['4/6/2008', "[u'Meet the Spartans']", '2008', ' 38,185,300']
##['4/6/2008', "[u'Meet the Browns']", '2008', ' 37,662,502']
##['4/6/2008', "[u'Deep Sea 3D']", '2006', ' 36,141,373']
##['4/6/2008', "[u'Semi-Pro']", '2008', ' 33,289,722']
##['4/6/2008', "[u'Definitely, Maybe']", '2008', ' 31,973,840']
##['4/6/2008', "[u'Eye, The']", '2008', ' 31,397,498']
##['4/6/2008', "[u'Great Debaters, The']", '2007', ' 30,219,326']
##['4/6/2008', "[u'Bank Job, The']", '2008', ' 26,804,821']
##['4/6/2008', "[u'Other Boleyn Girl, The']", '2008', ' 26,051,195']
##['4/6/2008', "[u'Drillbit Taylor']", '2008', ' 25,490,483']
##['4/6/2008', "[u'Magnificent Desolation: Walking on the Moon 3D']", '2005', ' 23,283,158']
##['4/6/2008', "[u'Shutter']", '2008', ' 23,138,277']
##['4/6/2008', "[u'Never Back Down']", '2008', ' 23,080,675']
##['4/6/2008', "[u'Mad Money']", '2008', ' 20,648,442']
##['4/6/2008', "[u'Galapagos']", '1955', ' 17,152,405']
##['4/6/2008', "[u'Superhero Movie']", '2008', ' 16,899,661']
##['4/6/2008', "[u'Wild Safari 3D']", '2005', ' 16,550,933']
##['4/6/2008', "[u'Kite Runner, The']", '2007', ' 15,790,223']
##['4/6/2008', '[u"Nim\'s Island"]', '2008', ' 13,210,579']
##['4/6/2008', "[u'Leatherheads']", '2008', ' 12,682,595']
##['4/6/2008', "[u'Be Kind Rewind']", '2008', ' 11,028,439']
##['4/6/2008', "[u'Doomsday']", '2008', ' 10,955,425']
##['4/6/2008', "[u'Sea Monsters: A Prehistoric Adventure']", '2007', ' 10,745,308']
##['4/6/2008', "[u'Miss Pettigrew Lives for a Day']", '2008', ' 10,534,800']
##['4/6/2008', "[u'Môme, La']", '2007', ' 10,299,782']
##['4/6/2008', "[u'Penelope']", '2006', ' 9,646,154']
##['4/6/2008', "[u'Misma luna, La']", '2007', ' 8,959,462']
##['4/6/2008', "[u'Roving Mars']", '2006', ' 8,463,161']
##['4/6/2008', "[u'Stop-Loss']", '2008', ' 8,170,755']
##['4/6/2008', "[u'Ruins, The']", '2008', ' 8,003,241']
##['4/6/2008', "[u'Bella']", '2006', ' 7,776,080']
##['4/6/2008', "[u'U2 3D']", '2007', ' 7,348,105']
##['4/6/2008', "[u'Orfanato, El']", '2007', ' 7,159,147']
##['4/6/2008', "[u'In Bruges']", '2008', ' 6,831,761']
##['4/6/2008', "[u'Savages, The']", '2007', ' 6,571,599']
##['4/6/2008', "[u'Scaphandre et le papillon, Le']", '2007', ' 5,990,075']
##['4/6/2008', "[u'Run Fatboy Run']", '2007', ' 4,430,583']
##['4/6/2008', "[u'Persepolis']", '2007', ' 4,200,980']
##['4/6/2008', "[u'Charlie Bartlett']", '2007', ' 3,928,412']
##['4/6/2008', "[u'Jodhaa Akbar']", '2008', ' 3,434,629']
##['4/6/2008', "[u'Fälscher, Die']", '2007', ' 2,903,370']
##['4/6/2008', "[u'Bikur Ha-Tizmoret']", '2007', ' 2,459,543']
##['4/6/2008', "[u'Shine a Light']", '2008', ' 1,488,081']
##['4/6/2008', "[u'Race']", '2008', ' 1,327,606']
##['4/6/2008', "[u'Funny Games U.S.']", '2007', ' 1,274,055']
##['4/6/2008', "[u'4 luni, 3 saptamâni si 2 zile']", '2007', ' 1,103,315']
##['4/6/2008', "[u'Married Life']", '2007', ' 1,002,318']
##['4/6/2008', "[u'Diary of the Dead']", '2007', ' 893,192']
##['4/6/2008', "[u'Starting Out in the Evening']", '2007', ' 882,518']
##['4/6/2008', "[u'Dolphins and Whales 3D: Tribes of the Ocean']", '2008', ' 854,304']
##['4/6/2008', "[u'Sukkar banat']", '2007', ' 781,954']
##['4/6/2008', "[u'Bonneville']", '2006', ' 471,679']
##['4/6/2008', "[u'Flawless']", '2007', ' 390,892']
##['4/6/2008', "[u'Paranoid Park']", '2007', ' 387,119']
##['4/6/2008', "[u'Teeth']", '2007', ' 321,732']
##['4/6/2008', "[u'Hammer, The']", '2007', ' 321,579']
##['4/6/2008', "[u'Priceless']", '2008', ' 320,131']
##['4/6/2008', "[u'Steep']", '2007', ' 259,840']
##['4/6/2008', "[u'Honeydripper']", '2007', ' 259,192']
##['4/6/2008', "[u'Snow Angels']", '2007', ' 255,147']
##['4/6/2008', "[u'Taxi to the Dark Side']", '2007', ' 231,743']
##['4/6/2008', "[u'Cheung Gong 7 hou']", '2008', ' 188,067']
##['4/6/2008', "[u'Ne touchez pas la hache']", '2007', ' 184,513']
##['4/6/2008', "[u'Sleepwalking']", '2008', ' 160,715']
##['4/6/2008', "[u'Chicago 10']", '2007', ' 149,456']
##['4/6/2008', "[u'Girls Rock!']", '2007', ' 92,636']
##['4/6/2008', "[u'Beaufort']", '2007', ' 87,339']
##['4/6/2008', "[u'Shelter']", '2007', ' 85,928']
##['4/6/2008', "[u'My Blueberry Nights']", '2007', ' 74,146']
##['4/6/2008', "[u'Témoins, Les']", '2007', ' 71,624']
##['4/6/2008', "[u'Mépris, Le']", '1963', ' 70,761']
##['4/6/2008', "[u'Singing Revolution, The']", '2006', ' 66,482']
##['4/6/2008', "[u'Chop Shop']", '2007', ' 58,858']
##['4/6/2008', '[u"Chansons d\'amour, Les"]', '2007', ' 58,577']
##['4/6/2008', "[u'Praying with Lior']", '2007', ' 57,325']
##['4/6/2008', "[u'Yihe yuan']", '2006', ' 57,155']
##['4/6/2008', "[u'Casa de Alice, A']", '2007', ' 53,700']
##['4/6/2008', "[u'Blindsight']", '2006', ' 51,256']
##['4/6/2008', "[u'Boarding Gate']", '2007', ' 37,107']
##['4/6/2008', "[u'Voyage du ballon rouge, Le']", '2007', ' 35,222']
##['4/6/2008', "[u'Bill']", '2007', ' 35,201']
##['4/6/2008', "[u'Mio fratello è figlio unico']", '2007', ' 34,138']
##['4/6/2008', "[u'Chapter 27']", '2007', ' 32,602']
##['4/6/2008', "[u'Meduzot']", '2007', ' 25,352']
##['4/6/2008', "[u'Shotgun Stories']", '2007', ' 25,346']
##['4/6/2008', "[u'Sconosciuta, La']", '2006', ' 18,569']
##['4/6/2008', "[u'Imaginary Witness: Hollywood and the Holocaust']", '2004', ' 18,475']
##['4/6/2008', "[u'Irina Palm']", '2007', ' 14,214']
##['4/6/2008', "[u'Naissance des pieuvres']", '2007', ' 7,418']
##['4/6/2008', "[u'Four Letter Word, A']", '2007', ' 6,017']
##['4/6/2008', "[u'Tuya de hun shi']", '2006', ' 2,619']

--
http://mail.python.org/mailman/listinfo/python-list