On May 17, 4:02 am, Sanoski <[EMAIL PROTECTED]> wrote:

> The reason I ask about text files is the need to save the data
> locally, and have it stored in a way where backups can easily
> be made.

Sure, you can always do that if you want. But if your target is SQLite
or MS-Access, those are files also, so they can be backed up as easily
as text files.

> Then if your computer crashes and you lose everything, but
> you have the data files it uses backed up, you can just
> download the program, extract the backed up data to a
> specific directory, and then it works exactly the way it
> did before you lost it. I suppose a SQLite database might
> solve this, but I'm not sure.

It will. Remember, once in a database, you have value-added features
like filtering, sorting, etc. that you would have to do yourself if
you simply read in text files.

> I'm just getting started, and I
> don't know too much about it yet.

Trust me, a database is the way to go. My preference is MS-Access,
because I need it for work. It is a great tool for learning databases
because its visual interface can make you productive BEFORE you learn
SQL.
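If you do go with SQLite, here's a rough sqlite3 sketch of what that
buys you. The file name, table and column names are placeholders I
made up (the columns just mirror the feed fields you describe below),
so treat it as a starting point rather than a finished design:

import sqlite3

# the whole database lives in one file, so backing it up
# is just copying feed.db somewhere safe
con = sqlite3.connect('feed.db')
cur = con.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS classes
               (class TEXT, room TEXT, student TEXT, gpa REAL)""")
#
# a couple of rows for illustration (the second one is made up)
#
cur.execute("INSERT INTO classes VALUES (?,?,?,?)",
            ('Economics', '312', 'John Carbroil', 4.0))
cur.execute("INSERT INTO classes VALUES (?,?,?,?)",
            ('Biology', '101', 'Jane Doe', 3.7))
con.commit()
#
# the value-added part: filtering and sorting are one SQL statement
#
cur.execute("""SELECT student, class, gpa FROM classes
               WHERE gpa >= 3.8 ORDER BY gpa DESC""")
for row in cur.fetchall():
    print row
con.close()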
> I'm also still not sure how to download and associate the pictures
> that each entry has for it.

See example at end of post.

> The main thing for me now is getting
> started. It needs to get information from the web. In this case,
> it's a simple XML feed.

BeautifulSoup also has an XML parser. Go to their web page and read
the documentation.

> The one thing that seems that would
> make it easier is every post to the feed is very consistent.
> Each header starts with the letter A, which stands for Alpike
> Tech, followed by the name of the class, the room number, the
> leading student, and his GPA. All that is one line of text.
> But it's also a link to more information. For example:
>
> A Economics, 312, John Carbroil, 4.0
> That's one whole post to the feed. Like I say, it's very
> simple and consistent. Which should make this easier.

That's what you want for parsing: a way to separate a composite set of
data. Simple cases can often be handled with split(), complex ones with
regular expressions (there's a sketch below).

> Eventually I want it to follow that link and grab information
> from there too, but I'll worry about that later. Technically,
> if I figure this first part out, that problem should take
> care of itself.
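Since the header line is that consistent, split() is probably all you
need. Here's a rough sketch using BeautifulSoup's XML parser
(BeautifulStoneSoup); the feed URL and the tag names (item, title,
link) are guesses on my part because I haven't seen your actual feed,
so adjust them to match:

from BeautifulSoup import BeautifulStoneSoup
import urllib2
#
# guessed URL - substitute the real feed address
#
the_feed = "http://www.example.com/alpike/feed.xml"
xml = urllib2.urlopen(the_feed).read()
soup = BeautifulStoneSoup(xml)
#
# guessed tag names - check what the feed actually calls its entries
#
for item in soup.findAll('item'):
    header = item.title.string   # e.g. "A Economics, 312, John Carbroil, 4.0"
    link = item.link.string      # the link you'll follow later
    # the leading "A " is constant, so drop it and split on the commas
    class_name, room, student, gpa = [s.strip() for s in header[2:].split(',')]
    print class_name, room, student, float(gpa), link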
A sample picture scraper:

from BeautifulSoup import BeautifulSoup
import urllib2
import urllib
#
# start by scraping the web page
#
the_url = "http://members.aol.com/mensanator/OHIO/TheCobs.htm"
req = urllib2.Request(url=the_url)
f = urllib2.urlopen(req)
www = f.read()
soup = BeautifulSoup(www)
print soup.prettify()
#
# a simple page with pictures
#
##<html>
## <head>
##  <title>
##   Ohio - The Cobs!
##  </title>
## </head>
## <body>
##  <h1>
##   Ohio Vacation Pictures - The Cobs!
##  </h1>
##  <hr />
##  <img src="AUT_2784.JPG" />
##  <br />
##  WTF?
##  <p>
##   <img src="AUT_2764.JPG" />
##   <br />
##   This is surreal.
##  </p>
##  <p>
##   <img src="AUT_2765.JPG" />
##   <br />
##   Six foot tall corn cobs made of concrete.
##  </p>
##  <p>
##   <img src="AUT_2766.JPG" />
##   <br />
##   109 of them, laid out like a modern Stonehenge.
##  </p>
##  <p>
##   <img src="AUT_2769.JPG" />
##   <br />
##   With it's own Druid worshippers.
##  </p>
##  <p>
##   <img src="AUT_2781.JPG" />
##   <br />
##   Cue the
##   <i>
##    Also Sprach Zarathustra
##   </i>
##   soundtrack.
##  </p>
##  <p>
##   <img src="100_0887.JPG" />
##   <br />
##   Air & Space Museums are a dime a dozen.
##   <br />
##   But there's only
##   <b>
##    one
##   </b>
##   Cobs!
##  </p>
##  <p>
##  </p>
## </body>
##</html>
#
# parse the page to find all the pictures (image tags)
#
the_pics = soup.findAll('img')
for i in the_pics:
    print i

##<img src="AUT_2784.JPG" />
##<img src="AUT_2764.JPG" />
##<img src="AUT_2765.JPG" />
##<img src="AUT_2766.JPG" />
##<img src="AUT_2769.JPG" />
##<img src="AUT_2781.JPG" />
##<img src="100_0887.JPG" />
#
# the pictures have no path, so they must be in the
# same directory as the web page
#
the_jpg_path = "http://members.aol.com/mensanator/OHIO/"
#
# now with urllib, copy the picture files to the local
# hard drive, renaming with a sequence id at the same time
#
for i,j in enumerate(the_pics):
    p = the_jpg_path + j['src']
    q = 'C:\\scrape\\' + 'pic' + str(i).zfill(4) + '.jpg'
    urllib.urlretrieve(p,q)
#
# and here are the captured files
#
## C:\>dir scrape
##  Volume in drive C has no label.
##  Volume Serial Number is D019-C60D
##
##  Directory of C:\scrape
##
## 05/17/2008  07:06 PM    <DIR>          .
## 05/17/2008  07:06 PM    <DIR>          ..
## 05/17/2008  07:05 PM            69,877 pic0000.jpg
## 05/17/2008  07:05 PM            71,776 pic0001.jpg
## 05/17/2008  07:05 PM            70,958 pic0002.jpg
## 05/17/2008  07:05 PM            69,261 pic0003.jpg
## 05/17/2008  07:05 PM            70,653 pic0004.jpg
## 05/17/2008  07:05 PM            70,564 pic0005.jpg
## 05/17/2008  07:05 PM           113,356 pic0006.jpg
##                7 File(s)        536,445 bytes
##                2 Dir(s)  27,823,570,944 bytes free
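And to tie the pictures back to the database idea: once urlretrieve()
has saved a file, recording the local filename next to the entry it
belongs to is all the "associating" really takes. This is a
hypothetical follow-up using the placeholder table from my earlier
sketch:

import sqlite3

con = sqlite3.connect('feed.db')
cur = con.cursor()
#
# add a column for the picture path (only needs to be done once)
#
cur.execute("ALTER TABLE classes ADD COLUMN picture TEXT")
#
# point the entry at the file the scraper saved for it
#
cur.execute("UPDATE classes SET picture=? WHERE student=?",
            ('C:\\scrape\\pic0000.jpg', 'John Carbroil'))
con.commit()
con.close()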