[Tutor] Feedparser and Google News feeds
I have been working through some of the examples in the Programming Collective Intelligence book by Toby Segaran (which I highly recommend, by the way). Some of the exercises use feedparser to pull in RSS/Atom feeds from different sources before doing more interesting things. The algorithm material I pretty much follow, but one thing is driving me CRAZY: I can't seem to pull more than 10 items from a Google News feed. For example, I'd like to pull 1000 Google News items using some search term, let's say 'lightsabers'. The associated Atom feed url, however, only holds ten items. And it's hard to do the clustering analysis with only ten items!

Anyway, I imagine this must be a straightforward thing and I'm being a moron, but I don't know where else to ask this question (none of my friends are web-savvy programmers). I did see some posts about an n=100 parameter one can add to the url (the limit seems to be 100 items), but it only seems to affect the webpage view and not the feed. I've also tried subscribing to the feed in Google Reader and making the feed public, but I seem to be running into the same problem. Is this a feedparser thing or a google thing? The url I'm using is

http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&as_scoring=r&as_maxm=3&q=health+information+exchange&as_qdr=a&as_drrb=q&as_mind=8&as_minm=2&cf=all&as_maxd=100&output=rss

Can anyone help me? I'm tearing my hair out and want to choke my computer. It's probably not relevant, but I'm running Snow Leopard and Python 2.6 (actually EPD 6.1).

___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Parsing html tables and using numpy for subsequent processing
Gerard wrote:
>> Not very pretty, but I imagine there are very few pretty examples of this kind of thing. I'll add more comments...honest.
>
> Nothing obviously wrong with your code to my eyes.

Many thanks Gerard, appreciate you looking it over. I'll take a look at the link you posted as well (I'm traveling at the moment). Cheers,

-- David Kim
I hear and I forget. I see and I remember. I do and I understand. -- Confucius
morenotestoself.wordpress.com financialpython.wordpress.com
[Tutor] Parsing html tables and using numpy for subsequent processing
Hello all, I've finally gotten around to my 'learn how to parse html' project. For those of you looking for examples (like me!), hopefully it will show you one potentially thickheaded way to do it. For those of you with powerful python-fu, I would appreciate any feedback regarding the direction I'm taking and obvious coding no-no's (I have no formal training in computer science). Please note the project is unfinished, so there isn't a nice, neat result quite yet. Rather than spam the list with a long description, please visit the following post where I outline my approach and provide necessary links -- http://financialpython.wordpress.com/2009/09/15/parsing-dtcc-part-1-pita/ The code can be found at pastebin: http://financialpython.pastebin.com/f4efd8930 The original html can be found at http://www.dtcc.com/products/derivserv/data/index.php (I am pulling and parsing tables from all three sections). Many thanks! -- DK
Re: [Tutor] Help deciding between python and ruby
On Fri, 2009-09-04 at 06:18 -0700, dan06 wrote:
> I'd like to learn a programming language - and I need help deciding between python and ruby. I'm interested in learning what the differences are, both objective and subjective, between the two languages. I know this is a python mailing list, so knowledge/experience with ruby may be limited - in which case I'd still be interested in learning why members of this mailing list chose python over (or in addition to) any other programming language. I look forward to the feedback/insight.

I'm a non-programmer who went through this process not too long ago (leaving a broken trail of google searches and books in my wake). I looked at a number of languages, Ruby included. I chose to focus on Python first because of its relatively clear syntax and what seem like more mature scientific modules/distributions (Scipy, Numpy, EPD, Sage, and related packages). Much of the secondary material I found for Ruby focused on Rails and web-related endeavors (though people obviously use Ruby for many things). Of course, Python offers its own web frameworks (e.g., Django, web2py) and probably plenty of other packages I haven't even encountered yet.

I have no pet peeves when it comes to syntax (e.g., some people don't like significant whitespace, etc.). I was just looking for 1) a language that isn't too hard to learn and 2) a language that is flexible enough that future exploration/growth wouldn't be a huge pain in the ass. So, for example, pulling and storing data is relatively straightforward, but can I analyze it? If I do that, can I visualize the analysis? And can I automatically generate a presentation using these visualizations if I want to? If I then want to convert this presentation into a data-driven website, can I do that? Etc., etc., etc. One can do all of this in any language, but Python offered the best productivity-to-PITA ratio (to me, at least). So it all obviously depends on what you want to do, but those were my reasons.
Both Ruby and Python were attractive; Python's scientific ecosystem was the deciding factor. I am now looking at R to plug some holes, since no language is perfect ;) I'm interested to see how the MacRuby project develops, as they are moving to an LLVM-based architecture that is expected to improve Ruby's performance a lot (mirrored by similar efforts by the JRuby team and others). I'm getting a bit out over my skis now, so I'll stop there. Hope it helps, dk

-- morenotestoself.wordpress.com financialpython.wordpress.com
Re: [Tutor] python's database
I don't know how much it's in use, but I thought gadfly (http://gadfly.sourceforge.net/) was the db that's implemented in python.
[Tutor] Automating creation of presentation slides?
Hi everyone, I'm wondering what people consider the most efficient and brain-damage free way to automate the creation of presentation slides with Python. Scripting Powerpoint via COM? Generating a Keynote XML file? Something else that's easier (hopefully)? I'm imagining a case where one does certain analyses periodically with updated data, with charts generated by matplotlib (or something similar) that ultimately need to end up in the presentation. Many thanks, DK
Re: [Tutor] Automating creation of presentation slides?
Thanks for the suggestion Kent, I was not familiar with reStructuredText. Looks very interesting and more practical than scripting Powerpoint or recreating Apple's Keynote XML format. This nugget also led me to rst2pdf (http://code.google.com/p/rst2pdf/), for those who care. Cheers, DK

On Wed, Aug 12, 2009 at 8:52 AM, Kent Johnson <ken...@tds.net> wrote:
> On Wed, Aug 12, 2009 at 12:51 AM, David Kim <davidki...@gmail.com> wrote:
>> Hi everyone, I'm wondering what people consider the most efficient and brain-damage free way to automate the creation of presentation slides with Python. Scripting Powerpoint via COM? Generating a Keynote XML file? Something else that's easier (hopefully)?
>
> Maybe this: http://pypi.python.org/pypi/bruce
> Kent

-- morenotestoself.wordpress.com financialpython.wordpress.com
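Since the charts already come out of matplotlib as image files, the slide source can be as simple as a generated .rst file that rst2pdf (or bruce) then renders. A minimal sketch of that generation step — the function name, slide titles, and chart filenames here are made up for illustration:

```python
import os
import tempfile

def make_slides(slides, out_path):
    """Write one reST section per (title, image_path) slide."""
    lines = []
    for title, image in slides:
        lines.append(title)
        lines.append('=' * len(title))   # reST section underline must match title length
        lines.append('')
        lines.append('.. image:: ' + image)
        lines.append('')
    text = '\n'.join(lines)
    with open(out_path, 'w') as f:
        f.write(text)
    return text

# Hypothetical weekly run: chart1.png would come from matplotlib's savefig().
out = os.path.join(tempfile.gettempdir(), 'slides.rst')
rst = make_slides([('Weekly Update', 'chart1.png')], out)
```

The resulting file can then be fed to rst2pdf (`rst2pdf slides.rst`) to get a PDF deck.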
Re: [Tutor] building database with sqlite3 and matplotlib
Thanks so much for the comments! I appreciate the look. It's hard to know what the best practices are (or if they even exist).

On Sat, Aug 8, 2009 at 2:28 PM, Kent Johnson <ken...@tds.net> wrote:
> You don't seem to actually have a main(). Are you running this by importing it? I would make a separate function to create the database and call that from build_database().

I was running it as a standalone to see if it worked but forgot to move the code to main(). I cut but never pasted!
[Tutor] building database with sqlite3 and matplotlib
Hello everyone, I've been learning python in a vacuum for the past few months and I was wondering whether anyone would be willing to take a look at some code? I've been messing around with sqlite and matplotlib, but I couldn't get all the string substitution working (using ?s). I ended up getting the script to work, but I stupidly didn't save the version of the script that I thought would work but didn't. (Did that make sense?)

The code can be found at http://notestoself.posterous.com/use-python-and-sqlite3-to-build-a-database-co A short summary of what I did is at http://notestoself.posterous.com/use-python-and-sqlite3-to-build-a-database (Or should I have pasted the code in this message?)

I've been trying to learn from books, but some critique would be very appreciated. I'm just trying to get a sense of whether I'm doing things in an unnecessarily convoluted way.

-- morenotestoself.wordpress.com
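For the `?` substitution that caused the trouble: sqlite3 expects a parameterized SQL string plus a separate sequence of values (or a list of sequences with executemany), and `?` placeholders only work for values, never for table or column names. A small self-contained example (table and data invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # throwaway in-memory database
cur = conn.cursor()
cur.execute('CREATE TABLE prices (symbol TEXT, close REAL)')

# One ? per value, with the values passed as a separate sequence --
# never spliced into the SQL string with % or +.
rows = [('AAPL', 170.5), ('GOOG', 132.2)]
cur.executemany('INSERT INTO prices VALUES (?, ?)', rows)

# The same placeholder style works in queries; note the 1-tuple.
cur.execute('SELECT close FROM prices WHERE symbol = ?', ('AAPL',))
result = cur.fetchone()[0]   # -> 170.5
conn.commit()
```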
Re: [Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!
Thanks Kent, perhaps I'll cool the Python jets and move on to HTTP and HTML. I was hoping it would be something I could just pick up along the way; looks like I was wrong. dk

On Tue, Jul 7, 2009 at 1:56 PM, Kent Johnson <ken...@tds.net> wrote:
> On Tue, Jul 7, 2009 at 1:20 PM, David Kim <davidki...@gmail.com> wrote:
>> On Tue, Jul 7, 2009 at 7:26 AM, Kent Johnson <ken...@tds.net> wrote:
>>> curl works because it ignores the redirect to the ToS page, and the site is (astoundingly) dumb enough to serve the content with the redirect. You could make urllib2 behave the same way by defining a 302 handler that does nothing.
>>
>> Many thanks for the redirect pointer! I also found http://diveintopython.org/http_web_services/redirects.html. Is the handler class on this page what you mean by a handler that does nothing? (It looks like it exposes the error code but still follows the redirect.)
>
> No, all of those examples are handling the redirect. The SmartRedirectHandler just captures additional status. I think you need something like this:
>
> class IgnoreRedirectHandler(urllib2.HTTPRedirectHandler):
>     def http_error_301(self, req, fp, code, msg, headers):
>         return None
>     def http_error_302(self, req, fp, code, msg, headers):
>         return None
>
>> I guess I'm still a little confused since, if the handler does nothing, won't I still go to the ToS page?
>
> No, it is the action of the handler, responding to the redirect request, that causes the ToS page to be fetched.
>
>> For example, I ran the following code (found at http://stackoverflow.com/questions/554446/how-do-i-prevent-pythons-urllib2-from-following-a-redirect)
>
> That is pretty similar to the DiP code...
>
>> I suspect I am not understanding something basic about how urllib2 deals with this redirect issue since it seems everything I try gives me the same ToS page.
>
> Maybe you don't understand how redirect works in general... Generally you have to post to the same url as the form, giving the same data the form does. You can inspect the source of the form to figure this out. In this case the form is
>
> <form method="post" action="/products/consent.php">
>   <input type="hidden" value="tiwd/products/derivserv/data_table_i.php" name="urltarget"/>
>   <input type="hidden" value="1" name="check_one"/>
>   <input type="hidden" value="tiwdata" name="tag"/>
>   <input type="submit" value="I Agree" name="acknowledgement"/>
>   <input type="submit" value="Decline" name="acknowledgement"/>
> </form>
>
> You generally need to enable cookie support in urllib2 as well, because the site will use a cookie to flag that you saw the consent form. This tutorial shows how to enable cookies and submit form data: http://personalpages.tds.net/~kent37/kk/00010.html
>
>> I have seen the login examples where one provides values for the fields username and password (thanks Kent). Given the form above, however, it's unclear to me how one POSTs the form data when you aren't actually passing any parameters. Perhaps this is less of a Python question and more an http question (which unfortunately I know nothing about either).
>
> Yes, the parameters are listed in the form. If you don't have at least a basic understanding of HTTP and HTML you are going to have trouble with this project... Kent

-- morenotestoself.wordpress.com
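Putting Kent's two pointers together — POST the fields the consent form declares, with cookies enabled so the site remembers the consent — looks roughly like this. The field names and values are copied from the form above; the action URL and everything else about the site's behavior are assumptions, and the imports are written to run under both Python 2 (urllib2, as in this thread) and Python 3:

```python
try:                                      # Python 2 names, as used in the thread
    import urllib2
    import cookielib
    from urllib import urlencode
except ImportError:                       # the same pieces under Python 3 names
    import urllib.request as urllib2
    import http.cookiejar as cookielib
    from urllib.parse import urlencode

# The hidden fields from the consent form, plus the "I Agree" submit button.
form_fields = [
    ('urltarget', 'tiwd/products/derivserv/data_table_i.php'),
    ('check_one', '1'),
    ('tag', 'tiwdata'),
    ('acknowledgement', 'I Agree'),
]
post_data = urlencode(form_fields)

# Cookie support, so the consent flag the site sets is sent back on the
# follow-up request for the actual data table.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# Not executed here -- the URL is a guess from the form's action attribute:
# html = opener.open('http://www.dtcc.com/products/consent.php', post_data).read()
```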
[Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!
Hello all, I have two questions I'm hoping someone will have the patience to answer as an act of mercy.

I. How to get past a Terms of Service page?

I've just started learning python (have never done any programming prior) and am trying to figure out how to open or download a website to scrape data. The only problem is, whenever I try to open the link I'm after (via urllib2, for example), I end up getting the HTML for a Terms of Service page (where one has to click an "I Agree" button) rather than the actual target page. I've seen examples on the web of providing data for forms (typically by finding the name of the form and providing some sort of dictionary to fill in the form fields), but this simple act of getting past "I Agree" is stumping me. Can anyone save my sanity?

As a workaround, I've been using os.popen('curl ' + url + ' > ' + filename) to save the html in a txt file for later processing. I have no idea why curl works and urllib2, for example, doesn't (I use OS X). I even tried to use Yahoo Pipes to try and sidestep coding anything altogether, but ended up looking at the same Terms of Service page anyway. Here's the code (though it's probably not that illuminating since it's basically just opening a url):

import urllib2
url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'  # the first of 23 tables
html = urllib2.urlopen(url).read()

II. How to parse html tables with lxml, beautifulsoup? (for dummies)

Assuming I get past the Terms of Service, I'm a bit overwhelmed by the need to know XPath, CSS, XML, DOM, etc. to scrape data from the web. I've tried looking at the documentation included with different python libraries, but just got more confused. The basic tutorials show something like the following:

from lxml import html
doc = html.parse('/path/to/test.txt')  # the file I downloaded via curl
root = doc.getroot()  # what is this root business?
tables = root.cssselect('table')

I understand that selecting all the table tags will somehow target however many tables are on the page. The problem is the table has multiple headers, empty cells, etc. Most of the examples on the web have to do with scraping the web for search results or something that doesn't really depend on the table format for anything other than layout. Are there any resources out there that are appropriate for web/python illiterati like myself and deal with structured data as in the url above? FYI, the data in the url above goes up in smoke every week, so I'm trying to capture it automatically on a weekly basis. Getting all of it into a CSV or database would be a personal cause for celebration, as it would be the first really useful thing I've done with python since starting to learn it a few months ago.

For anyone who is interested, here is the code that uses curl to pull the webpages. It basically just builds the url string for the different table-pages and saves down the file with a timestamped filename:

import os
from time import strftime

BASE_URL = 'http://www.dtcc.com/products/derivserv/data_table_'
SECTIONS = {'section1': {'select': 'i.php?id=table', 'id': range(1, 9)},
            'section2': {'select': 'ii.php?id=table', 'id': range(9, 17)},
            'section3': {'select': 'iii.php?id=table', 'id': range(17, 24)}}

def get_pages():
    filenames = []
    path = '~/Dev/Data/DTCC_DerivServ/'
    #os.popen('cd ' + path)
    for section in SECTIONS:
        for id in SECTIONS[section]['id']:
            #urlList.append(BASE_URL + SECTIONS[section]['select'] + str(id))
            url = BASE_URL + SECTIONS[section]['select'] + str(id)
            timestamp = strftime('%Y%m%d_')
            #sectionName = BASE_URL.split('/')[-1]
            sectionNumber = SECTIONS[section]['select'].split('.')[0]
            tableNumber = str(id) + '_'
            filename = timestamp + tableNumber + sectionNumber + '.txt'
            os.popen('curl ' + url + ' > ' + path + filename)
            filenames.append(filename)
    return filenames

if __name__ == '__main__':
    get_pages()

-- morenotestoself.wordpress.com
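For the parsing half of the project, one stdlib-only starting point (sidestepping lxml and BeautifulSoup entirely, though they scale much better to messy pages) is HTMLParser: track when you're inside a cell and collect text row by row. This is only a sketch with invented sample data, and it ignores the multi-row headers, colspans, and empty cells the real DTCC tables have:

```python
try:
    from HTMLParser import HTMLParser      # Python 2, as used in the thread
except ImportError:
    from html.parser import HTMLParser     # Python 3 name

class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, one list per <tr>."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []
        self._cell = None                  # None means "not inside a cell"

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])
        elif tag in ('td', 'th'):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self.rows[-1].append(''.join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

parser = TableParser()
parser.feed('<table><tr><th>Contract</th><th>Gross</th></tr>'
            '<tr><td>CDS</td><td>100</td></tr></table>')
# parser.rows -> [['Contract', 'Gross'], ['CDS', '100']]
```

From here, `csv.writer(...).writerows(parser.rows)` would get the data into the CSV file mentioned above.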