[Tutor] Feedparser and Google News feeds

2010-03-10 Thread David Kim
I have been working through some of the examples in the Programming
Collective Intelligence book by Toby Segaran. I highly recommend it, btw.

Anyway, some of the exercises use feedparser to pull in RSS/Atom feeds from
different sources (before doing more interesting things). The algorithm
stuff I pretty much follow, but one thing is driving me CRAZY: I can't seem
to pull more than 10 items from a google news feed. For example, I'd like to
pull 1000 google news items (using some search term, let's say
'lightsabers'). The associated atom feed url, however, only holds ten items.
And it's hard to do some of the clustering analysis with only ten items!

Anyway, I imagine this must be a straightforward thing and I'm being a
moron, but I don't know where else to ask this question (none of my friends
are web-savvy programmers). I did see some posts about an n=100 term one can
add to the url (the limit seems to be 100 items), but it only seems to
affect the webpage view and not the feed. I've also tried subscribing to the
feed in Google Reader and making the feed public, but I seem to be running
into the same problem. Is this a feedparser thing or a google thing?

The url I'm using is
http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&as_scoring=r&as_maxm=3&q=health+information+exchange&as_qdr=a&as_drrb=q&as_mind=8&as_minm=2&cf=all&as_maxd=100&output=rss
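
For what it's worth, here's a stripped-down version of what I'm running (the
num parameter at the end is only a guess based on those posts; I haven't
confirmed the feed actually honors it):

import feedparser

# num is only a guess from posts I found; I haven't confirmed the feed
# pays any attention to it.
url = ('http://news.google.com/news?pz=1&ned=us&hl=en'
       '&q=lightsabers&output=rss&num=100')
feed = feedparser.parse(url)

print len(feed.entries)    # always comes back as 10 for me
for entry in feed.entries[:5]:
    print entry.title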

Can anyone help me? I'm tearing my hair out and want to choke my computer.
It's probably not relevant, but I'm running Snow Leopard and Python 2.6
(actually EPD 6.1).


Re: [Tutor] Parsing html tables and using numpy for subsequent processing

2009-09-17 Thread David Kim

 Gerard wrote:
 Not very pretty, but I imagine there are very few pretty examples of
 this kind of thing. I'll add more comments...honest. Nothing obviously
 wrong with your code to my eyes.


Many thanks Gerard, appreciate you looking it over. I'll take a look at the
link you posted as well (I'm traveling at the moment).

Cheers,

-- 
David Kim

I hear and I forget. I see and I remember. I do and I understand. -- Confucius

morenotestoself.wordpress.com
financialpython.wordpress.com


[Tutor] Parsing html tables and using numpy for subsequent processing

2009-09-15 Thread David Kim
Hello all,

I've finally gotten around to my 'learn how to parse html' project. For
those of you looking for examples (like me!), hopefully it will show you one
potentially thickheaded way to do it.

For those of you with powerful python-fu, I would appreciate any feedback
regarding the direction I'm taking and obvious coding no-no's (I have no
formal training in computer science). Please note the project is unfinished,
so there isn't a nice, neat result quite yet.

Rather than spam the list with a long description, please visit the
following post where I outline my approach and provide necessary links --
http://financialpython.wordpress.com/2009/09/15/parsing-dtcc-part-1-pita/

The code can be found at pastebin:
http://financialpython.pastebin.com/f4efd8930
The original html can be found at
http://www.dtcc.com/products/derivserv/data/index.php (I am pulling and
parsing tables from all three sections).
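
For anyone who just wants the gist without reading the pastebin, the core of
it boils down to something like this stripped-down sketch (the file name is
made up, and the real code does a lot more cleanup for the multi-level
headers, footnotes, etc.):

import numpy as np
from lxml import html

# Stripped-down sketch only: the file name is made up and the real code
# does a lot more cleanup before anything numeric happens.
doc = html.parse('20090915_1_i.txt')
rows = []
for tr in doc.getroot().cssselect('table')[0].cssselect('tr'):
    cells = [td.text_content().strip() for td in tr.cssselect('td')]
    if cells:                  # header-only rows come back empty here
        rows.append(cells)

# keep everything as strings for now; numeric columns get converted later
data = np.array(rows, dtype=object)
print data.shape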

Many thanks!

-- DK


Re: [Tutor] Help deciding between python and ruby

2009-09-04 Thread David Kim
On Fri, 2009-09-04 at 06:18 -0700, dan06 wrote:
 I'd like to learn a programming language - and I need help deciding between
 python and ruby. I'm interesting in learning what are the difference, both
 objective and subjective, between the two languages. I know this is a python
 mailing list, so knowledge/experience with ruby may be limited - in which
 case I'd still be interested in learning why members of this mailing list
 chose python over (or in addition to) any other programming language. I look
 forward to the feedback/insight.

I'm a non-programmer that went through this process not too long ago
(leaving a broken trail of google searches and books in my wake). I
looked at a number of languages, Ruby included. I chose to focus on
Python first because of its relatively clear syntax and what seems
like more mature scientific modules/distributions (Scipy, Numpy, EPD,
Sage, and related packages). Much of the secondary material I found
for Ruby focused on Rails and web-related endeavors (tho people
obviously use Ruby for many things). Of course, Python offers its own
web frameworks (e.g., Django, web2py) and probably plenty of other
packages I haven't even encountered yet.

I have no pet peeves when it comes to syntax (e.g., some people don't
like significant whitespace, etc.). I was just looking for 1) a
language that isn't too hard to learn and 2) a language that is
flexible enough that future exploration/growth wouldn't be a huge pain
in the ass. So, for example, pulling and storing data is relatively
straightforward, but can I analyze it? If I do that, can I visualize
the analysis? And can I automatically generate a presentation using
these visualizations if I want to? If I then want to convert this
presentation into a data-driven website, can I do that? Etc., etc.,
etc...One can do all of this in any language, but Python offered the
best productivity-to-PITA ratio (to me, at least).

So it all obviously depends on what you want to do, but those were my
reasons. Both Ruby and Python were attractive, I just decided that
Python's scientific ecosystem was the deciding factor. I am now
looking at R to plug some holes, since no language is perfect ;) I'm
interested to see how the MacRuby project develops as they are moving
to an LLVM-based architecture that is expected to improve Ruby's
performance a lot (mirrored by similar efforts by the JRuby team and
others). I'm getting a bit out over my skis now, so I'll stop there.
Hope it helps,

dk

-- 
morenotestoself.wordpress.com
financialpython.wordpress.com


Re: [Tutor] python's database

2009-08-16 Thread David Kim
I don't know how much it's in use, but I thought gadfly (
http://gadfly.sourceforge.net/) was the db that's implemented in python.


[Tutor] Automating creation of presentation slides?

2009-08-12 Thread David Kim
Hi everyone,

I'm wondering what people consider the most efficient and brain-damage free
way to automate the creation of presentation slides with Python. Scripting
Powerpoint via COM? Generating a Keynote XML file? Something else that's
easier (hopefully)?

I'm imagining a case where one does certain analyses periodically with
updated data, with charts generated by matplotlib (or something similar)
that ultimately need to end up in the presentation.

Many thanks,

DK


Re: [Tutor] Automating creation of presentation slides?

2009-08-12 Thread David Kim
Thanks for the suggestion Kent, I was not familiar with reStructuredText.
Looks very interesting and more practical than scripting Powerpoint or
recreating Apple's Keynote XML format. This nugget also led me to rst2pdf
(http://code.google.com/p/rst2pdf/), for those who care.
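
To make that concrete for anyone following along, the workflow I have in mind
is roughly this (file names are made up, and I'm assuming rst2pdf's basic
"input file plus -o output" invocation; I haven't actually wired this up yet):

import subprocess

# Made-up file names; the idea is just "write reST, hand it to rst2pdf".
rst = """\
Weekly update
=============

Volumes by product
------------------

.. image:: volumes.png
"""

with open('slides.rst', 'w') as f:
    f.write(rst)

# Assumes the standard rst2pdf usage: rst2pdf INPUT -o OUTPUT
subprocess.call(['rst2pdf', 'slides.rst', '-o', 'slides.pdf'])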

Cheers,

DK

On Wed, Aug 12, 2009 at 8:52 AM, Kent Johnson ken...@tds.net wrote:

 On Wed, Aug 12, 2009 at 12:51 AM, David Kim davidki...@gmail.com wrote:
  Hi everyone,
 
  I'm wondering what people consider the most efficient and brain-damage
 free
  way to automate the creation of presentation slides with Python.
 Scripting
  Powerpoint via COM? Generating a Keynote XML file? Something else that's
  easier (hopefully)?

 Maybe this:
 http://pypi.python.org/pypi/bruce

 Kent




-- 
morenotestoself.wordpress.com
financialpython.wordpress.com


Re: [Tutor] building database with sqlite3 and matplotlib

2009-08-08 Thread David Kim
Thanks so much for the comments! I appreciate the look. It's hard to know
what the best practices are (or if they even exist).

On Sat, Aug 8, 2009 at 2:28 PM, Kent Johnson ken...@tds.net wrote:

 You don't seem to actually have a main(). Are you running this by importing
 it?

 I would make a separate function to create the database and call that
 from build_database().


I was running it as a standalone to see if it worked but forgot to move the
code to main(). I cut but never pasted!
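
For my own notes, I take Kent's suggestion to mean a structure roughly like
this (the schema and names here are placeholders, not my actual code):

import sqlite3

DB_PATH = 'derivserv.db'   # placeholder name

def create_database(db_path=DB_PATH):
    # separate function whose only job is setting up the schema
    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE IF NOT EXISTS gross_notional '
                 '(week TEXT, product TEXT, notional REAL)')
    conn.commit()
    return conn

def build_database():
    conn = create_database()
    # ... parse the source files and INSERT rows here ...
    conn.close()

def main():
    build_database()

if __name__ == '__main__':
    main()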


[Tutor] building database with sqlite3 and matplotlib

2009-08-07 Thread David Kim
Hello everyone,

I've been learning python in a vacuum for the past few months and I
was wondering whether anyone would be willing to take a look at some
code? I've been messing around with sqlite and matplotlib, but I
couldn't get all of the string substitution (using ?s) to work at first. I
eventually got the script working, but I stupidly didn't save the version of
the script that I thought would work but didn't. (Did that make sense?)
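
For context, the part that kept tripping me up was the ? placeholder
substitution; here is a minimal sketch of the pattern I eventually settled on
(the table and column names here are made up, not the ones in my script):

import sqlite3

conn = sqlite3.connect(':memory:')   # throwaway db just for the example
cur = conn.cursor()
cur.execute('CREATE TABLE quotes (symbol TEXT, price REAL)')

# One ? per value; sqlite3 does the quoting/escaping for you.
cur.execute('INSERT INTO quotes VALUES (?, ?)', ('SPY', 101.25))

# executemany takes a sequence of tuples.
rows = [('AAPL', 170.1), ('GOOG', 443.05)]
cur.executemany('INSERT INTO quotes VALUES (?, ?)', rows)

cur.execute('SELECT price FROM quotes WHERE symbol = ?', ('AAPL',))
print cur.fetchone()

conn.commit()
conn.close()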

The code can be found at
http://notestoself.posterous.com/use-python-and-sqlite3-to-build-a-database-co
A short summary of what I did is at
http://notestoself.posterous.com/use-python-and-sqlite3-to-build-a-database

(Or should I have pasted the code in this message?)

I've been trying to learn from books, but some critique would be very
appreciated. I'm just trying to get a sense of whether I'm doing
things in an unnecessarily convoluted way.

--
morenotestoself.wordpress.com


Re: [Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-07 Thread David Kim
Thanks Kent, perhaps I'll cool the Python jets and move on to HTTP and
HTML. I was hoping it would be something I could just pick up along
the way, but it looks like I was wrong.
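
Before I shelve it, here is my rough understanding of the cookie-plus-form-POST
approach as an untested sketch (the field names come from the consent form Kent
quotes below; everything else is guesswork on my part):

import urllib
import urllib2
import cookielib

# Untested sketch: field names are taken from the consent form quoted below.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

consent_url = 'http://www.dtcc.com/products/consent.php'
form_data = urllib.urlencode({
    'urltarget': 'tiwd/products/derivserv/data_table_i.php',
    'check_one': '1',
    'tag': 'tiwdata',
    'acknowledgement': 'I Agree',
})
opener.open(consent_url, form_data)   # POST the "I Agree" form, keep the cookie

# Now the data page should come back instead of the ToS page.
table_url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'
html = opener.open(table_url).read()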

dk

On Tue, Jul 7, 2009 at 1:56 PM, Kent Johnson ken...@tds.net wrote:
 On Tue, Jul 7, 2009 at 1:20 PM, David Kim davidki...@gmail.com wrote:
 On Tue, Jul 7, 2009 at 7:26 AM, Kent Johnson ken...@tds.net wrote:

 curl works because it ignores the redirect to the ToS page, and the
 site is (astoundingly) dumb enough to serve the content with the
 redirect. You could make urllib2 behave the same way by defining a 302
 handler that does nothing.

 Many thanks for the redirect pointer! I also found
 http://diveintopython.org/http_web_services/redirects.html. Is the
 handler class on this page what you mean by a handler that does
 nothing? (It looks like it exposes the error code but still follows
 the redirect).

 No, all of those examples are handling the redirect. The
 SmartRedirectHandler just captures additional status. I think you need
 something like this:
 class IgnoreRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        return None

    def http_error_302(self, req, fp, code, msg, headers):
        return None

 I guess I'm still a little confused since, if the
 handler does nothing, won't I still go to the ToS page?

 No, it is the action of the handler, responding to the redirect
 request, that causes the ToS page to be fetched.

 For example, I ran the following code (found at
 http://stackoverflow.com/questions/554446/how-do-i-prevent-pythons-urllib2-from-following-a-redirect)

 That is pretty similar to the DiP code...

 I suspect I am not understanding something basic about how urllib2
 deals with this redirect issue since it seems everything I try gives
 me the same ToS page.

 Maybe you don't understand how redirect works in general...

 Generally you have to post to the same url as the form, giving the
 same data the form does. You can inspect the source of the form to
 figure this out. In this case the form is

 <form method="post" action="/products/consent.php">
   <input type="hidden" value="tiwd/products/derivserv/data_table_i.php" name="urltarget"/>
   <input type="hidden" value="1" name="check_one"/>
   <input type="hidden" value="tiwdata" name="tag"/>
   <input type="submit" value="I Agree" name="acknowledgement"/>
   <input type="submit" value="Decline" name="acknowledgement"/>
 </form>

 You generally need to enable cookie support in urllib2 as well,
 because the site will use a cookie to flag that you saw the consent
 form. This tutorial shows how to enable cookies and submit form data:
 http://personalpages.tds.net/~kent37/kk/00010.html

 I have seen the login examples where one provides values for the
 fields username and password (thanks Kent). Given the form above,
 however, it's unclear to me how one POSTs the form data when you
 aren't actually passing any parameters. Perhaps this is less of a
 Python question and more an http question (which unfortunately I know
 nothing about either).

 Yes, the parameters are listed in the form.

 If you don't have at least a basic understanding of HTTP and HTML you
 are going to have trouble with this project...

 Kent




-- 
morenotestoself.wordpress.com


[Tutor] Urllib, mechanize, beautifulsoup, lxml do not compute (for me)!

2009-07-06 Thread David Kim
Hello all,

I have two questions I'm hoping someone will have the patience to
answer as an act of mercy.

I. How to get past a Terms of Service page?

I've just started learning python (have never done any programming
prior) and am trying to figure out how to open or download a website
to scrape data. The only problem is, whenever I try to open the link I'm
after (via urllib2, for example), I end up getting the HTML for a Terms of
Service page (where one has to click an "I Agree" button)
rather than the actual target page.

I've seen examples on the web on providing data for forms (typically
by finding the name of the form and providing some sort of dictionary
to fill in the form fields), but this simple act of getting past "I
Agree" is stumping me. Can anyone save my sanity? As a workaround,
I've been using os.popen('curl ' + url + ' > ' + filename) to save the html
in a txt file for later processing. I have no idea why curl works and
urllib2, for example, doesn't (I use OS X). I even tried to use Yahoo
Pipes to try and sidestep coding anything altogether, but ended up
looking at the same Terms of Service page anyway.

Here's the code (tho it's probably not that illuminating since it's
basically just opening a url):

import urllib2
url = 'http://www.dtcc.com/products/derivserv/data_table_i.php?id=table1'
#the first of 23 tables
html = urllib2.urlopen(url).read()

II. How to parse html tables with lxml, beautifulsoup? (for dummies)

Assuming I get past the Terms of Service, I'm a bit overwhelmed by the
need to know XPath, CSS, XML, DOM, etc. to scrape data from the web.
I've tried looking at the documentation included with different python
libraries, but just got more confused.

The basic tutorials show something like the following:

from lxml import html
doc = html.parse('/path/to/test.txt') #the file I downloaded via curl
root = doc.getroot() #what is this root business?
tables = root.cssselect('table')

I understand that selecting all the table tags will somehow target
however many tables on the page. The problem is the table has multiple
headers, empty cells, etc. Most of the examples on the web have to do
with scraping the web for search results or something that doesn't
really depend on the table format for anything other than layout. Are
there any resources out there that are appropriate for web/python
illiterati like myself that deal with structured data as in the url
above?
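
For what it's worth, here is my guess at the general pattern for walking the
rows, based on the tutorials. This is untested against the DTCC pages, so the
handling of the multiple header rows is probably wrong:

from lxml import html

doc = html.parse('/path/to/test.txt')          # the file I downloaded via curl
table = doc.getroot().cssselect('table')[0]    # first table on the page

for tr in table.cssselect('tr'):
    # th cells seem to be the (multi-level) headers, td cells the data
    headers = [th.text_content().strip() for th in tr.cssselect('th')]
    cells = [td.text_content().strip() for td in tr.cssselect('td')]
    if headers:
        print 'header row:', headers
    elif cells:
        print cells        # blank cells come through as empty strings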

FYI, the data in the url above goes up in smoke every week, so I'm
trying to capture it automatically on a weekly basis. Getting all of
it into a CSV or database would be a personal cause for celebration as
it would be the first really useful thing I've done with python since
starting to learn it a few months ago.

For anyone who is interested, here is the code that uses curl to
pull the webpages. It basically just builds the url string for the
different table-pages and saves down the file with a timestamped
filename:

import os
from time import strftime

BASE_URL = 'http://www.dtcc.com/products/derivserv/data_table_'
SECTIONS = {'section1': {'select': 'i.php?id=table', 'id': range(1, 9)},
            'section2': {'select': 'ii.php?id=table', 'id': range(9, 17)},
            'section3': {'select': 'iii.php?id=table', 'id': range(17, 24)}}

def get_pages():

    filenames = []
    path = '~/Dev/Data/DTCC_DerivServ/'
    #os.popen('cd ' + path)

    for section in SECTIONS:
        for id in SECTIONS[section]['id']:
            #urlList.append(BASE_URL + SECTIONS[section]['select'] + str(id))
            url = BASE_URL + SECTIONS[section]['select'] + str(id)
            timestamp = strftime('%Y%m%d_')
            #sectionName = BASE_URL.split('/')[-1]
            sectionNumber = SECTIONS[section]['select'].split('.')[0]
            tableNumber = str(id) + '_'
            filename = timestamp + tableNumber + sectionNumber + '.txt'
            # note the shell redirect ('>') that writes each page to a file
            os.popen('curl ' + url + ' > ' + path + filename)
            filenames.append(filename)

    return filenames

if (__name__ == '__main__'):
    get_pages()


--
morenotestoself.wordpress.com