Re: [Tutor] Parsing html tables and using numpy for subsequent processing

2009-09-17 Thread David Kim

 Gerard wrote:
 Not very pretty, but I imagine there are very few pretty examples of
 this kind of thing. I'll add more comments...honest. Nothing obviously
 wrong with your code to my eyes.


Many thanks, Gerard. I appreciate you looking it over. I'll take a look at the
link you posted as well (I'm traveling at the moment).

Cheers,

-- 
David Kim

I hear and I forget. I see and I remember. I do and I understand. --
 Confucius

morenotestoself.wordpress.com
financialpython.wordpress.com
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Parsing html tables and using numpy for subsequent processing

2009-09-16 Thread Alan Gauld
David Kim davidki...@gmail.com wrote

 The code can be found at pastebin:
 http://financialpython.pastebin.com/f4efd8930

Nothing to do with the parsing but I noticed:

def get_files(path): 
  ''' Get a list of all files in a given directory. 
  Returns a list of filename strings. ''' 
  files = os.listdir(path) 
  return files 

Since you are just returning the result of listdir you 
could achieve the same effect by simply aliasing listdir:

get_files = os.listdir


Much less typing!
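
A small sketch of the aliasing idea, written for Python 3. The name get_files_alias is made up here so that both versions can sit side by side; it is not in the original pastebin code.

```python
import os


def get_files(path):
    '''Original wrapper: returns os.listdir(path) unchanged.'''
    return os.listdir(path)


# Alias: bind a second name to the very same function object.
# (get_files_alias is a hypothetical name used only for comparison.)
get_files_alias = os.listdir

# Both calls list the same directory entries.
print(sorted(get_files('.')) == sorted(get_files_alias('.')))
```

One trade-off: the alias carries listdir's own docstring and signature, so the wrapper's docstring is lost.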


HTH,


-- 
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/



Re: [Tutor] Parsing html tables and using numpy for subsequent processing

2009-09-16 Thread Gerard Flanagan

David Kim wrote:

Hello all,

I've finally gotten around to my 'learn how to parse html' project. For 
those of you looking for examples (like me!), hopefully it will show you 
one potentially thickheaded way to do it.

[...]

The code can be found at pastebin: 
http://financialpython.pastebin.com/f4efd8930
The original html can be found at 
http://www.dtcc.com/products/derivserv/data/index.php (I am pulling and 
parsing tables from all three sections).




Doing something similar at the minute if you want to compare:

http://bitbucket.org/djerdo/tronslenk/src/tip/data/scrape_translink.py


Not very pretty, but I imagine there are very few pretty examples of 
this kind of thing. I'll add more comments...honest. Nothing obviously 
wrong with your code to my eyes.


Regards

g.



[Tutor] Parsing html tables and using numpy for subsequent processing

2009-09-15 Thread David Kim
Hello all,

I've finally gotten around to my 'learn how to parse html' project. For
those of you looking for examples (like me!), hopefully it will show you one
potentially thickheaded way to do it.

For those of you with powerful python-fu, I would appreciate any feedback
regarding the direction I'm taking and obvious coding no-no's (I have no
formal training in computer science). Please note the project is unfinished,
so there isn't a nice, neat result quite yet.

Rather than spam the list with a long description, please visit the
following post where I outline my approach and provide necessary links --
http://financialpython.wordpress.com/2009/09/15/parsing-dtcc-part-1-pita/

The code can be found at pastebin:
http://financialpython.pastebin.com/f4efd8930
The original html can be found at
http://www.dtcc.com/products/derivserv/data/index.php (I am pulling and
parsing tables from all three sections).

Many thanks!

-- DK


Re: [Tutor] parsing html.

2008-01-16 Thread Alan Gauld

Shriphani Palakodety [EMAIL PROTECTED] wrote in

 I have a html document here which goes like this:

 <A name="4"></a><b>Table of Contents</b>
 .
 <A name="5"></a><b>Preface</b>

 Can someone tell me how I can get the string between the b tag for
 an a tag for a given value of the name attribute.

Here's an example using the standard library HTML parser
(from an unfinished topic in tutorial...). You could also
use BeautifulSoup and I recommend that if your needs get
any more complex...

--
In practice we usually want to extract more specific data from a page, 
maybe the content of a particular row in a table or similar. For that 
we need to use the handle_starttag() and handle_endtag() methods. As 
an example let's extract the text of the second H1 level header:
html = '''
<html><head><title>Test page</title></head>
<body>
<center>
<h1>Here is the first heading</h1>
</center>
<p>A short paragraph
<h1>A second heading</h1>
<p>A paragraph containing a
<a href="www.google.com">hyperlink to google</a>
</body></html>
'''

from HTMLParser import HTMLParser

class H1Parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.h1_count = 0
        self.isHeading = False

    def handle_starttag(self,tag,attributes=None):
        if tag == 'h1':
            self.h1_count += 1
            self.isHeading = True

    def handle_endtag(self,tag):
        if tag == 'h1':
            self.isHeading = False

    def handle_data(self,data):
        if self.isHeading and self.h1_count == 2:
            print "Second Header contained: ", data

parser = H1Parser()
parser.feed(html)
parser.close()
--
Hopefully you can see how to alter that pattern to suit your scenario.

-- 
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld
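
As a rough illustration of how the handler pattern above might be adapted to the question in this thread, here is a sketch for Python 3, where the module is html.parser and print is a function. The class name AnchorTextParser and its attributes are made up for this example. The sample string assumes the question's markup was an empty anchor followed by a bold element.

```python
from html.parser import HTMLParser  # Python 3 home of HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect the <b> text that follows each <a name=...> anchor."""

    def __init__(self):
        super().__init__()
        self.current_name = None   # name attribute of the latest <a>
        self.in_bold = False       # True while inside a <b> element
        self.titles = {}           # maps anchor name -> bold text

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag names, so <A> arrives as 'a'.
        if tag == 'a':
            self.current_name = dict(attrs).get('name')
        elif tag == 'b' and self.current_name is not None:
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == 'b':
            self.in_bold = False
            self.current_name = None

    def handle_data(self, data):
        if self.in_bold:
            previous = self.titles.get(self.current_name, '')
            self.titles[self.current_name] = previous + data


# Assumed reconstruction of the snippet from the question.
sample = '''<A name="4"></a><b>Table of Contents</b>
.
<A name="5"></a><b>Preface</b>'''

parser = AnchorTextParser()
parser.feed(sample)
parser.close()
print(parser.titles.get('5'))
```

Looking a name up in parser.titles then gives the bold text that follows that anchor. Bold text that is not preceded by a named anchor is ignored.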




Re: [Tutor] parsing html.

2008-01-16 Thread Kent Johnson
Shriphani Palakodety wrote:
 Hello,
 I have a html document here which goes like this:
 
 <A name="4"></a><b>Table of Contents</b>
 .
 <A name="5"></a><b>Preface</b>
 
 Can someone tell me how I can get the string between the b tag for
 an a tag for a given value of the name attribute.

In [30]: from BeautifulSoup import BeautifulSoup
In [31]: text = '''<A name="4"></a><b>Table of Contents</b>
   ....: .
   ....: <A name="5"></a><b>Preface</b>'''
In [32]: soup = BeautifulSoup(text)
In [40]: soup.find('a', dict(name='5'))
Out[40]: <a name="5"></a>
In [41]: soup.find('a', dict(name='5')).next
Out[41]: <b>Preface</b>
In [42]: soup.find('a', dict(name='5')).next.string
Out[42]: u'Preface'

Note BeautifulSoup lower-cases the tag name.
http://www.crummy.com/software/BeautifulSoup/

Kent


Re: [Tutor] parsing html.

2008-01-16 Thread Paul McGuire
Here is a pyparsing approach to your question.  I've added some comments to
walk you through the various steps.  By using pyparsing's makeHTMLTags
helper method, it is easy to write short programs to skim selected data tags
from out of an HTML page.

-- Paul


from pyparsing import makeHTMLTags, SkipTo

html = """
<A name="4"></a><b>Table of Contents</b>
.
<A name="5"></a><b>Preface</b>
"""

# define the pattern to search for, using pyparsing makeHTMLTags helper
# makeHTMLTags constructs a very tolerant mini-pattern, to match HTML
# tags with the given tag name:
# - caseless matching on the tag name
# - embedded whitespace is handled
# - detection of empty tags (opening tags that end in /)
# - detection of tag attributes
# - returning parsed data using results names for attribute values
# makeHTMLTags actually returns two patterns, one for the opening tag
# and one for the closing tag
aStart,aEnd = makeHTMLTags("A")
bStart,bEnd = makeHTMLTags("B")
pattern = aStart + aEnd + bStart + SkipTo(bEnd)("text") + bEnd

# search the input string - dump matched structure for each match
for pp in pattern.searchString(html):
    print pp.dump()
    print pp.startA.name, pp.text

# parse input and build a dict using the results
nameDict = dict( (pp.startA.name,pp.text) for pp in
                 pattern.searchString(html) )
print nameDict


The last line of the output is the dict that is created:

{'5': 'Preface', '4': 'Table of Contents'}






[Tutor] parsing html.

2008-01-15 Thread Shriphani Palakodety
Hello,
I have a html document here which goes like this:

<A name="4"></a><b>Table of Contents</b>
.
<A name="5"></a><b>Preface</b>

Can someone tell me how I can get the string between the b tag for
an a tag for a given value of the name attribute.

Thanks,
Shriphani Palakodety


[Tutor] Parsing html user HTMLParser

2006-04-20 Thread ទិត្យវិរៈ
Hi folks,

I need help here. I'm struggling with the html parsing method; up until now
I can only put an html file in as an instance. I have no experience with
this. I want to read the children inside this document and modify the
data. What can I do if I start from here?

 from HTMLParser import HTMLParser

 p = HTMLParser()
 s = open('/home/virak/Documents/peace/test.html').read()
 p.feed(s)

 print p

 p.close()
Titvirak


Re: [Tutor] Parsing html user HTMLParser

2006-04-20 Thread Kent Johnson
ទិត្យវិរៈ wrote:
 Hi folks,
 
 I need help here. I'm struggling with the html parsing method; up until now
 I can only put an html file in as an instance. I have no experience with
 this. I want to read the children inside this document and modify the
 data. What can I do if I start from here?
 
 from HTMLParser import HTMLParser

 p = HTMLParser()
 s = open('/home/virak/Documents/peace/test.html').read()
 p.feed(s)

 print p

 p.close()

Here is an example that might be useful, though the usage is not too 
clear...
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/286269
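
For context, the base HTMLParser class does nothing visible with what it is fed, which is why printing the parser object shows nothing useful. The handler methods have to be overridden in a subclass. The following is a minimal Python 3 sketch (module html.parser) that records each element and text node with its nesting depth. It is not the code from the recipe linked above. OutlineParser and the sample string are made up for illustration, and the depth tracking assumes every opened tag is explicitly closed.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record each element and text node with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []   # list of (depth, description) tuples

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, '<%s>' % tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-nested markup: every start tag has an end tag.
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.outline.append((self.depth, text))


# Hypothetical stand-in for the contents of test.html.
sample = '<html><body><h1>Title</h1><p>Some <b>bold</b> text</p></body></html>'

p = OutlineParser()
p.feed(sample)
p.close()
for depth, item in p.outline:
    print('  ' * depth + item)
```

Modifying the data would then mean changing what the handlers store or emit. The parser reports events as it reads and does not build a document tree to edit in place.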

Kent



Re: [Tutor] Parsing html user HTMLParser

2006-04-20 Thread Danny Yoo


 I need help here. I'm struggling with the html parsing method; up until now
 I can only put an html file in as an instance. I have no experience with
 this. I want to read the children inside this document and modify the
 data. What can I do if I start from here?

Hi Titvirak,

You might want to take a look at a different module for parsing HTML.  A 
popular one is BeautifulSoup:

 http://www.crummy.com/software/BeautifulSoup/

Their quick-start page shows how to do simple stuff.  There are a few 
oddities with BeautifulSoup, but on the whole, it's pretty good.

Good luck to you!