Re: [Tutor] Parsing html tables and using numpy for subsequent processing
Gerard wrote:
> Not very pretty, but I imagine there are very few pretty examples of this kind of thing. I'll add more comments...honest. Nothing obviously wrong with your code to my eyes.

Many thanks Gerard, appreciate you looking it over. I'll take a look at the link you posted as well (I'm traveling at the moment).

Cheers,
--
David Kim

"I hear and I forget. I see and I remember. I do and I understand." -- Confucius

morenotestoself.wordpress.com
financialpython.wordpress.com
___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Parsing html tables and using numpy for subsequent processing
David Kim davidki...@gmail.com wrote:
> The code can be found at pastebin: http://financialpython.pastebin.com/f4efd8930

Nothing to do with the parsing but I noticed:

    def get_files(path):
        '''
        Get a list of all files in a given directory.
        Returns a list of filename strings.
        '''
        files = os.listdir(path)
        return files

Since you are just returning the result of listdir you could achieve the same effect by simply aliasing listdir:

    get_files = os.listdir

Much less typing!

HTH,
--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/
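The aliasing suggestion can be checked with a minimal Python 3 sketch. The temporary directory and the two file names are hypothetical and exist only to make the example self-contained; they are not part of the original script.

```python
import os
import tempfile

# Alias: get_files is simply another name for os.listdir itself.
get_files = os.listdir

# Hypothetical demo directory, created only for this example.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.html", "b.html"):
        open(os.path.join(demo_dir, name), "w").close()
    # os.listdir makes no ordering promise, so sort before comparing.
    listing = sorted(get_files(demo_dir))

print(listing)
```

One trade-off worth noting: the alias carries the docstring and name of os.listdir, so the custom docstring on the original wrapper is lost.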
Re: [Tutor] Parsing html tables and using numpy for subsequent processing
David Kim wrote:
> Hello all, I've finally gotten around to my 'learn how to parse html' project. For those of you looking for examples (like me!), hopefully it will show you one potentially thickheaded way to do it.
> [...]
> The code can be found at pastebin: http://financialpython.pastebin.com/f4efd8930
> The original html can be found at http://www.dtcc.com/products/derivserv/data/index.php (I am pulling and parsing tables from all three sections).

Doing something similar at the minute if you want to compare:
http://bitbucket.org/djerdo/tronslenk/src/tip/data/scrape_translink.py

Not very pretty, but I imagine there are very few pretty examples of this kind of thing. I'll add more comments...honest. Nothing obviously wrong with your code to my eyes.

Regards
g.
[Tutor] Parsing html tables and using numpy for subsequent processing
Hello all,

I've finally gotten around to my 'learn how to parse html' project. For those of you looking for examples (like me!), hopefully it will show you one potentially thickheaded way to do it. For those of you with powerful python-fu, I would appreciate any feedback regarding the direction I'm taking and obvious coding no-no's (I have no formal training in computer science). Please note the project is unfinished, so there isn't a nice, neat result quite yet.

Rather than spam the list with a long description, please visit the following post where I outline my approach and provide necessary links:
http://financialpython.wordpress.com/2009/09/15/parsing-dtcc-part-1-pita/

The code can be found at pastebin: http://financialpython.pastebin.com/f4efd8930

The original html can be found at http://www.dtcc.com/products/derivserv/data/index.php (I am pulling and parsing tables from all three sections).

Many thanks!
--
DK
Re: [Tutor] parsing html.
Shriphani Palakodety [EMAIL PROTECTED] wrote:
> I have a html document here which goes like this:
>
>     <A name=4></a><b>Table of Contents</b>
>     .
>     <A name=5></a><b>Preface</b>
>
> Can someone tell me how I can get the string between the b tag for an a tag for a given value of the name attribute.

Here's an example using the standard library HTML parser (from an unfinished topic in my tutorial...). You could also use BeautifulSoup, and I recommend that if your needs get any more complex...

--
In practice we usually want to extract more specific data from a page, maybe the content of a particular row in a table or similar. For that we need to use the handle_starttag() and handle_endtag() methods. As an example let's extract the text of the second H1 level header:

    html = '''
    <html><head><title>Test page</title></head>
    <body>
    <center>
    <h1>Here is the first heading</h1>
    </center>
    <p>A short paragraph
    <h1>A second heading</h1>
    <p>A paragraph containing a
    <a href="www.google.com">hyperlink to google</a>
    </body></html>
    '''

    from HTMLParser import HTMLParser

    class H1Parser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.h1_count = 0
            self.isHeading = False

        def handle_starttag(self,tag,attributes=None):
            if tag == 'h1':
                self.h1_count += 1
                self.isHeading = True

        def handle_endtag(self,tag):
            if tag == 'h1':
                self.isHeading = False

        def handle_data(self,data):
            if self.isHeading and self.h1_count == 2:
                print "Second Header contained: ", data

    parser = H1Parser()
    parser.feed(html)
    parser.close()
--

Hopefully you can see how to alter that pattern to suit your scenario.

--
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld
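One way the pattern above might be altered for the original question is sketched below. It assumes Python 3, where the module is named html.parser rather than HTMLParser; the class name AnchorTitleParser and its attributes are hypothetical, chosen for illustration only.

```python
from html.parser import HTMLParser

html = '''<A name=4></a><b>Table of Contents</b>
.
<A name=5></a><b>Preface</b>'''


class AnchorTitleParser(HTMLParser):
    """Map each <a name=...> value to the text of the <b> that follows it."""

    def __init__(self):
        super().__init__()
        self.current_name = None  # name attribute of the most recent <a>
        self.in_bold = False      # True while inside the <b> we care about
        self.titles = {}          # e.g. {'5': 'Preface'}

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names for us.
        if tag == 'a':
            name = dict(attrs).get('name')
            if name is not None:
                self.current_name = name
        elif tag == 'b' and self.current_name is not None:
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == 'b' and self.in_bold:
            self.in_bold = False
            self.current_name = None

    def handle_data(self, data):
        if self.in_bold:
            # Text may arrive in several chunks, so accumulate it.
            previous = self.titles.get(self.current_name, '')
            self.titles[self.current_name] = previous + data


parser = AnchorTitleParser()
parser.feed(html)
parser.close()
print(parser.titles['5'])
```

Collecting every name into a dict, rather than printing inside handle_data, lets the caller look up any name value after a single pass over the document.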
Re: [Tutor] parsing html.
Shriphani Palakodety wrote:
> Hello,
> I have a html document here which goes like this:
>
>     <A name=4></a><b>Table of Contents</b>
>     .
>     <A name=5></a><b>Preface</b>
>
> Can someone tell me how I can get the string between the b tag for an a tag for a given value of the name attribute.

    In [30]: from BeautifulSoup import BeautifulSoup

    In [31]: text = '''<A name=4></a><b>Table of Contents</b>
       ....: .
       ....: <A name=5></a><b>Preface</b>'''

    In [32]: soup = BeautifulSoup(text)

    In [40]: soup.find('a', dict(name='5'))
    Out[40]: <a name="5"></a>

    In [41]: soup.find('a', dict(name='5')).next
    Out[41]: <b>Preface</b>

    In [42]: soup.find('a', dict(name='5')).next.string
    Out[42]: u'Preface'

Note BeautifulSoup lower-cases the tag name.

http://www.crummy.com/software/BeautifulSoup/

Kent
Re: [Tutor] parsing html.
Here is a pyparsing approach to your question. I've added some comments to walk you through the various steps. By using pyparsing's makeHTMLTags helper method, it is easy to write short programs to skim selected data tags out of an HTML page.

-- Paul

    from pyparsing import makeHTMLTags, SkipTo

    html = """
    <A name=4></a><b>Table of Contents</b>
    .
    <A name=5></a><b>Preface</b>
    """

    # define the pattern to search for, using pyparsing makeHTMLTags helper
    # makeHTMLTags constructs a very tolerant mini-pattern, to match HTML
    # tags with the given tag name:
    # - caseless matching on the tag name
    # - embedded whitespace is handled
    # - detection of empty tags (opening tags that end in "/>")
    # - detection of tag attributes
    # - returning parsed data using results names for attribute values
    # makeHTMLTags actually returns two patterns, one for the opening tag
    # and one for the closing tag
    aStart,aEnd = makeHTMLTags("A")
    bStart,bEnd = makeHTMLTags("B")
    pattern = aStart + aEnd + bStart + SkipTo(bEnd)("text") + bEnd

    # search the input string - dump matched structure for each match
    for pp in pattern.searchString(html):
        print pp.dump()
        print pp.startA.name, pp.text

    # parse input and build a dict using the results
    nameDict = dict( (pp.startA.name,pp.text) for pp in pattern.searchString(html) )
    print nameDict

The last line of the output is the dict that is created:

    {'5': 'Preface', '4': 'Table of Contents'}
[Tutor] parsing html.
Hello,

I have a html document here which goes like this:

    <A name=4></a><b>Table of Contents</b>
    .
    <A name=5></a><b>Preface</b>

Can someone tell me how I can get the string between the b tag for an a tag for a given value of the name attribute.

Thanks,
Shriphani Palakodety
[Tutor] Parsing html using HTMLParser
Hi folks,

I need help here, I'm struggling with the html parsing method. Up until now I can only put an html file into a parser instance. I have no experience with this. I want to read the children inside this document and modify the data. What can I do if I start from here?

    from HTMLParser import HTMLParser

    p = HTMLParser()
    s = open('/home/virak/Documents/peace/test.html').read()
    p.feed(s)
    print p
    p.close()

Titvirak
Re: [Tutor] Parsing html using HTMLParser
ទិត្យវិរៈ wrote:
> Hi folks,
> I need help here, I'm struggling with the html parsing method. Up until now I can only put an html file into a parser instance. I have no experience with this. I want to read the children inside this document and modify the data. What can I do if I start from here?
>
>     from HTMLParser import HTMLParser
>
>     p = HTMLParser()
>     s = open('/home/virak/Documents/peace/test.html').read()
>     p.feed(s)
>     print p
>     p.close()

Here is an example that might be useful, though the usage is not too clear...
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/286269

Kent
Re: [Tutor] Parsing html using HTMLParser
> I need help here, I'm struggling with the html parsing method. Up until now I can only put an html file into a parser instance. I have no experience with this. I want to read the children inside this document and modify the data. What can I do if I start from here?

Hi Titvirak,

You might want to take a look at a different module for parsing HTML. A popular one is BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/

Their quick-start page shows how to do simple stuff. There are a few oddities with BeautifulSoup, but on the whole, it's pretty good.

Good luck to you!
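Staying with the standard-library parser the question starts from: the base HTMLParser class has handler methods that do nothing by default, so feeding it a file and printing the instance shows no parsed content. A subclass has to override the handlers. The following is only a rough Python 3 sketch (the module is html.parser there); the class name TextUppercaser and the sample string are hypothetical, and upper-casing stands in for whatever modification is actually wanted.

```python
from html import escape
from html.parser import HTMLParser


class TextUppercaser(HTMLParser):
    """Rebuild the document, upper-casing text and leaving tags untouched."""

    def __init__(self):
        super().__init__()
        self.pieces = []  # fragments of the rebuilt document, in order

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.pieces.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are re-emitted once, unchanged.
        self.pieces.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.pieces.append('</%s>' % tag)

    def handle_data(self, data):
        # Character references arrive already decoded, so re-escape them.
        self.pieces.append(escape(data.upper(), quote=False))


# Hypothetical sample input; a real script would read it from a file.
sample = '<html><body><p>peace &amp; quiet</p></body></html>'

p = TextUppercaser()
p.feed(sample)
p.close()
result = ''.join(p.pieces)
print(result)
```

This sketch drops comments and doctype declarations and does not treat script or style content specially, so it suits small, simple documents; for anything larger, a tree-building library such as the BeautifulSoup suggested above is likely the easier route.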