Re: Parsing html with Beautifulsoup
On Mon, 14 Dec 2009 03:58:34 -0300, Johann Spies wrote:

> On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote:
>
>> cell.findAll(text=True) returns a list of all text nodes inside a
>> cell; I preprocess all \n and &nbsp; in each text node, and
>> join them all. lines is a list of lists (each entry one cell), as
>> expected by the csv module used to write the output file.
>
> I have struggled a bit to find the documentation for (text=True).
> Most of the documentation for BeautifulSoup I saw contained only
> examples, without explaining what the options do. Thanks for your
> explanation.

See http://www.crummy.com/software/BeautifulSoup/documentation.html#arg-text

> As far as I can see there was no documentation installed with the
> Debian package.

BeautifulSoup is very small - a single .py file, no dependencies. The whole documentation is contained in the page linked above.

--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list
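To make the text=True discussion concrete, here is a minimal sketch. It assumes the later bs4 package rather than the BeautifulSoup 3 module used in this thread; in bs4 the same filter is spelled string=True and findAll is spelled find_all. The sample HTML is hypothetical and only loosely modelled on the table rows posted in the thread.

```python
from bs4 import BeautifulSoup

# Hypothetical cell: an icon, a non-breaking space, then a link with the label.
html = (
    '<table><tr><td>2</td>'
    '<td><img src="icons/usrgroup.png">&nbsp;<a href="#users">All users</a></td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, "html.parser")
cell = soup.find_all("td")[1]

# find() stops at the first text node, which here is only the non-breaking space.
first = cell.find(string=True)

# find_all() returns every text node in the cell, including text nested in <a>.
all_nodes = cell.find_all(string=True)

# Drop the non-breaking space (bs4 decodes &nbsp; to "\xa0") and join the rest.
text = "".join(all_nodes).replace("\xa0", "").strip()
print(repr(first), list(all_nodes), text)
```

In this sample the label sits in the second text node, so a lookup that returns only the first node misses it, while collecting all text nodes keeps it.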
Re: Parsing html with Beautifulsoup
On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote:

> this code should serve as a starting point:

Thank you very much!

> cell.findAll(text=True) returns a list of all text nodes inside a
> cell; I preprocess all \n and &nbsp; in each text node, and
> join them all. lines is a list of lists (each entry one cell), as
> expected by the csv module used to write the output file.

I have struggled a bit to find the documentation for (text=True). Most of the documentation for BeautifulSoup I saw contained only examples, without explaining what the options do. Thanks for your explanation.

As far as I can see there was no documentation installed with the Debian package.

Regards
Johann
--
Johann Spies          Phone: 021-808 4599
Information Technology, Universiteit van Stellenbosch

"But I will hope continually, and will yet praise thee more and more."
     Psalms 71:14
Re: Parsing html with Beautifulsoup
On Fri, 11 Dec 2009 04:04:38 -0300, Johann Spies wrote:

> Gabriel Genellina wrote:
>
>> On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies wrote:
>>
>>> How do I get Beautifulsoup to render (taking the above line as
>>> example) sunentint for sunetint and still provide the text-parts
>>> in the <td>'s with plain text?
>>
>> Hard to tell if we don't see what's inside those <td>'s - please
>> provide at least a few rows of the original HTML table.
>
> Thanks for your reply. Here are a few lines:
>
> 2src=icons/usrgroup.png> All us...@any Anysrc=icons/clientencrypt.png> clientencrypt

I *think* I finally understand what you want (your previous example above confused me). If you want Rule 1 to generate a line like this:

    2,All us...@any,

cell.findAll(text=True) returns a list of all text nodes inside a cell; I preprocess all \n and &nbsp; in each text node, and join them all. lines is a list of lists (each entry one cell), as expected by the csv module used to write the output file.

--
Gabriel Genellina
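The clean-up and CSV steps described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code from the original message: clean_cell is a hypothetical helper, the rows data stands in for what cell.findAll(text=True) would return for each cell, and joining the pieces with a single space is a guess at how the text nodes were combined.

```python
import csv
import io


def clean_cell(text_nodes):
    """Join one cell's text nodes after dropping \\n and &nbsp; (hypothetical helper)."""
    cleaned = []
    for node in text_nodes:
        # Remove newlines and non-breaking spaces, in entity or decoded form.
        for junk in ("\n", "&nbsp;", "\xa0"):
            node = node.replace(junk, "")
        cleaned.append(node.strip())
    # Assumption: surviving pieces are joined with a single space.
    return " ".join(part for part in cleaned if part)


# Hypothetical input: one list of text nodes per cell, one list of cells per row.
rows = [
    [["2"], ["&nbsp;", "All users\n"], ["Any"]],
    [["3"], ["&nbsp;", "Any"], ["drop"]],
]

# lines is a list of lists, which is the shape csv.writer.writerows() expects.
lines = [[clean_cell(cell) for cell in row] for row in rows]

buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(lines)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch self-contained; for a real run the writer would wrap a file opened for the output CSV instead.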
Re: Parsing html with Beautifulsoup
Gabriel Genellina wrote:

> On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies wrote:
>
>> How do I get Beautifulsoup to render (taking the above line as
>> example) sunentint for sunetint and still provide the text-parts
>> in the <td>'s with plain text?
>
> Hard to tell if we don't see what's inside those <td>'s - please
> provide at least a few rows of the original HTML table.

Thanks for your reply. Here are a few lines:

2src=icons/usrgroup.png> All us...@any Anysrc=icons/clientencrypt.png> clientencrypt
3src=icons/any.png> Any
4src=icons/group.png> >Rainwall_Group >RainWall_Stop&nb$
5src=icons/host.png> >Rainwall_Group >Rainwall_Group >Rainwall_Broadcast >RainWall_Daemon

Regards
Johann
Re: Parsing html with Beautifulsoup
On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies wrote:

> How do I get Beautifulsoup to render (taking the above line as
> example) sunentint for sunetint and still provide the text-parts
> in the <td>'s with plain text?

Hard to tell if we don't see what's inside those <td>'s - please provide at least a few rows of the original HTML table.

--
Gabriel Genellina
Parsing html with Beautifulsoup
I am trying to get CSV output from an HTML file.

With this code I had a little success:

====================
from BeautifulSoup import BeautifulSoup
from string import replace, join
import re

f = open("configuration.html","r")
g = open("configuration.csv",'w')

soup = BeautifulSoup(f)
t = soup.findAll('table')
for table in t:
    rows = table.findAll('tr')
    for th in rows[0]:
        t = th.find(text=True)
        g.write(t)
        g.write(',')
    #print(','.join(t))
    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            try:
                t = td.find(text=True).replace('&nbsp;','')
                g.write(t)
            except:
                g.write ('')
            g.write(",")
        g.write("\n")
====================

producing output like this:

RULE,SOURCE,DESTINATION,SERVICES,ACTION,TRACK,TIME,INSTALL ON,COMMENTS,
1drop,Log,Any,,,
2,All us...@any,,Any,clientencrypt,Log,Any,,,
3,Any,Any,,drop,None,Any,,,
4drop,None,Any,,,
...

It left out all the non-plaintext parts of the <td>'s.

I then tried using t.renderContents and then got something like this (one line broken into many for the sake of this email):

1,
sunetint,
href=#OBJ_Rainwall_Cluster >Rainwall_Cluster ,
src=icons/udp.png> IKE,
drop,
Log ,
Any ,
Rainwall_Cluster ,

How do I get Beautifulsoup to render (taking the above line as example) sunentint for sunetint and still provide the text-parts in the <td>'s with plain text?

I have experimented a little bit with regular expressions, but so far could not find a solution.

Regards
Johann