Re: Parsing html with Beautifulsoup

2009-12-14 Thread Gabriel Genellina

On Mon, 14 Dec 2009 03:58:34 -0300, Johann Spies wrote:

> On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote:
>
>> cell.findAll(text=True) returns a list of all text nodes inside a
>> cell; I preprocess all \n and &nbsp; in each text node, and
>> join them all. lines is a list of lists (each entry one cell), as
>> expected by the csv module used to write the output file.
>
> I have struggled a bit to find the documentation for (text=True).
> Most of the documentation for BeautifulSoup I saw contained only
> examples without explaining what the options do.  Thanks for your
> explanation.

See
http://www.crummy.com/software/BeautifulSoup/documentation.html#arg-text

> As far as I can see there was no documentation installed with the
> Debian package.


BeautifulSoup is very small - a single .py file, no dependencies. The  
whole documentation is contained in the above linked page.


--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing html with Beautifulsoup

2009-12-13 Thread Johann Spies
On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote:

> this code should serve as a starting point:

Thank you very much!

> cell.findAll(text=True) returns a list of all text nodes inside a
> cell; I preprocess all \n and &nbsp; in each text node, and
> join them all. lines is a list of lists (each entry one cell), as
> expected by the csv module used to write the output file.

I have struggled a bit to find the documentation for (text=True).
Most of the documentation for BeautifulSoup I saw contained only
examples without explaining what the options do.  Thanks for your
explanation.

As far as I can see there was no documentation installed with the
debian package.

Regards
Johann
-- 
Johann Spies  Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

 "But I will hope continually, and will yet praise thee 
  more and more."  Psalms 71:14 


Re: Parsing html with Beautifulsoup

2009-12-13 Thread Gabriel Genellina
On Fri, 11 Dec 2009 04:04:38 -0300, Johann Spies wrote:

> Gabriel Genellina wrote:
>> On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies wrote:
>>
>>> How do I get Beautifulsoup to render (taking the above line as
>>> example)
>>>
>>> sunentint for  sunetint
>>>
>>> and still provide the text-parts in the <td>'s with plain text?
>>
>> Hard to tell if we don't see what's inside those <td>'s - please
>> provide at least a few rows of the original HTML table.
>
> Thanks for your reply. Here are a few lines:
>
> 2src=icons/usrgroup.png> All us...@any Anysrc=icons/clientencrypt.png> clientencrypt

I *think* I finally understand what you want (your previous example above
confused me).

If you want Rule 1 to generate a line like this:

2,All us...@any,

cell.findAll(text=True) returns a list of all text nodes inside a
cell; I preprocess all \n and &nbsp; in each text node, and join them all.
lines is a list of lists (each entry one cell), as expected by the csv
module used to write the output file.
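
The approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code originally posted: it targets Python 3 and uses the standard library's html.parser as a stand-in for BeautifulSoup's cell.findAll(text=True). The class name TableTextCollector, the function table_to_csv, and the sample table (host_a, host_b) are all made up for the example, and replacing \n and &nbsp; with a space is one possible way of doing the preprocessing.

```python
import csv
import io
from html.parser import HTMLParser


class TableTextCollector(HTMLParser):
    """Collect the text nodes of every <td>/<th> cell, row by row."""

    def __init__(self):
        # convert_charrefs=False so that &nbsp; is reported through
        # handle_entityref instead of being merged into the text.
        super().__init__(convert_charrefs=False)
        self.lines = []     # list of rows; each row is a list of cell strings
        self._row = None    # cells of the row currently being parsed
        self._nodes = None  # text nodes of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._nodes = []

    def handle_data(self, data):
        # Every text node inside a cell is kept, however deeply nested.
        if self._nodes is not None:
            self._nodes.append(data)

    def handle_entityref(self, name):
        # Treat &nbsp; as a plain space; keep other entities verbatim.
        if self._nodes is not None:
            self._nodes.append(" " if name == "nbsp" else "&%s;" % name)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._nodes is not None:
            # Join all text nodes of the cell and collapse whitespace.
            text = "".join(self._nodes).replace("\n", " ")
            self._row.append(" ".join(text.split()))
            self._nodes = None
        elif tag == "tr" and self._row is not None:
            self.lines.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table contents as CSV text (hypothetical helper)."""
    collector = TableTextCollector()
    collector.feed(html)
    collector.close()
    out = io.StringIO()
    # lines is a list of lists, which is what writerows() expects.
    csv.writer(out, lineterminator="\n").writerows(collector.lines)
    return out.getvalue()


# Made-up sample; the real configuration.html is not reproduced here.
sample = """
<table>
<tr><th>RULE</th><th>SOURCE</th><th>ACTION</th></tr>
<tr><td>1</td>
    <td><img src=icons/host.png>&nbsp;<a href=#OBJ_host_a>host_a</a><br>
        <img src=icons/host.png>&nbsp;<a href=#OBJ_host_b>host_b</a></td>
    <td><img src=icons/drop.png>&nbsp;drop</td></tr>
</table>
"""
print(table_to_csv(sample))
```

Letting the csv module write the rows also takes care of quoting, so a cell whose text contains a comma does not break the column layout the way hand-written g.write(',') calls would.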


--
Gabriel Genellina



Re: Parsing html with Beautifulsoup

2009-12-10 Thread Johann Spies

Gabriel Genellina wrote:
> On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies wrote:
>
>> How do I get Beautifulsoup to render (taking the above line as
>> example)
>>
>> sunentint for  sunetint
>>
>> and still provide the text-parts in the <td>'s with plain text?
>
> Hard to tell if we don't see what's inside those <td>'s - please
> provide at least a few rows of the original HTML table.


Thanks for your reply. 


Here are a few lines:


2src=icons/usrgroup.png> All us...@any Anysrc=icons/clientencrypt.png> clientencrypt
  


3src=icons/any.png> Any
  


4src=icons/group.png>  >Rainwall_Group 
 >RainWall_Stop&nb$

  


5src=icons/host.png>  >Rainwall_Group 
 >Rainwall_Group 
 >Rainwall_Broadcast 
 >RainWall_Daemon
  

Regards
Johann






Re: Parsing html with Beautifulsoup

2009-12-10 Thread Gabriel Genellina
On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies wrote:

> How do I get Beautifulsoup to render (taking the above line as
> example)
>
> sunentint for  sunetint
>
> and still provide the text-parts in the <td>'s with plain text?

Hard to tell if we don't see what's inside those <td>'s - please provide
at least a few rows of the original HTML table.


--
Gabriel Genellina



Parsing html with Beautifulsoup

2009-12-10 Thread Johann Spies
I am trying to get CSV output from an HTML file.

With this code I had a little success:
=
from BeautifulSoup import BeautifulSoup
from string import replace, join
import re

f = open("configuration.html","r")
g = open("configuration.csv",'w')
soup = BeautifulSoup(f)
t = soup.findAll('table')
for table in t:
    rows = table.findAll('tr')
    # header row: write the first text node of each header cell
    for th in rows[0]:
        t = th.find(text=True)
        g.write(t)
        g.write(',')
    #print(','.join(t))

    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            try:
                # find(text=True) returns only the first text node of the cell
                t = td.find(text=True).replace('&nbsp;','')
                g.write(t)
            except:
                g.write ('')
            g.write(",")
        g.write("\n")
===

producing output like this:

RULE,SOURCE,DESTINATION,SERVICES,ACTION,TRACK,TIME,INSTALL ON,COMMENTS,
1drop,Log,Any,,,
2,All us...@any,,Any,clientencrypt,Log,Any,,,
3,Any,Any,,drop,None,Any,,,
4drop,None,Any,,,
...

It left out all the non-plaintext parts of the <td>'s.

I then tried using 

t.renderContents and then got something like this (one line broken into
many for the sake of this email):

1, 
sunetint, 
 href=#OBJ_Rainwall_Cluster
>Rainwall_Cluster ,
src=icons/udp.png> IKE,
 drop,
 Log ,
 Any ,
 Rainwall_Cluster  , 

How do I get Beautifulsoup to render (taking the above line as
example)

sunentint for  sunetint

and still provide the text-parts in the <td>'s with plain text?

I have experimented a little bit with regular expressions, but so far
could not find a solution.

Regards
Johann