Re: Parsing HTML with xml.etree in Python 2.7?

2015-10-05 Thread Skip Montanaro
On Mon, Oct 5, 2015 at 9:14 AM, Skip Montanaro  wrote:
> I wouldn't be surprised if there were some small API changes other than the
> name change caused by the move into the xml package. Before I dive into a
> rabbit hole and start to modify elementtidy, is there some other stdlib-only
> way to parse HTML code into an xml.etree.ElementTree?

Never mind. The only change necessary turned out to be the import. /F
writes robust code. :-)
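For anyone hitting the same rename, a minimal sketch of the before/after import (the old package name shown in the comment is the pre-stdlib one):

```python
# Before Python 2.5 the module was a third-party package:
#   from elementtree import ElementTree
# Since 2.5 the same code ships in the stdlib under the xml package:
from xml.etree import ElementTree

root = ElementTree.fromstring("<doc><p>hi</p></doc>")
```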

Skip
-- 
https://mail.python.org/mailman/listinfo/python-list


Parsing HTML with xml.etree in Python 2.7?

2015-10-05 Thread Skip Montanaro
Back before Fredrik Lundh's elementtree module was sucked into the Python
stdlib as xml.etree, I used to use his elementtidy extension module to
clean up HTML source so it could be parsed into an ElementTree object.
Elementtidy hasn't been updated in about ten years, and still assumes there
is a module named "elementtree" which it can import.

I wouldn't be surprised if there were some small API changes other than the
name change caused by the move into the xml package. Before I dive into a
rabbit hole and start to modify elementtidy, is there some other
stdlib-only way to parse HTML code into an xml.etree.ElementTree?
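For well-formed markup the stdlib can get most of the way there without elementtidy, by piping HTMLParser events into an ElementTree TreeBuilder. A rough sketch (Python 3 fallback name included; note it does not handle the unclosed/misnested tag soup that tidy exists for):

```python
try:
    from HTMLParser import HTMLParser      # Python 2.7 name
except ImportError:
    from html.parser import HTMLParser     # Python 3 name
from xml.etree.ElementTree import TreeBuilder


class TreeBuilderParser(HTMLParser):
    """Feed HTMLParser events straight into an ElementTree TreeBuilder.

    Only safe for well-formed input: an unclosed tag will break the tree.
    """

    def __init__(self):
        HTMLParser.__init__(self)
        self._builder = TreeBuilder()

    def handle_starttag(self, tag, attrs):
        self._builder.start(tag, dict(attrs))

    def handle_endtag(self, tag):
        self._builder.end(tag)

    def handle_data(self, data):
        self._builder.data(data)

    def close(self):
        HTMLParser.close(self)
        return self._builder.close()   # returns the root Element


parser = TreeBuilderParser()
parser.feed("<html><body><p>hello</p></body></html>")
root = parser.close()
```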

Thx,

Skip
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Parsing html with Beautifulsoup

2009-12-14 Thread Gabriel Genellina

On Mon, 14 Dec 2009 03:58:34 -0300, Johann Spies jsp...@sun.ac.za
wrote:

On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote:



cell.findAll(text=True) returns a list of all text nodes inside a
<td> cell; I preprocess all \n and &nbsp; in each text node, and
join them all. lines is a list of lists (each inner list one row of
cell texts), as expected by the csv module used to write the output file.


I have struggled a bit to find the documentation for (text=True).
Most of the BeautifulSoup documentation I saw contained some
examples without explaining what the options do.  Thanks for your
explanation.


See  
http://www.crummy.com/software/BeautifulSoup/documentation.html#arg-text



As far as I can see, no documentation was installed with the
Debian package.


BeautifulSoup is very small - a single .py file, no dependencies. The  
whole documentation is contained in the above linked page.


--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing html with Beautifulsoup

2009-12-13 Thread Gabriel Genellina
On Fri, 11 Dec 2009 04:04:38 -0300, Johann Spies jsp...@sun.ac.za
wrote:



Gabriel Genellina wrote:
On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies jsp...@sun.ac.za
wrote:



How do I get Beautifulsoup to render (taking the above line as
example)

sunetint for <img src=icons/group.png>&nbsp;<a
href=#OBJ_sunetint>sunetint</A><BR>

and still provide the text-parts in the td's with plain text?


Hard to tell if we don't see what's inside those td's - please  
provide at least a few rows of the original HTML table.



Thanks for your reply. Here are a few lines:

<!--- Rule 1 --->
<tr style="background-color: #ff"><td class=normal>2</td><td><img
src=icons/usrgroup.png>&nbsp;All us...@any<br><td><im$
</td><td><img src=icons/any.png>&nbsp;Any<br></td><td><img
src=icons/clientencrypt.png>&nbsp;clientencrypt</td><td><img src$

&nbsp;</td><td>&nbsp;</td></tr>


I *think* I finally understand what you want (your previous example above  
confused me).

If you want for Rule 1 to generate a line like this:

2,All us...@any,im$,Any,clientencrypt,,

this code should serve as a starting point:

lines = []
soup = BeautifulSoup(html)
for table in soup.findAll("table"):
    for row in table.findAll("tr"):
        line = []
        for cell in row.findAll("td"):
            text = ' '.join(
                s.replace('\n', ' ').replace('&nbsp;', ' ')
                for s in cell.findAll(text=True)).strip()
            line.append(text)
        lines.append(line)

import csv
with open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(lines)

cell.findAll(text=True) returns a list of all text nodes inside a <td>
cell; I preprocess all \n and &nbsp; in each text node, and join them
all. lines is a list of lists (each inner list one row of cell texts),
as expected by the csv module used to write the output file.
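The same extract-cells-then-csv idea can be sketched with only the stdlib (Python 3 names; a rough equivalent of the approach, not the BeautifulSoup API):

```python
import csv
import io
from html.parser import HTMLParser  # Python 3 name for Python 2's HTMLParser


class CellTextExtractor(HTMLParser):
    """Collect the visible text of each <td>, one list per <tr>."""

    def __init__(self):
        super().__init__()   # convert_charrefs=True turns &nbsp; into '\xa0'
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td':
            self._cell = []

    def handle_endtag(self, tag):
        if tag == 'td' and self._row is not None:
            # Join the cell's text nodes and collapse whitespace,
            # like the \n / &nbsp; preprocessing above.
            self._row.append(' '.join(''.join(self._cell).split()))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


extractor = CellTextExtractor()
extractor.feed("<table><tr><td>2</td><td>All users\nAny</td></tr></table>")
buf = io.StringIO()
csv.writer(buf).writerows(extractor.rows)
```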


--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing html with Beautifulsoup

2009-12-13 Thread Johann Spies
On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote:

 this code should serve as a starting point:

Thank you very much!

 cell.findAll(text=True) returns a list of all text nodes inside a
 <td> cell; I preprocess all \n and &nbsp; in each text node, and
 join them all. lines is a list of lists (each inner list one row of
 cell texts), as expected by the csv module used to write the output file.

I have struggled a bit to find the documentation for (text=True).
Most of the BeautifulSoup documentation I saw contained some
examples without explaining what the options do.  Thanks for your
explanation. 

As far as I can see, no documentation was installed with the
Debian package.

Regards
Johann
-- 
Johann Spies  Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

 But I will hope continually, and will yet praise thee 
  more and more.  Psalms 71:14 
-- 
http://mail.python.org/mailman/listinfo/python-list


Parsing html with Beautifulsoup

2009-12-10 Thread Johann Spies
I am trying to get CSV output from an HTML file.

With this code I had a little success:
=
from BeautifulSoup import BeautifulSoup
from string import replace, join
import re

f = open("configuration.html", "r")
g = open("configuration.csv", "w")
soup = BeautifulSoup(f)
t = soup.findAll('table')
for table in t:
    rows = table.findAll('tr')
    for th in rows[0]:
        t = th.find(text=True)
        g.write(t)
        g.write(',')
    #print(','.join(t))

    for tr in rows:
        cols = tr.findAll('td')
        for td in cols:
            try:
                t = td.find(text=True).replace('&nbsp;', '')
                g.write(t)
            except:
                g.write('')
            g.write(',')
        g.write('\n')
===

producing output like this:

RULE,SOURCE,DESTINATION,SERVICES,ACTION,TRACK,TIME,INSTALL ON,COMMENTS,
1drop,Log,Any,,,
2,All us...@any,,Any,clientencrypt,Log,Any,,,
3,Any,Any,,drop,None,Any,,,
4drop,None,Any,,,
...

It left out all the non-plaintext parts of <td>...</td>.

I then tried using t.renderContents() and got something like this (one
line broken into many for the sake of this email):

1,<img src=icons/group.png>&nbsp;<a href=#OBJ_sunetint>
sunetint</A><BR>,
<img src=icons/gateway_cluster.png>&nbsp;<a href=#OBJ_Rainwall_Cluster>
Rainwall_Cluster</A> <BR>,
<img src=icons/udp.png>&nbsp;<a href=#SVC_IKE>IKE</a><br>,
<img src=icons/drop.png>&nbsp;drop,
<img src=icons/log.png>&nbsp;Log&nbsp;,
<img src=icons/any.png>&nbsp;Any<br>&nbsp;,
<img src=icons/gateway_cluster.png>&nbsp;<a href=#OBJ_Rainwall_Cluster>
Rainwall_Cluster</A> <BR>&nbsp;,&nbsp;

How do I get Beautifulsoup to render (taking the above line as
example)

sunetint for <img src=icons/group.png>&nbsp;<a
href=#OBJ_sunetint>sunetint</A><BR>

and still provide the text-parts in the td's with plain text?

I have experimented a little bit with regular expressions, but could
so far not find a solution.
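For what it's worth, a regular-expression sketch of pulling the visible text out of one such cell (assuming tags and entities shaped like the fragment above; a real page may need more care):

```python
import re

cell = '<img src=icons/group.png>&nbsp;<a href=#OBJ_sunetint>sunetint</A><BR>'

# Strip tags first, then replace entities with spaces, then trim.
text = re.sub(r'<[^>]+>', '', cell)
text = re.sub(r'&[A-Za-z]+;', ' ', text).strip()
```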

Regards
Johann
-- 
Johann Spies  Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

 Lo, children are an heritage of the LORD: and the  
  fruit of the womb is his reward.Psalms 127:3 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing html with Beautifulsoup

2009-12-10 Thread Gabriel Genellina
On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies jsp...@sun.ac.za
wrote:



How do I get Beautifulsoup to render (taking the above line as
example)

sunetint for <img src=icons/group.png>&nbsp;<a
href=#OBJ_sunetint>sunetint</A><BR>

and still provide the text-parts in the td's with plain text?


Hard to tell if we don't see what's inside those td's - please provide  
at least a few rows of the original HTML table.


--
Gabriel Genellina

--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing html with Beautifulsoup

2009-12-10 Thread Johann Spies

Gabriel Genellina wrote:
On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies jsp...@sun.ac.za
wrote:



How do I get Beautifulsoup to render (taking the above line as
example)

sunetint for <img src=icons/group.png>&nbsp;<a
href=#OBJ_sunetint>sunetint</A><BR>

and still provide the text-parts in the td's with plain text?


Hard to tell if we don't see what's inside those td's - please 
provide at least a few rows of the original HTML table.


Thanks for your reply. 


Here are a few lines:

<!--- Rule 1 --->
<tr style="background-color: #ff"><td class=normal>2</td><td><img
src=icons/usrgroup.png>&nbsp;All us...@any<br><td><im$
</td><td><img src=icons/any.png>&nbsp;Any<br></td><td><img
src=icons/clientencrypt.png>&nbsp;clientencrypt</td><td><img src$

&nbsp;</td><td>&nbsp;</td></tr>

<!--- Rule 2 --->
<tr style="background-color: #ee"><td class=normal>3</td><td><img
src=icons/any.png>&nbsp;Any<br><td><img src=icons/any$

&nbsp;</td><td>&nbsp;</td></tr>

<!--- Rule 3 --->
<tr style="background-color: #ff"><td class=normal>4</td><td><img
src=icons/group.png>&nbsp;<a href=#OBJ_Rainwall_Group$
<td><img src=icons/group.png>&nbsp;<a href=#OBJ_Rainwall_Group>
Rainwall_Group</A> <BR>
</td><td><img src=icons/udp.png>&nbsp;<a href=#SVC_RainWall_Stop>
RainWall_Stop</a><br></td><td><img src=icons/drop.png>&nb$

&nbsp;</td><td>&nbsp;</td></tr>

<!--- Rule 4 --->
<tr style="background-color: #ee"><td class=normal>5</td><td><img
src=icons/host.png>&nbsp;<a href=#OBJ_Rainwall_Broadc$
<img src=icons/group.png>&nbsp;<a href=#OBJ_Rainwall_Group>
Rainwall_Group</A> <BR>
<td><img src=icons/group.png>&nbsp;<a href=#OBJ_Rainwall_Group>
Rainwall_Group</A> <BR>
<img src=icons/host.png>&nbsp;<a href=#OBJ_Rainwall_Broadcast>
Rainwall_Broadcast</A> <BR>
</td><td><img src=icons/udp.png>&nbsp;<a href=#SVC_RainWall_Daemon>
RainWall_Daemon</a><br></td><td><img src=icons/accept.p$

&nbsp;</td><td>&nbsp;</td></tr>

Regards
Johann

--
Johann Spies  Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

Lo, children are an heritage of the LORD: and the  
 fruit of the womb is his reward.Psalms 127:3 



--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML?

2008-04-26 Thread Benjamin
On Apr 3, 9:10 pm, 7stud [EMAIL PROTECTED] wrote:
 On Apr 3, 12:39 am, [EMAIL PROTECTED] wrote:

  BeautifulSoup does what I need it to.  Though, I was hoping to find
  something that would let me work with the DOM the way JavaScript can
  work with web browsers' implementations of the DOM.  Specifically, I'd
  like to be able to access the innerHTML element of a DOM element.
  Python's built-in HTMLParser is SAX-based, so I don't want to use
  that, and the minidom doesn't appear to implement this part of the
  DOM.

 innerHTML has never been part of the DOM.  It is, however, a de facto
 browser standard.  That's probably why you aren't having any luck
 using a Python module that implements the DOM.

That makes sense.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML?

2008-04-26 Thread Benjamin
On Apr 6, 11:03 pm, Stefan Behnel [EMAIL PROTECTED] wrote:
 Benjamin wrote:
  I'm trying to parse an HTML file.  I want to retrieve all of the text
  inside a certain tag that I find with XPath.  The DOM seems to make
  this available with the innerHTML element, but I haven't found a way
  to do it in Python.

     import lxml.html as h
     tree = h.parse("somefile.html")
     text = tree.xpath("string( some/[EMAIL PROTECTED] )")

 http://codespeak.net/lxml

 Stefan

I actually had trouble getting this to work.  I guess only newer versions
of lxml have the html module, and I couldn't get it installed.  lxml
does look pretty cool, though.
--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML?

2008-04-26 Thread Stefan Behnel
Benjamin wrote:
 On Apr 6, 11:03 pm, Stefan Behnel [EMAIL PROTECTED] wrote:
 Benjamin wrote:
 I'm trying to parse an HTML file.  I want to retrieve all of the text
 inside a certain tag that I find with XPath.  The DOM seems to make
 this available with the innerHTML element, but I haven't found a way
 to do it in Python.
 import lxml.html as h
 tree = h.parse("somefile.html")
 text = tree.xpath("string( some/[EMAIL PROTECTED] )")

 http://codespeak.net/lxml

 Stefan
 
 I actually had trouble getting this to work.  I guess only newer versions
 of lxml have the html module, and I couldn't get it installed.  lxml
 does look pretty cool, though.

Yes, the above code requires lxml 2.x. However, older versions should allow
you to do this:

 import lxml.etree as et
 parser = et.HTMLParser()
 tree = et.parse("somefile.html", parser)
 text = tree.xpath("string( some/[EMAIL PROTECTED] )")

lxml.html is just a dedicated package that makes HTML handling beautiful. It's
not required for parsing HTML and doing general XML stuff with it.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML?

2008-04-07 Thread Stefan Behnel
Benjamin wrote:
 I'm trying to parse an HTML file.  I want to retrieve all of the text
 inside a certain tag that I find with XPath.  The DOM seems to make
 this available with the innerHTML element, but I haven't found a way
 to do it in Python.

import lxml.html as h
tree = h.parse("somefile.html")
text = tree.xpath("string( some/[EMAIL PROTECTED] )")

http://codespeak.net/lxml

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML?

2008-04-03 Thread benash
BeautifulSoup does what I need it to.  Though, I was hoping to find
something that would let me work with the DOM the way JavaScript can
work with web browsers' implementations of the DOM.  Specifically, I'd
like to be able to access the innerHTML element of a DOM element.
Python's built-in HTMLParser is SAX-based, so I don't want to use
that, and the minidom doesn't appear to implement this part of the
DOM.

On Wed, Apr 2, 2008 at 10:37 PM, Daniel Fetchinson
[EMAIL PROTECTED] wrote:
  I'm trying to parse an HTML file.  I want to retrieve all of the text
   inside a certain tag that I find with XPath.  The DOM seems to make
   this available with the innerHTML element, but I haven't found a way
   to do it in Python.

  Have you tried http://www.google.com/search?q=python+html+parser ?

  HTH,
  Daniel

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML?

2008-04-03 Thread Paul Boddie
On 3 Apr, 06:59, Benjamin [EMAIL PROTECTED] wrote:
 I'm trying to parse an HTML file.  I want to retrieve all of the text
 inside a certain tag that I find with XPath.  The DOM seems to make
 this available with the innerHTML element, but I haven't found a way
 to do it in Python.

With libxml2dom you'd do the following:

 1. Parse the file using libxml2dom.parse with html set to a true
    value.
 2. Use the xpath method on the document to select the desired
    element.
 3. Use the toString method on the element to get the text of the
    element (including start and end tags), or the textContent
    property to get the text between the tags.

See the Package Index page for more details:

  http://www.python.org/pypi/libxml2dom

Paul
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML?

2008-04-03 Thread Larry Bates

On Wed, 2008-04-02 at 21:59 -0700, Benjamin wrote:
 I'm trying to parse an HTML file.  I want to retrieve all of the text
 inside a certain tag that I find with XPath.  The DOM seems to make
 this available with the innerHTML element, but I haven't found a way
 to do it in Python.

I use ElementTree (built into Python 2.5) for this type of XML query.

-Larry

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML?

2008-04-03 Thread 7stud
On Apr 3, 12:39 am, [EMAIL PROTECTED] wrote:
 BeautifulSoup does what I need it to.  Though, I was hoping to find
 something that would let me work with the DOM the way JavaScript can
 work with web browsers' implementations of the DOM.  Specifically, I'd
 like to be able to access the innerHTML element of a DOM element.
 Python's built-in HTMLParser is SAX-based, so I don't want to use
 that, and the minidom doesn't appear to implement this part of the
 DOM.


innerHTML has never been part of the DOM.  It is, however, a de facto
browser standard.  That's probably why you aren't having any luck
using a Python module that implements the DOM.
-- 
http://mail.python.org/mailman/listinfo/python-list


Parsing HTML?

2008-04-02 Thread Benjamin
I'm trying to parse an HTML file.  I want to retrieve all of the text
inside a certain tag that I find with XPath.  The DOM seems to make
this available with the innerHTML element, but I haven't found a way
to do it in Python.
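For well-formed input, something innerHTML-like can be faked on top of ElementTree; a sketch (XML/XHTML only, not tag-soup HTML):

```python
import xml.etree.ElementTree as ET


def inner_html(elem):
    """Rough innerHTML: the element's text plus its serialized children."""
    parts = [elem.text or '']
    for child in elem:
        # tostring() serializes the child including its trailing tail text
        parts.append(ET.tostring(child, encoding='unicode'))
    return ''.join(parts)


root = ET.fromstring('<div>Hello <b>world</b>!</div>')
```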
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML?

2008-04-02 Thread Daniel Fetchinson
 I'm trying to parse an HTML file.  I want to retrieve all of the text
 inside a certain tag that I find with XPath.  The DOM seems to make
 this available with the innerHTML element, but I haven't found a way
 to do it in Python.

Have you tried http://www.google.com/search?q=python+html+parser ?

HTH,
Daniel
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Rob Wolfe

[EMAIL PROTECTED] wrote:

 So, I'm writing this to have your opinion on what tools I should use
 to do this and what technique I should use.

Take a look at the parsing example on this page:
http://wiki.python.org/moin/SimplePrograms

--
HTH,
Rob

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Stefan Behnel
[EMAIL PROTECTED] wrote:
 I work at this company and we are re-building our website: http://caslt.org/.
 The new website will be built by an external firm (I could do it
 myself, but since I'm just the summer student worker...). Anyways, to
 help them, they first asked me to copy all the text from all the pages
 of the site (and there is a lot!) into Word documents. I found the idea
 pretty stupid, since style would have to be applied from scratch anyway:
 we want neither the old HTML code nor Microsoft Word BS code left behind.

 I proposed to take each page and make a copy with only the text, with
 class names on the textual elements (h1, h2, p, strong, em ...), and
 then define a CSS file giving them some style.

 Now, we have around 1 600 documents to work on, and I thought I could
 challenge myself a bit and automate all the dull work. I thought about
 the possibility of parsing all those pages with Python, ripping off the
 navigation bars and just keeping the text and layout tags, and then
 applying class names to specific tags. The program would also have to
 remove the table the text is located in. Another difficulty is that I
 want to be able to keep tables that are actually used for tabular data
 and not positioning.

 So, I'm writing this to have your opinion on what tools I should use
 to do this and what technique I should use.

lxml is what you're looking for, especially if you're familiar with XPath.

http://codespeak.net/lxml/dev

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread sebzzz
Hi,

I work at this company and we are re-building our website: http://caslt.org/.
The new website will be built by an external firm (I could do it
myself, but since I'm just the summer student worker...). Anyways, to
help them, they first asked me to copy all the text from all the pages
of the site (and there is a lot!) into Word documents. I found the idea
pretty stupid, since style would have to be applied from scratch anyway:
we want neither the old HTML code nor Microsoft Word BS code left behind.

I proposed to take each page and make a copy with only the text, with
class names on the textual elements (h1, h2, p, strong, em ...), and
then define a CSS file giving them some style.

Now, we have around 1 600 documents to work on, and I thought I could
challenge myself a bit and automate all the dull work. I thought about
the possibility of parsing all those pages with Python, ripping off the
navigation bars and just keeping the text and layout tags, and then
applying class names to specific tags. The program would also have to
remove the table the text is located in. Another difficulty is that I
want to be able to keep tables that are actually used for tabular data
and not positioning.

So, I'm writing this to have your opinion on what tools I should use
to do this and what technique I should use.
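The class-stamping step alone can be sketched with the stdlib parser (Python 3 names; CLASS_MAP is a made-up mapping, and this ignores the separate nav-stripping and table questions):

```python
from html.parser import HTMLParser

# Hypothetical tag-to-class mapping; the real one would come from the design.
CLASS_MAP = {'h1': 'title', 'p': 'body-text', 'em': 'emphasis'}


class ClassAnnotator(HTMLParser):
    """Re-emit HTML, stamping a class attribute onto selected text tags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        attrs = [(k, v) for k, v in attrs if k != 'class']
        if tag in CLASS_MAP:
            attrs.append(('class', CLASS_MAP[tag]))
        self.out.append('<%s%s>' % (tag, ''.join(' %s="%s"' % a for a in attrs)))

    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&%s;' % name)   # pass entities through untouched


annotator = ClassAnnotator()
annotator.feed('<h1>Welcome</h1><p>Some text</p>')
result = ''.join(annotator.out)
```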

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Neil Cerutti
On 2007-06-18, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 I work at this company and we are re-building our website: http://caslt.org/.
 The new website will be built by an external firm (I could do it
 myself, but since I'm just the summer student worker...). Anyways, to
 help them, they first asked me to copy all the text from all the pages
 of the site (and there is a lot!) into Word documents. I found the idea
 pretty stupid, since style would have to be applied from scratch anyway:
 we want neither the old HTML code nor Microsoft Word BS code left behind.

 I proposed to take each page and make a copy with only the text, with
 class names on the textual elements (h1, h2, p, strong, em ...), and
 then define a CSS file giving them some style.

 Now, we have around 1 600 documents to work on, and I thought I could
 challenge myself a bit and automate all the dull work. I thought about
 the possibility of parsing all those pages with Python, ripping off the
 navigation bars and just keeping the text and layout tags, and then
 applying class names to specific tags. The program would also have to
 remove the table the text is located in. Another difficulty is that I
 want to be able to keep tables that are actually used for tabular data
 and not positioning.

 So, I'm writing this to have your opinion on what tools I should use
 to do this and what technique I should use.

You could get good results, and save yourself some effort, using
links or lynx with the command line options to dump page text to
a file. Python would still be needed to automate calling links or
lynx on all your documents.

-- 
Neil Cerutti
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Jay Loden

Neil Cerutti wrote:
 You could get good results, and save yourself some effort, using
 links or lynx with the command line options to dump page text to
 a file. Python would still be needed to automate calling links or
 lynx on all your documents.

OP was looking for a way to parse out part of the file and apply classes to 
certain types of tags. Using lynx/links wouldn't help, since the output of 
links or lynx is going to end up as plain text and the desire isn't to strip 
all the formatting. 

Someone else mentioned lxml but as I understand it lxml will only work if it's 
valid XHTML that they're working with. Assuming it's not (since real-world HTML 
almost never is), perhaps BeautifulSoup will fare better. 

http://www.crummy.com/software/BeautifulSoup/documentation.html

-Jay
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Stefan Behnel
Jay Loden wrote:
 Someone else mentioned lxml but as I understand it lxml will only work if
 it's valid XHTML that they're working with.

No, it was meant as the OP requested. It even has a very good parser from
broken HTML.

http://codespeak.net/lxml/dev/parsing.html#parsing-html

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Jay Loden

Stefan Behnel wrote:
 Jay Loden wrote:
 Someone else mentioned lxml but as I understand it lxml will only work if
 it's valid XHTML that they're working with.
 
 No, it was meant as the OP requested. It even has a very good parser from
 broken HTML.
 
 http://codespeak.net/lxml/dev/parsing.html#parsing-html

I stand corrected, I missed that whole part of the LXML documentation :-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread sebzzz
I see there are a couple of tools I could use, and I also heard of
sgmllib and htmllib. So now there is lxml, BeautifulSoup, sgmllib,
htmllib ...

Is there any of those tools that does the job I need more easily,
and which should I use? Maybe a combination of those tools; which one
is better for which part of the work?

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Stefan Behnel
[EMAIL PROTECTED] wrote:
 I see there are a couple of tools I could use, and I also heard of
 sgmllib and htmllib. So now there is lxml, BeautifulSoup, sgmllib,
 htmllib ...

 Is there any of those tools that does the job I need more easily,
 and which should I use? Maybe a combination of those tools; which one
 is better for which part of the work?

Well, as I said, use lxml. It's fast, pythonically easy to use, extremely
powerful and extensible. Apart from being the main author :), I actually use
it for lots of tiny things more or less like what you're off to. It's just
plain great for a quick script that gets you from A to B for a bag of documents.

Parse it in with HTML parser (even from URLs), then use XPath to extract
(exactly) what you want and then work on it as you wish. That's short and
simple in lxml.

http://codespeak.net/lxml/dev/tutorial.html
http://codespeak.net/lxml/dev/parsing.html#parsing-html
http://codespeak.net/lxml/dev/xpathxslt.html#xpath

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Parsing HTML/XML documents

2007-04-26 Thread [EMAIL PROTECTED]
I need to parse real-world HTML/XML documents and I found two nice Python
solutions: BeautifulSoup and Tidy.

However I found pyXPCOM that is a wrapper for Gecko. So I was thinking
Gecko surely handles bad html in a more consistent and error-proof way
than BS and Tidy.

I'm interested in using Mozilla DOM from inside a Python script, however
I'm a bit confused about how can I use pyXPCOM to accomplish this job.

Any suggestions?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML/XML documents

2007-04-26 Thread Stefan Behnel
[EMAIL PROTECTED] wrote:
 I need to parse real-world HTML/XML documents and I found two nice Python
 solutions: BeautifulSoup and Tidy.

There's also lxml, in case you want a real XML tool.
http://codespeak.net/lxml/
http://codespeak.net/lxml/dev/parsing.html#parsers


 However I found pyXPCOM that is a wrapper for Gecko. So I was thinking
 Gecko surely handles bad html in a more consistent and error-proof way
 than BS and Tidy.
 
 I'm interested in using Mozilla DOM from inside a Python script, however
 I'm a bit confused about how can I use pyXPCOM to accomplish this job.

I've never used it, but I doubt Gecko would yield substantially better results
than any of the three above. You're dealing with broken data here, so it just
depends on your input which one of them wins.

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML/XML documents

2007-04-26 Thread Max M
Stefan Behnel skrev:
 [EMAIL PROTECTED] wrote:
 I need to parse real-world HTML/XML documents and I found two nice Python
 solutions: BeautifulSoup and Tidy.
 
 There's also lxml, in case you want a real XML tool.
 http://codespeak.net/lxml/
 http://codespeak.net/lxml/dev/parsing.html#parsers

I have used both BeautifulSoup and lxml. They are both good tools.

lxml is blindingly fast compared to BeautifulSoup though.

A simple tool for importing contact information from 6000 xml files of 
23 MBytes into Zope runs in about 30 seconds. No optimisations at all. 
Just inefficient xpath expressions.

That is pretty good in my book.

-- 

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML

2007-02-23 Thread sofeng
On Feb 8, 11:43 am, metaperl [EMAIL PROTECTED] wrote:
 On Feb 8, 2:38 pm, mtuller [EMAIL PROTECTED] wrote:

  I am trying to parse a webpage and extract information.

 BeautifulSoup is a great Python module for this purpose:

http://www.crummy.com/software/BeautifulSoup/

 Here's an article on screen scraping using it:

http://iwiwdsmi.blogspot.com/2007/01/how-to-use-python-and-beautiful-...

This article has moved to 
http://iwiwdsmp.blogspot.com/2007/02/how-to-use-python-and-beautiful-soup-to.html

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML

2007-02-23 Thread John Nagle
BeautifulSoup does parse HTML well, but there are a few issues:

1.  It's rather slow; it can take seconds of CPU time to parse
some larger web pages.

2.  There's no error reporting.  It tries to do the right thing,
but when it doesn't, you have no idea what went wrong.

BeautifulSoup would be a good test case for the PyPy crowd to
work on.  It really needs the speedup.

John Nagle

sofeng wrote:
 On Feb 8, 11:43 am, metaperl [EMAIL PROTECTED] wrote:
On Feb 8, 2:38 pm, mtuller [EMAIL PROTECTED] wrote:
I am trying to parse a webpage and extract information.
BeautifulSoup is a great Python module for this purpose:
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML

2007-02-14 Thread Frederic Rentsch
mtuller wrote:
 Alright. I have tried everything I can find, but am not getting
 anywhere. I have a web page that has data like this:

 <tr>
 <td headers="col1_1" style="width:21%">
 <span class="hpPageText">LETTER</span></td>
 <td headers="col2_1" style="width:13%; text-align:right">
 <span class="hpPageText">33,699</span></td>
 <td headers="col3_1" style="width:13%; text-align:right">
 <span class="hpPageText">1.0</span></td>
 <td headers="col4_1" style="width:13%; text-align:right">
 </tr>

 What is show is only a small section.

 I want to extract the 33,699 (which is dynamic) and set the value to a
 variable so that I can insert it into a database. I have tried parsing
 the html with pyparsing, and the examples will get it to print all
 instances with span, of which there are a hundred or so when I use:

 for srvrtokens in printCount.searchString(printerListHTML):
   print srvrtokens

 If I set the last line to srvrtokens[3] I get the values, but I don't
 know how to grab a single line and then set that as a variable.

 I have also tried Beautiful Soup, but had trouble understanding the
 documentation, and HTMLParser doesn't seem to do what I want. Can
 someone point me to a tutorial or give me some pointers on how to
 parse html where there are multiple lines with the same tags and then
 be able to go to a certain line and grab a value and set a variable's
 value to that?


 Thanks,

 Mike

   
Posted problems rarely provide exhaustive information. It's just not 
possible. I have been taking shots in the dark of late suggesting a 
stream-editing approach to extracting data from htm files. The 
mainstream approach is to use a parser (beautiful soup or pyparsing).
  Often times nothing more is attempted than the location and 
extraction of some text irrespective of page layout. This can sometimes 
be done with a simple regular expression, or with a stream editor if a 
regular expression gets too unwieldy. The advantage of the stream editor 
over a parser is that it doesn't mobilize an arsenal of unneeded 
functionality and therefore tends to be easier, faster and shorter to 
implement. The editor's inability to understand structure isn't a 
shortcoming when structure doesn't matter and can even be an advantage 
in the presence of malformed input that sends a parser on a tough and 
potentially hazardous mission for no purpose at all.
  SE doesn't impose the study of massive documentation, nor the 
memorization of dozens of classes, methods and what not. The following 
four lines would solve the OP's problem (provided the post really is all 
there is to the problem):


  import re, SE    # http://cheeseshop.python.org/pypi/SE/2.3

  Filter = SE.SE ('EAT ~(?i)col[0-9]_[0-9](.|\n)*?</td>~==SOME SPLIT MARK')

  r = re.compile ('(?i)(col[0-9]_[0-9])(.|\n)*?([0-9,]+)</span>')

  for line in Filter (s).split ('SOME SPLIT MARK'):
      print r.search (line).group (1, 3)

('col2_1', '33,699')
('col3_1', '0')
('col4_1', '7,428')


---

Input:

  s = '''
<td headers="col1_1" style="width:21%">
<span class="hpPageText">LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right">
<span class="hpPageText">33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right">
<span class="hpPageText">1.0</span></td>
<td headers="col5_1" style="width:13%; text-align:right">
<span class="hppagetext">7,428</span></td>
</tr>'''

The SE object handles file input too:

  for line in Filter ('file_name', '').split ('SOME SPLIT MARK'):  # '' commands string output
      print r.search (line).group (1, 3)





-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Parsing HTML

2007-02-11 Thread Samuel Karl Peterson
mtuller [EMAIL PROTECTED] on 10 Feb 2007 15:03:36 -0800 didst
step forth and proclaim thus:

 Alright. I have tried everything I can find, but am not getting
 anywhere. I have a web page that has data like this:

[snip]

 What is show is only a small section.
 
 I want to extract the 33,699 (which is dynamic) and set the value to a
 variable so that I can insert it into a database.

[snip]

 I have also tried Beautiful Soup, but had trouble understanding the
 documentation.


from BeautifulSoup import BeautifulSoup as parser

soup = parser("""<tr>
<td headers="col1_1" style="width:21%">
<span class="hpPageText">LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right">
<span class="hpPageText">33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right">
<span class="hpPageText">1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right">
</tr>""")

value = \
    int(soup.find('td', headers='col2_1').span.contents[0].replace(',', ''))


 Thanks,

 Mike

Hope that helped.  This code assumes there aren't any td tags with
headers="col2_1" that come before the value you are trying to extract.
There's several ways to do things in BeautifulSoup.  You should play
around with BeautifulSoup in the interactive prompt.  It's simply
awesome if you don't need speed on your side.
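For anyone without Beautiful Soup handy, roughly the same extraction can be sketched with nothing but the stdlib parser discussed elsewhere in this thread (assumptions: the headers="col2_1" cell is the one wanted and the markup is close to the sample above; the try/except import just keeps the snippet runnable on both 2.x and 3.x):

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class CellGrabber(HTMLParser):
    """Collect the text inside the td cell whose headers attribute matches."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_target_td = False
        self.values = []
    def handle_starttag(self, tag, attrs):
        if tag == 'td' and dict(attrs).get('headers') == 'col2_1':
            self.in_target_td = True
    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_target_td = False
    def handle_data(self, data):
        if self.in_target_td and data.strip():
            self.values.append(data.strip())

p = CellGrabber()
p.feed('<td headers="col2_1"><span class="hpPageText">33,699</span></td>')
p.close()
value = int(p.values[0].replace(',', ''))
print(value)  # 33699
```

Same caveat as above: it grabs every col2_1 cell in document order, so check p.values if the page has more than one.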

-- 
Sam Peterson
skpeterson At nospam ucdavis.edu
if programmers were paid to remove code instead of adding it,
software would be much better -- unknown


Re: Parsing HTML

2007-02-11 Thread Paul McGuire
On Feb 10, 5:03 pm, mtuller [EMAIL PROTECTED] wrote:
 Alright. I have tried everything I can find, but am not getting
 anywhere. I have a web page that has data like this:

 <tr>
 <td headers="col1_1" style="width:21%">
 <span class="hpPageText">LETTER</span></td>
 <td headers="col2_1" style="width:13%; text-align:right">
 <span class="hpPageText">33,699</span></td>
 <td headers="col3_1" style="width:13%; text-align:right">
 <span class="hpPageText">1.0</span></td>
 <td headers="col4_1" style="width:13%; text-align:right">
 </tr>

 What is shown is only a small section.

 I want to extract the 33,699 (which is dynamic) and set the value to a
 variable so that I can insert it into a database. I have tried parsing
 the html with pyparsing, and the examples will get it to print all
 instances with span, of which there are a hundred or so when I use:

 for srvrtokens in printCount.searchString(printerListHTML):
 print srvrtokens

 If I set the last line to srvrtokens[3] I get the values, but I don't
 know how to grab a single line and then set it as a variable.


So what you are saying is that you need to make your pattern more
specific.  So I suggest adding these items to your matching pattern:
- only match <span> if inside a <td> with attribute headers="col2_1"
- only match if the <span> body is an integer (with an optional comma
separator for thousands)

This grammar adds these more specific tests for matching the input
HTML (note also the use of results names to make it easy to extract
the integer number, and a parse action added to integer to convert the
'33,699' string to the integer 33699).

-- Paul


htmlSource = """<tr>
<td headers="col1_1" style="width:21%">
<span class="hpPageText">LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right">
<span class="hpPageText">33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right">
<span class="hpPageText">1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right">
</tr>"""

from pyparsing import makeHTMLTags, Word, nums, ParseException

tdStart, tdEnd = makeHTMLTags('td')
spanStart, spanEnd = makeHTMLTags('span')

def onlyAcceptWithTagAttr(attrname, attrval):
    def action(tagAttrs):
        if not (attrname in tagAttrs and tagAttrs[attrname] == attrval):
            raise ParseException("", 0, "")
    return action

tdStart.setParseAction(onlyAcceptWithTagAttr("headers", "col2_1"))
spanStart.setParseAction(onlyAcceptWithTagAttr("class", "hpPageText"))

integer = Word(nums, nums + ',')
integer.setParseAction(lambda t: int(''.join(c for c in t[0] if c != ',')))

patt = tdStart + spanStart + integer.setResultsName("intValue") + \
       spanEnd + tdEnd

for matches in patt.searchString(htmlSource):
    print matches.intValue

prints:
33699




Parsing HTML

2007-02-10 Thread mtuller
Alright. I have tried everything I can find, but am not getting
anywhere. I have a web page that has data like this:

<tr>
<td headers="col1_1" style="width:21%">
<span class="hpPageText">LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right">
<span class="hpPageText">33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right">
<span class="hpPageText">1.0</span></td>
<td headers="col4_1" style="width:13%; text-align:right">
</tr>

What is shown is only a small section.

I want to extract the 33,699 (which is dynamic) and set the value to a
variable so that I can insert it into a database. I have tried parsing
the html with pyparsing, and the examples will get it to print all
instances with span, of which there are a hundred or so when I use:

for srvrtokens in printCount.searchString(printerListHTML):
print srvrtokens

If I set the last line to srvrtokens[3] I get the values, but I don't
know how to grab a single line and then set it as a variable.

I have also tried Beautiful Soup, but had trouble understanding the
documentation, and HTMLParser doesn't seem to do what I want. Can
someone point me to a tutorial or give me some pointers on how to
parse html where there are multiple lines with the same tags and then
be able to go to a certain line and grab a value and set a variable's
value to that?


Thanks,

Mike



Parsing HTML

2007-02-08 Thread mtuller
I am trying to parse a webpage and extract information. I am trying to
use pyparsing. Here is what I have:

from pyparsing import *
import urllib

# define basic text pattern
spanStart = Literal('<span class=\"hpPageText\">')

spanEnd = Literal('</span></td>')

printCount = spanStart + SkipTo(spanEnd) + spanEnd

# get printer addresses
printerURL = "http://printer.mydomain.com/hp/device/this.LCDispatcher?nav=hp.Usage"
printerListPage = urllib.urlopen(printerURL)
printerListHTML = printerListPage.read()
printerListPage.close()

for srvrtokens, startloc, endloc in printCount.scanString(printerListHTML):
    print srvrtokens

print printCount


I have the last print statement to check what is being sent because I
am getting nothing back. What it sends is:
{"<span class="hpPageText">" SkipTo:("</span></td>") "</span></td>"}

If I pull out the "hpPageText" I get results back, but more than what
I want. I know it has something to do with escaping the quotation
marks, but I am puzzled as to how to do it.


Thanks,

Mike



Re: Parsing HTML

2007-02-08 Thread metaperl
On Feb 8, 2:38 pm, mtuller [EMAIL PROTECTED] wrote:
 I am trying to parse a webpage and extract information.

BeautifulSoup is a great Python module for this purpose:

http://www.crummy.com/software/BeautifulSoup/

Here's an article on screen scraping using it:


http://iwiwdsmi.blogspot.com/2007/01/how-to-use-python-and-beautiful-soup-to.html



Re: Parsing HTML

2007-02-08 Thread mtuller
I was asking how to escape the quotation marks. I have everything
working in pyparsing except for that. I don't want to drop everything
and go to a different parser.

Can someone else help?




  I am trying to parse a webpage and extract information.

 BeautifulSoup is a great Python module for this purpose:

http://www.crummy.com/software/BeautifulSoup/

 Here's an article on screen scraping using it:

http://iwiwdsmi.blogspot.com/2007/01/how-to-use-python-and-beautiful-...




Re: Parsing HTML

2007-02-08 Thread Paul McGuire
On Feb 8, 4:15 pm, mtuller [EMAIL PROTECTED] wrote:
 I was asking how to escape the quotation marks. I have everything
 working in pyparser except for that. I don't want to drop everything
 and go to a different parser.

 Can someone else help?


Mike -

pyparsing includes a helper for constructing HTML tags called
makeHTMLTags.  This method does more than just wrap the given tag text
within <>'s; it also comprehends attributes, upper/lower case, and
various styles of quoted strings.  To use it, replace your Literal
definitions for spanStart and spanEnd with:

spanStart, spanEnd = makeHTMLTags('span')

If you don't want to match just *any* span tag, but say, you only
want those with class="hpPageText", then add this parse action
to spanStart:

def onlyAcceptWithTagAttr(attrname, attrval):
    def action(tagAttrs):
        if not (attrname in tagAttrs and tagAttrs[attrname] == attrval):
            raise ParseException("", 0, "")
    return action

spanStart.setParseAction(onlyAcceptWithTagAttr("class", "hpPageText"))


-- Paul




Re: Regular Expression help for parsing html tables

2006-10-29 Thread Odalrick

[EMAIL PROTECTED] skrev:

 Hello,

 I am having some difficulty creating a regular expression for the
 following string situation in html. I want to find a table that has
 specific text in it and then extract the html just for that immediate
 table.

 the string would look something like this:

 ...stuff here...
 <table>
 ...stuff here...
 <table>
 ...stuff here...
 <table>
 ...
 text i'm searching for
 ...
 </table>
 ...stuff here...
 </table>
 ...stuff here...
 </table>
 ...stuff here...


 My question:  is there a way in RE to say:   when I find this text I'm
 looking for, search backwards and find the immediate instance of the
 string <table>, and then search forwards and find the immediate
 instance of the string </table>?

 any help is appreciated.

 Steve.

It would have been easier if you'd said what the text you are looking
for is, but I think:

regex = re.compile( r'<table>(.*?text you are looking for.*?)</table>',
re.DOTALL )
match = regex.search( html_string )
found_table = match.group( 1 )

would work.
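One caveat worth a quick sketch: re.search still begins the match at the *first* `<table>` in the string, so with nested tables (as in the original question) the captured group spans the outer tables too, not just the immediate one:

```python
import re

# Hypothetical nested input in the shape of the original question.
html_string = ("...stuff here...<table>A<table>B<table>C "
               "text i'm searching for D</table>E</table>F</table>...")

regex = re.compile(r"<table>(.*?text i'm searching for.*?)</table>", re.DOTALL)
found_table = regex.search(html_string).group(1)

# The capture runs from the FIRST <table>, straight through the nested opens:
print(found_table)  # A<table>B<table>C text i'm searching for D
```

So the regex works on flat markup, but getting only the innermost table takes extra work (e.g. searching from the last `<table>` before the text).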

/Odalrick



Re: Regular Expression help for parsing html tables

2006-10-29 Thread Paddy

[EMAIL PROTECTED] wrote:
 Hello,

 I am having some difficulty creating a regular expression for the
 following string situation in html. I want to find a table that has
 specific text in it and then extract the html just for that immediate
 table.

 the string would look something like this:

 ...stuff here...
 <table>
 ...stuff here...
 <table>
 ...stuff here...
 <table>
 ...
 text i'm searching for
 ...
 </table>
 ...stuff here...
 </table>
 ...stuff here...
 </table>
 ...stuff here...


 My question:  is there a way in RE to say:   when I find this text I'm
 looking for, search backwards and find the immediate instance of the
 string <table>, and then search forwards and find the immediate
 instance of the string </table>?

 any help is appreciated.

 Steve.

Might searching the output of BeautifulSoup(html).prettify() make
things easier?

http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing%20HTML

- Paddy



Regular Expression help for parsing html tables

2006-10-28 Thread steve551979
Hello,

I am having some difficulty creating a regular expression for the
following string situation in html. I want to find a table that has
specific text in it and then extract the html just for that immediate
table.

the string would look something like this:

...stuff here...
<table>
...stuff here...
<table>
...stuff here...
<table>
...
text i'm searching for
...
</table>
...stuff here...
</table>
...stuff here...
</table>
...stuff here...


My question:  is there a way in RE to say:   when I find this text I'm
looking for, search backwards and find the immediate instance of the
string <table>, and then search forwards and find the immediate
instance of the string </table>?

any help is appreciated.

Steve.



Re: Regular Expression help for parsing html tables

2006-10-28 Thread Stefan Behnel
Hi Steve,

[EMAIL PROTECTED] wrote:
 I am having some difficulty creating a regular expression for the
 following string situation in html. I want to find a table that has
 specific text in it and then extract the html just for that immediate
 table.

Any reason why you can't use a real HTML parser and API (e.g. the one provided
by lxml)? That can really make things easier here.

http://codespeak.net/lxml/
http://codespeak.net/lxml/api.html#parsers
http://codespeak.net/lxml/api.html#trees-and-documents
http://effbot.org/zone/element-index.htm
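When the fragment is well-formed enough to parse as XML, the stdlib's xml.etree offers much the same tree API (a rough sketch in the Python 3 spelling; lxml's API is a superset of this, so the same idea carries over):

```python
import xml.etree.ElementTree as ET

html = ("<html><body>"
        "<table id='outer'><tr><td>"
        "<table id='inner'><tr><td>text i'm searching for</td></tr></table>"
        "</td></tr></table>"
        "</body></html>")

root = ET.fromstring(html)

# Visit every <table> in document order; the last one whose subtree
# contains the text is the innermost (most immediate) table around it.
target = None
for table in root.iter('table'):
    if "text i'm searching for" in ET.tostring(table, encoding='unicode'):
        target = table

print(target.get('id'))  # inner
```

Real-world HTML usually isn't well-formed XML, which is where lxml's forgiving HTML parser earns its keep.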

Stefan


Re: Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

2006-07-08 Thread Fredrik Lundh
Kenneth McDonald wrote:

 The problem I'm having with HTMLParser is simple; I don't seem to be 
 getting the actual text in the HTML document. I've implemented the 
 do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but 
 it never seems to receive any data. Is there another way to access the 
 text chunks as they come along?

the method is called handle_data:

 http://docs.python.org/lib/module-HTMLParser.html
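That is, the subclass only needs to override handle_data; a minimal sketch (the try/except import is only there so the snippet runs on both 2.x and 3.x):

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class TextCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
    def handle_data(self, data):        # handle_data, not do_data
        self.chunks.append(data)

p = TextCollector()
p.feed('<p>Hello <b>world</b></p>')
p.close()
print(p.chunks)  # ['Hello ', 'world']
```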

 HTMLParser would probably be the way to go if I can figure this out. It 
 seems much simpler than htmllib, and satisfies my requirements.
 
 htmllib will write out the text data (using the AbstractFormatter and 
 AbstractWriter), but my problem here is conceptual. I simply don't 
 understand why all of these different levels of abstractness are 
 necessary, nor how to use them.

if you're not interested in HTML *rendering*, use sgmllib instead.

 http://docs.python.org/lib/module-sgmllib.html

the only difference between the libs is that HTMLParser is a bit 
stricter; on the other hand, if you want to parse really messy HTML, you 
should probably use BeautifulSoup instead:

 http://www.crummy.com/software/BeautifulSoup/

/F



Re: Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

2006-07-08 Thread Fredrik Lundh
Fredrik Lundh wrote:

 the only difference between the libs (*) is that HTMLParser is a bit 
 stricter

*) the libs referring to htmllib and HTMLParser, not htmllib and sgmllib.

/F



Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

2006-07-07 Thread Kenneth McDonald
I'm writing a program that will parse HTML and (mostly) convert it to 
MediaWiki format. The two Python modules I'm aware of to do this are 
HTMLParser and htmllib. However, I'm currently experiencing either real 
or conceptual difficulty with both, and was wondering if I could get 
some advice.

The problem I'm having with HTMLParser is simple; I don't seem to be 
getting the actual text in the HTML document. I've implemented the 
do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but 
it never seems to receive any data. Is there another way to access the 
text chunks as they come along?

HTMLParser would probably be the way to go if I can figure this out. It 
seems much simpler than htmllib, and satisfies my requirements.

htmllib will write out the text data (using the AbstractFormatter and 
AbstractWriter), but my problem here is conceptual. I simply don't 
understand why all of these different levels of abstractness are 
necessary, nor how to use them. As an example, the html <i>text</i> 
should be converted to ''text'' (double single-quotes at each end) in my 
mediawiki markup output. This would obviously be easy to achieve if I 
simply had an html parse that called a method for each start tag, text 
chunk, and end tag. But htmllib calls the tag functions in HTMLParser, 
and then does more things with both a formatter and a writer. To me, 
both seem unnecessarily complex (though I suppose I can see the benefits 
of a writer before generators gave the opportunity to simply yield 
chunks of output to be processed by external code.) In any case, I don't 
really have a good idea of what I should do with htmllib to get my 
converted tags, and then content, and then closing converted tags, 
written out.
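For what it's worth, the simple callbacks-per-tag approach described above might be sketched like this; the WIKI tag map is purely illustrative, not a full converter:

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

# Illustrative HTML-tag-to-MediaWiki map; a real converter covers far more.
WIKI = {'i': "''", 'b': "'''"}

class WikiConverter(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.out = []
    def handle_starttag(self, tag, attrs):
        self.out.append(WIKI.get(tag, ''))
    def handle_data(self, data):
        self.out.append(data)
    def handle_endtag(self, tag):
        self.out.append(WIKI.get(tag, ''))
    def result(self):
        return ''.join(self.out)

c = WikiConverter()
c.feed("before <i>text</i> after")
c.close()
print(c.result())  # before ''text'' after
```

No formatter or writer is involved: start tag, text chunk, and end tag each land in their own method, exactly the model the post asks for.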

Please feel free to point to examples, code, etc. Probably the simplest 
solution would be a way to process text content in HTMLParser.HTMLParser.

Thanks,
Ken


Re: Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

2006-07-07 Thread wes weston
import urllib
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.TokenList = []
    def handle_data(self, data):
        data = data.strip()
        if data and len(data) > 0:
            self.TokenList.append(data)
            #print data
    def GetTokenList(self):
        return self.TokenList


try:
    url = "http://your url here.."
    f = urllib.urlopen(url)
    res = f.read()
    f.close()
except:
    print "bad read"
    raise

h = MyHTMLParser()
h.feed(res)
tokensList = h.GetTokenList()




sample code for parsing html file to get contents of td fields

2005-08-04 Thread yaffa
does anyone have sample code for parsing an html file to get the contents
of a td field to write to a mysql db?  even if you have everything but
the mysql db part I'll take it.

thanks

yaffa



Re: sample code for parsing html file to get contents of td fields

2005-08-04 Thread William Park
yaffa [EMAIL PROTECTED] wrote:
 does anyone have sample code for parsing an html file to get the contents
 of a td field to write to a mysql db?  even if you have everything but
 the mysql db part I'll take it.

I usually use Expat XML parser to extract the field.
http://home.eol.ca/~parkw/index.html#expat

Expat is everywhere.  Python has it and even Gawk has it.
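A minimal sketch of the Expat idea using the stdlib's xml.parsers.expat (this assumes the fragment is well-formed XML; the MySQL insert is left out):

```python
import xml.parsers.expat

cells = []   # text content of each td, in document order
stack = []   # currently open element names

def start_element(name, attrs):
    stack.append(name)
    if name == 'td':
        cells.append('')

def end_element(name):
    stack.pop()

def char_data(data):
    if 'td' in stack:
        cells[-1] += data   # += copes with split character-data events

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data
p.Parse('<tr><td>alpha</td><td>42</td></tr>', True)

print(cells)  # ['alpha', '42']
```

From there, each entry in cells is ready to hand to the db layer (e.g. MySQLdb) as a parameterized insert.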

-- 
William Park [EMAIL PROTECTED], Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
   http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
  http://freshmeat.net/projects/bashdiff/


Re: sample code for parsing html file to get contents of td fields

2005-08-04 Thread Bill Mill
On 4 Aug 2005 11:54:38 -0700, yaffa [EMAIL PROTECTED] wrote:
 does anyone have sample code for parsing an html file to get the contents
 of a td field to write to a mysql db?  even if you have everything but
 the mysql db part I'll take it.
 

Do you want something like this?

In [1]: x = "something <td><b>something</b> else</td> and\nanother thing <td>in
a td</td> and again else"

In [2]: import re

In [3]: r = re.compile('<td>(.*?)</td>', re.S)

In [4]: r.findall(x)
Out[4]: ['<b>something</b> else', 'in a td']

If not, you'll have to explain more clearly what you want.

Peace
Bill Mill
bill.mill at gmail.com


Re: sample code for parsing html file to get contents of td fields

2005-08-04 Thread Kent Johnson
yaffa wrote:
 does anyone have sample code for parsting an html file to get contents
 of a td field to write to a mysql db?  even if you have everything but
 the mysql db part ill take it.

http://www.crummy.com/software/BeautifulSoup/examples.html


Re: Parsing html :: output to comma delimited

2005-07-17 Thread samuels
Thanks for the replies,  I'll post here when/if I get it finally
working.

So, now I know how to extract the links from the big page, and extract
the text from the individual page.  Really what I need to find out is
how to run the script on each individual page automatically, and get the
output in comma delimited format.  Thanks for solving the two problems
though :)
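The missing driver is essentially a loop over the extracted links plus the csv module; a sketch in the Python 3 spelling, where parse_store is a hypothetical stand-in for the per-page scraping already worked out:

```python
import csv
import io

def parse_store(url):
    """Hypothetical stand-in for the per-page scraper: the real version
    would fetch the page at url with urllib and pull out the name,
    address, phone, and fax.  Hard-wired here so the sketch stays
    self-contained."""
    return ['United Rentals Inc.', '3401 Commercial Dr.', 'Anchorage',
            'AK', '995013024', '9072724425', '9072729683']

# One URL per store, as scraped from the full list page.
links = ['http://www.rentalhq.com/store.asp?id=907%2F272%2D4425']

buf = io.StringIO()          # swap in open('out.csv', 'w', newline='') for a file
writer = csv.writer(buf)
for link in links:
    writer.writerow(parse_store(link))

output = buf.getvalue()
print(output.strip())
# United Rentals Inc.,3401 Commercial Dr.,Anchorage,AK,995013024,9072724425,9072729683
```

csv.writer handles the quoting automatically if a field ever contains a comma, which is why it beats joining with ',' by hand.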

-Sam



Parsing html :: output to comma delimited

2005-07-16 Thread samuels
Hello All,

I am a total python newbie, and I need help writing a script.

This is what I want to do:

There is a list of links at http://www.rentalhq.com/fulllist.asp.  Each
link goes to a page like,
http://www.rentalhq.com/store.asp?id=907%2F272%2D4425, that contains a
company name, address, phone, and fax.  I want extract each page, parse
this information, and export it to a comma delimited text file, or tab
delimited.  The important information in each page is:

<table border="0" cellpadding="0" cellspacing="0"
style="border-collapse: collapse" bordercolor="#11" width="100%"
id="AutoNumber1">
  <tr>
<td width="100%" colspan="2">
<h2 style="text-align: center; margin-top:2; margin-bottom:2;
line-height:14px" class="title">
<font size="4">United Rentals Inc.</font>
</h2>

<h3 style="text-align: center; margin-top:4;
margin-bottom:4">3401&nbsp;Commercial&nbsp;Dr.&nbsp;
Anchorage&nbsp;AK,&nbsp;99501-3024
</h3>
<p style="text-align: center; margin-top:4; margin-bottom:4">
<a target="_blank"
href="http://maps.google.com/maps?q=3401+Commercial+Dr%2E Anchorage AK
99501-3024">
<!--<a target="_blank"
href="http://www.mapquest.com/maps/map.adp?city=Anchorage&state=AK&address=3401+Commercial+Dr.&zip=99501-3024&country=&zoom=8">-->
<img height="15" src="Scraps/Rental_Images/map.gif" width="33"
border="0"></a>
</p>
</td>
  </tr>
  <tr>
<td width="50%" valign="top">
<p style="text-align: center; line-height:100%; margin-top:0;
margin-bottom:0">&nbsp;
</p>
<p style="text-align: center; line-height: 100%; margin-top:0;
margin-bottom:0">
<b>Phone</b> - 907/272-4425<br>
 <b>Fax</b> - 907/272-9683 </p>

So from that I want output like :

United Rentals Inc.,3401 Commercial
Dr.,Anchorage,AK,995013024,9072724425,9072729683

or

United Rentals Inc. 3401 Commercial
Dr. Anchorage   AK  995013024   9072724425  9072729683


I have been messing around with Beautiful Soup
(http://www.crummy.com/software/BeautifulSoup/index.html) but haven't
gotten very far (especially because the html is so sloppy).

Any help would be really appreciated!  Just point me in the right
direction, what to use, examples...  Thanks!

-Sam



Re: Parsing html :: output to comma delimited

2005-07-16 Thread Paul McGuire
Pyparsing includes a sample program for extracting URLs from web pages.
 You should be able to adapt it to this problem.

Download pyparsing at http://pyparsing.sourceforge.net

-- Paul



Parsing HTML with JavaScript

2005-05-13 Thread mtfulmer
I am trying to extract some information from a few web pages, and I was
using the HTMLParser module. It worked fine until it got to the
javascript, at which point it gave a parse error. Is there a good way to
work around this, or should I just preparse the file to remove the
javascript manually? This is my first python program.


Re: Parsing HTML with JavaScript

2005-05-13 Thread Richard Brodie

[EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]

 I am trying to extract some information from a few web pages, and I was
 using the HTMLParser module. It worked fine until it got to the
 javascript, at which it gave a parse error.

It's fairly common for pages with Javascript to also be invalid HTML.
HTMLParser isn't an 'ignore all errors silently and guess what it's
meant to be' parser. Unless you have known good inputs it's often
best to use an alternative. Some options are discussed in Uche's article
here: http://www.xml.com/pub/a/2004/09/08/pyxml.html




Re: Parsing HTML with JavaScript

2005-05-13 Thread John J. Lee
[EMAIL PROTECTED] writes:

 I am trying to extract some information from a few web pages, and I was
 using the HTMLParser module. It worked fine until it got to the
 javascript, at which it gave a parse error. Is there a good way to work
 around this or should I just preparse the file to remove the javascript
 manually? This is my first python program. 

sgmllib is very similar to HTMLParser, but doesn't break so easily
(but sgmllib has some problems with XHTML -- swings and roundabouts).

Or, try BeautifulSoup.


John