Re: Parsing HTML with xml.etree in Python 2.7?

2015-10-05 Thread Skip Montanaro
On Mon, Oct 5, 2015 at 9:14 AM, Skip Montanaro wrote: > I wouldn't be surprised if there were some small API changes other than the > name change caused by the move into the xml package. Before I dive into a > rabbit hole and start to modify elementtidy, is there some other stdlib-only > way to pa

Parsing HTML with xml.etree in Python 2.7?

2015-10-05 Thread Skip Montanaro
Back before Fredrik Lundh's elementtree module was sucked into the Python stdlib as xml.etree, I used to use his elementtidy extension module to clean up HTML source so it could be parsed into an ElementTree object. Elementtidy hasn't been updated in about ten years, and still assumes there is a modu
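
One stdlib-only route to an ElementTree object can be sketched by feeding the events of the standard HTML parser into xml.etree.ElementTree.TreeBuilder. This is a minimal sketch, not a replacement for elementtidy: it is written for Python 3 (in Python 2.7 the parser lives in the HTMLParser module), the class name TreeHTMLParser and the VOID set are illustrative, and it makes no attempt at browser-style repair beyond closing whatever tags are still open.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TreeHTMLParser(HTMLParser):
    """Build an ElementTree from (possibly sloppy) HTML. Illustrative only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._builder = ET.TreeBuilder()
        self._open = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; ElementTree wants strings.
        self._builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID:
            self._builder.end(tag)
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray closing tag: ignore it
        # Close any unclosed children first, then the tag itself.
        while self._open:
            top = self._open.pop()
            self._builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        if self._open:  # drop text outside the root element
            self._builder.data(data)

    def close(self):
        super().close()
        while self._open:
            self._builder.end(self._open.pop())
        return self._builder.close()  # the root Element


parser = TreeHTMLParser()
parser.feed("<html><body><p>Hello<br>world<p>Second</body></html>")
root = parser.close()
```

Note that the second unclosed p ends up nested inside the first one; a tidy-style cleaner would make them siblings, which is the kind of repair this sketch leaves out.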

Re: Parsing html with Beautifulsoup

2009-12-14 Thread Gabriel Genellina
On Mon, 14 Dec 2009 03:58:34 -0300, Johann Spies wrote: On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote: cell.findAll(text=True) returns a list of all text nodes inside a cell; I preprocess all \n and &nbsp; in each text node, and join them all. lines is a list of lists (each
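
The approach described here (gather every text node in a cell, normalise newlines and non-breaking spaces, join, and build a list of rows) can be sketched without BeautifulSoup. The following is a hedged stand-in using the standard html.parser and csv modules in Python 3, not the code from the thread; the names TableRows and table_to_csv are made up for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect table cells as a list of rows (a list of lists of strings)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &nbsp; becomes "\xa0"
        self.lines = []    # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text nodes of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            joined = "".join(self._cell).replace("\xa0", " ")
            # Collapse newlines and runs of spaces into single spaces.
            self._row.append(" ".join(joined.split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.lines.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_csv(html):
    """Return the table cells found in html as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.lines)
    return out.getvalue()


sample = ("<table><tr><td>Name</td><td>sun&nbsp;<b>e</b>\ntint</td></tr>"
          "<tr><td>a, b</td><td></td></tr></table>")
result = table_to_csv(sample)
```

Letting csv.writer do the quoting means a cell that itself contains a comma still comes out as one field.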

Re: Parsing html with Beautifulsoup

2009-12-13 Thread Johann Spies
On Sun, Dec 13, 2009 at 07:58:55AM -0300, Gabriel Genellina wrote: > this code should serve as a starting point: Thank you very much! > cell.findAll(text=True) returns a list of all text nodes inside a > cell; I preprocess all \n and &nbsp; in each text node, and > join them all. lines is a list of

Re: Parsing html with Beautifulsoup

2009-12-13 Thread Gabriel Genellina
On Fri, 11 Dec 2009 04:04:38 -0300, Johann Spies wrote: Gabriel Genellina wrote: On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies wrote: How do I get Beautifulsoup to render (taking the above line as example) sunentint for  sunetint and still provide the text-parts in the

Re: Parsing html with Beautifulsoup

2009-12-10 Thread Johann Spies
Gabriel Genellina wrote: On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies wrote: How do I get Beautifulsoup to render (taking the above line as example) sunentint for  sunetint and still provide the text-parts in the 's with plain text? Hard to tell if we don't see what's inside

Re: Parsing html with Beautifulsoup

2009-12-10 Thread Gabriel Genellina
On Thu, 10 Dec 2009 06:15:19 -0300, Johann Spies wrote: How do I get Beautifulsoup to render (taking the above line as example) sunentint for  sunetint and still provide the text-parts in the 's with plain text? Hard to tell if we don't see what's inside those 's - please provide at

Parsing html with Beautifulsoup

2009-12-10 Thread Johann Spies
I am trying to get csv-output from a html-file. With this code I had a little success: = from BeautifulSoup import BeautifulSoup from string import replace, join import re f = open("configuration.html","r") g = open("configuration.csv",'w') soup = BeautifulSoup(f) t = soup

Re: Parsing HTML?

2008-04-26 Thread Stefan Behnel
ions should allow you to do this: import lxml.etree as et parser = et.HTMLParser() tree = et.parse("somefile.html", parser) text = tree.xpath("string( some/[EMAIL PROTECTED] )") lxml.html is just a dedicated package that makes HTML handling beautiful. It's not required for parsing HTML and doing general XML stuff with it. Stefan -- http://mail.python.org/mailman/listinfo/python-list

Re: Parsing HTML?

2008-04-26 Thread Benjamin
On Apr 6, 11:03 pm, Stefan Behnel <[EMAIL PROTECTED]> wrote: > Benjamin wrote: > > I'm trying to parse an HTML file.  I want to retrieve all of the text > > inside a certain tag that I find with XPath.  The DOM seems to make > > this available with the innerHTML element, but I haven't found a way >

Re: Parsing HTML?

2008-04-26 Thread Benjamin
On Apr 3, 9:10 pm, 7stud <[EMAIL PROTECTED]> wrote: > On Apr 3, 12:39 am, [EMAIL PROTECTED] wrote: > > > BeautifulSoup does what I need it to.  Though, I was hoping to find > > something that would let me work with the DOM the way JavaScript can > > work with web browsers' implementations of the DO

Re: Parsing HTML?

2008-04-06 Thread Stefan Behnel
Benjamin wrote: > I'm trying to parse an HTML file. I want to retrieve all of the text > inside a certain tag that I find with XPath. The DOM seems to make > this available with the innerHTML element, but I haven't found a way > to do it in Python. import lxml.html as h tree = h.parse("s

Re: Parsing HTML?

2008-04-03 Thread 7stud
On Apr 3, 12:39 am, [EMAIL PROTECTED] wrote: > BeautifulSoup does what I need it to.  Though, I was hoping to find > something that would let me work with the DOM the way JavaScript can > work with web browsers' implementations of the DOM.  Specifically, I'd > like to be able to access the innerHTM

Re: Parsing HTML?

2008-04-03 Thread Larry Bates
On Wed, 2008-04-02 at 21:59 -0700, Benjamin wrote: > I'm trying to parse an HTML file. I want to retrieve all of the text > inside a certain tag that I find with XPath. The DOM seems to make > this available with the innerHTML element, but I haven't found a way > to do it in Python. I use Eleme

Re: Parsing HTML?

2008-04-03 Thread Paul Boddie
On 3 Apr, 06:59, Benjamin <[EMAIL PROTECTED]> wrote: > I'm trying to parse an HTML file. I want to retrieve all of the text > inside a certain tag that I find with XPath. The DOM seems to make > this available with the innerHTML element, but I haven't found a way > to do it in Python. With libxm

Re: Parsing HTML?

2008-04-02 Thread benash
BeautifulSoup does what I need it to. Though, I was hoping to find something that would let me work with the DOM the way JavaScript can work with web browsers' implementations of the DOM. Specifically, I'd like to be able to access the innerHTML element of a DOM element. Python's built-in HTMLPar

Re: Parsing HTML?

2008-04-02 Thread Daniel Fetchinson
> I'm trying to parse an HTML file. I want to retrieve all of the text > inside a certain tag that I find with XPath. The DOM seems to make > this available with the innerHTML element, but I haven't found a way > to do it in Python. Have you tried http://www.google.com/search?q=python+html+parse

Parsing HTML?

2008-04-02 Thread Benjamin
I'm trying to parse an HTML file. I want to retrieve all of the text inside a certain tag that I find with XPath. The DOM seems to make this available with the innerHTML element, but I haven't found a way to do it in Python.
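
For markup that happens to be well-formed, the standard library alone can approximate both requests: itertext() gives all text under an element, and serialising the children gives something close to innerHTML. This is a sketch under that assumption, using xml.etree's limited XPath subset; the sample document and the helper name inner_markup are invented for illustration, and real-world tag soup needs a tolerant parser such as the ones suggested in the replies.

```python
import xml.etree.ElementTree as ET


def inner_markup(elem):
    """Approximate innerHTML: the element's leading text plus its serialised children."""
    parts = [elem.text or ""]
    # tostring() includes each child's tail text, so nothing is lost between tags.
    parts.extend(ET.tostring(child, encoding="unicode") for child in elem)
    return "".join(parts)


doc = "<html><body><div id='content'>Hello <b>bold</b> world</div></body></html>"
root = ET.fromstring(doc)

# ElementTree supports a small XPath subset, including attribute predicates.
target = root.find(".//div[@id='content']")

all_text = "".join(target.itertext())
markup = inner_markup(target)
```

Here all_text holds the plain text of the div and markup holds its contents with the inner tags preserved.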

Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Stefan Behnel
codespeak.net/lxml/dev/tutorial.html http://codespeak.net/lxml/dev/parsing.html#parsing-html http://codespeak.net/lxml/dev/xpathxslt.html#xpath Stefan

Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread sebzzz
I see there are a couple of tools I could use, and I also heard of sgmllib and htmllib. So now there is lxml, Beautiful Soup, sgmllib, htmllib ... Is there any of those tools that does the job I need to do more easily and what should I use? Maybe a combination of those tools, which one is better fo

Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Jay Loden
> > http://codespeak.net/lxml/dev/parsing.html#parsing-html I stand corrected, I missed that whole part of the LXML documentation :-)

Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Stefan Behnel
Jay Loden wrote: > Someone else mentioned lxml but as I understand it lxml will only work if > it's valid XHTML that they're working with. No, it was meant as the OP requested. It even has a very good parser for broken HTML. http://codespeak.net/lxml/dev/parsing.html#pars

Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Jay Loden
Neil Cerutti wrote: > You could get good results, and save yourself some effort, using > links or lynx with the command line options to dump page text to > a file. Python would still be needed to automate calling links or > lynx on all your documents. OP was looking for a way to parse out part of

Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Neil Cerutti
On 2007-06-18, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > I work at this company and we are re-building our website: http://caslt.org/. > The new website will be built by an external firm (I could do it > myself, but since I'm just the summer student worker...). Anyways, to > help them, they fi

Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread sebzzz
Hi, I work at this company and we are re-building our website: http://caslt.org/. The new website will be built by an external firm (I could do it myself, but since I'm just the summer student worker...). Anyways, to help them, they first asked me to copy all the text from all the pages of the sit

Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Stefan Behnel
[EMAIL PROTECTED] wrote: > I work at this company and we are re-building our website: http://caslt.org/. > The new website will be built by an external firm (I could do it > myself, but since I'm just the summer student worker...). Anyways, to > help them, they first asked me to copy all the text f

Re: Parsing HTML, extracting text and changing attributes.

2007-06-18 Thread Rob Wolfe
[EMAIL PROTECTED] wrote: > So, I'm writing this to have your opinion on what tools I should use > to do this and what technique I should use. Take a look at parsing example on this page: http://wiki.python.org/moin/SimplePrograms -- HTH, Rob -- http://mail.python.org/mailman/listinfo/python-l

Re: Parsing HTML/XML documents

2007-04-26 Thread Max M
Stefan Behnel wrote: > [EMAIL PROTECTED] wrote: >> I need to parse real world HTML/XML documents and I found two nice Python >> solutions: BeautifulSoup and Tidy. > > There's also lxml, in case you want a real XML tool. > http://codespeak.net/lxml/ > http://codespeak.net/lxml/dev/parsing.html#parse

Re: Parsing HTML/XML documents

2007-04-26 Thread Stefan Behnel
[EMAIL PROTECTED] wrote: > I need to parse real world HTML/XML documents and I found two nice Python > solutions: BeautifulSoup and Tidy. There's also lxml, in case you want a real XML tool. http://codespeak.net/lxml/ http://codespeak.net/lxml/dev/parsing.html#parsers > However I found pyXPCOM th

Parsing HTML/XML documents

2007-04-26 Thread [EMAIL PROTECTED]
I need to parse real world HTML/XML documents and I found two nice Python solutions: BeautifulSoup and Tidy. However, I found pyXPCOM, which is a wrapper for Gecko. So I was thinking Gecko surely handles bad html in a more consistent and error-proof way than BS and Tidy. I'm interested in using Mozil

Re: Parsing HTML

2007-02-23 Thread John Nagle
BeautifulSoup does parse HTML well, but there are a few issues: 1. It's rather slow; it can take seconds of CPU time to parse some larger web pages. 2. There's no error reporting. It tries to do the right thing, but when it doesn't, you have no idea what went wrong. BeautifulSoup

Re: Parsing HTML

2007-02-23 Thread sofeng
On Feb 8, 11:43 am, "metaperl" <[EMAIL PROTECTED]> wrote: > On Feb 8, 2:38 pm, "mtuller" <[EMAIL PROTECTED]> wrote: > > > I am trying to parse a webpage and extract information. > > BeautifulSoup is a great Python module for this purpose: > >http://www.crummy.com/software/BeautifulSoup/ > > Her

Re: Parsing HTML

2007-02-14 Thread Frederic Rentsch
mtuller wrote: > Alright. I have tried everything I can find, but am not getting > anywhere. I have a web page that has data like this: > > > > LETTER > > 33,699 > > 1.0 > > > > What is shown is only a small section. > > I want to extract the 33,699 (which is dynamic) and set the value to a >

Re: Parsing HTML

2007-02-11 Thread Paul McGuire
On Feb 10, 5:03 pm, "mtuller" <[EMAIL PROTECTED]> wrote: > Alright. I have tried everything I can find, but am not getting > anywhere. I have a web page that has data like this: > > > > LETTER > > 33,699 > > 1.0 > > > > What is shown is only a small section. > > I want to extract the 33,699 (w

Re: Parsing HTML

2007-02-11 Thread Samuel Karl Peterson
"mtuller" <[EMAIL PROTECTED]> on 10 Feb 2007 15:03:36 -0800 didst step forth and proclaim thus: > Alright. I have tried everything I can find, but am not getting > anywhere. I have a web page that has data like this: [snip] > What is show is only a small section. > > I want to extract the 33,69

Parsing HTML

2007-02-10 Thread mtuller
Alright. I have tried everything I can find, but am not getting anywhere. I have a web page that has data like this: LETTER 33,699 1.0 What is shown is only a small section. I want to extract the 33,699 (which is dynamic) and set the value to a variable so that I can insert it into a databa

Re: Parsing HTML

2007-02-08 Thread Paul McGuire
On Feb 8, 4:15 pm, "mtuller" <[EMAIL PROTECTED]> wrote: > I was asking how to escape the quotation marks. I have everything > working in pyparser except for that. I don't want to drop everything > and go to a different parser. > > Can someone else help? > > Mike - pyparsing includes a helper for c

Re: Parsing HTML

2007-02-08 Thread mtuller
I was asking how to escape the quotation marks. I have everything working in pyparsing except for that. I don't want to drop everything and go to a different parser. Can someone else help? > > > I am trying to parse a webpage and extract information. > > BeautifulSoup is a great Python module fo

Re: Parsing HTML

2007-02-08 Thread metaperl
On Feb 8, 2:38 pm, "mtuller" <[EMAIL PROTECTED]> wrote: > I am trying to parse a webpage and extract information. BeautifulSoup is a great Python module for this purpose: http://www.crummy.com/software/BeautifulSoup/ Here's an article on screen scraping using it: http://iwiwdsmi.blogsp

Parsing HTML

2007-02-08 Thread mtuller
I am trying to parse a webpage and extract information. I am trying to use pyparsing. Here is what I have:
    from pyparsing import *
    import urllib
    # define basic text pattern
    spanStart = Literal('')
    spanEnd = Literal('')
    printCount = spanStart + SkipTo(spanEnd) + spanEnd
    # get printer addresses
    p

Re: Regular Expression help for parsing html tables

2006-10-29 Thread Paddy
[EMAIL PROTECTED] wrote: > Hello, > > I am having some difficulty creating a regular expression for the > following string situation in html. I want to find a table that has > specific text in it and then extract the html just for that immediate > table. > > the string would look something like th

Re: Regular Expression help for parsing html tables

2006-10-29 Thread Odalrick
[EMAIL PROTECTED] wrote: > Hello, > > I am having some difficulty creating a regular expression for the > following string situation in html. I want to find a table that has > specific text in it and then extract the html just for that immediate > table. > > the string would look something like t

Re: Regular Expression help for parsing html tables

2006-10-28 Thread Stefan Behnel
Hi Steve, [EMAIL PROTECTED] wrote: > I am having some difficulty creating a regular expression for the > following string situation in html. I want to find a table that has > specific text in it and then extract the html just for that immediate > table. Any reason why you can't use a real HTML pa

Regular Expression help for parsing html tables

2006-10-28 Thread steve551979
Hello, I am having some difficulty creating a regular expression for the following string situation in html. I want to find a table that has specific text in it and then extract the html just for that immediate table. the string would look something like this: ...stuff here... ...stuff here...

Re: Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

2006-07-08 Thread Fredrik Lundh
Fredrik Lundh wrote: > the only difference between the libs (*) is that HTMLParser is a bit > stricter *) "the libs" referring to htmllib and HTMLParser, not htmllib and sgmllib.

Re: Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

2006-07-08 Thread Fredrik Lundh
Kenneth McDonald wrote: > The problem I'm having with HTMLParser is simple; I don't seem to be > getting the actual text in the HTML document. I've implemented the > do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but > it never seems to receive any data. Is there another way

Re: Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

2006-07-07 Thread wes weston
from HTMLParser import HTMLParser
class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.TokenList = []
    def handle_data(self, data):
        data = data.strip()
        if data and len(data) > 0:
            self.TokenList.append(data)
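
For readers on Python 3, the same handle_data idea can be sketched as follows. The module is now html.parser, and the names TextCollector and text_tokens are illustrative rather than taken from the post. The key point for the original question is that text arrives through handle_data, so that is the method a subclass has to override.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-blank run of text found between tags."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        data = data.strip()
        if data:
            self.tokens.append(data)


def text_tokens(html):
    """Return the stripped text fragments of an HTML string, in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.tokens


tokens = text_tokens("<html><body><h1> Title </h1><p>One <b>two</b></p></body></html>")
```

The contents of script and style elements are also delivered to handle_data, so a converter that only wants visible text would need to skip them explicitly.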

Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

2006-07-07 Thread Kenneth McDonald
I'm writing a program that will parse HTML and (mostly) convert it to MediaWiki format. The two Python modules I'm aware of to do this are HTMLParser and htmllib. However, I'm currently experiencing either real or conceptual difficulty with both, and was wondering if I could get some advice. T

Re: sample code for parsing html file to get contents of td fields

2005-08-04 Thread Kent Johnson
yaffa wrote: > does anyone have sample code for parsing an html file to get contents > of a td field to write to a mysql db? even if you have everything but > the mysql db part I'll take it. http://www.crummy.com/software/BeautifulSoup/examples.html -- http://mail.python.org/mailman/listinfo/pyt

Re: sample code for parsing html file to get contents of td fields

2005-08-04 Thread Bill Mill
On 4 Aug 2005 11:54:38 -0700, yaffa <[EMAIL PROTECTED]> wrote: > does anyone have sample code for parsing an html file to get contents > of a td field to write to a mysql db? even if you have everything but > the mysql db part I'll take it. > Do you want something like this? In [1]: x = "someth

Re: sample code for parsing html file to get contents of td fields

2005-08-04 Thread William Park
yaffa <[EMAIL PROTECTED]> wrote: > does anyone have sample code for parsing an html file to get contents > of a td field to write to a mysql db? even if you have everything but > the mysql db part I'll take it. I usually use Expat XML parser to extract the field. http://home.eol.ca/~parkw/ind

sample code for parsing html file to get contents of td fields

2005-08-04 Thread yaffa
does anyone have sample code for parsing an html file to get contents of a td field to write to a mysql db? even if you have everything but the mysql db part I'll take it. thanks yaffa
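
A minimal sketch of the td-extraction half of this request, using only the Python 3 standard library. The database half is shown with sqlite3 purely as a stand-in so the example runs anywhere; the table name cells, its columns, and the names CellCollector and store_cells are assumptions for illustration, and a MySQL driver would need its own connection setup and placeholder style.

```python
import sqlite3
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text content of every td element."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buf = None  # text pieces of the td being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._buf is not None:
            self.cells.append("".join(self._buf).strip())
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def store_cells(html, conn):
    """Parse html, insert each td's text into the (hypothetical) cells table."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cells (position INTEGER, content TEXT)")
    # Parameterised inserts avoid quoting problems in the cell text.
    conn.executemany("INSERT INTO cells VALUES (?, ?)",
                     list(enumerate(parser.cells)))
    conn.commit()
    return parser.cells


conn = sqlite3.connect(":memory:")
cells = store_cells(
    "<table><tr><td>something</td><td> 42 </td></tr></table>", conn)
stored = conn.execute(
    "SELECT content FROM cells ORDER BY position").fetchall()
```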

Re: Parsing html :: output to comma delimited

2005-07-17 Thread samuels
Thanks for the replies, I'll post here when/if I get it finally working. So, now I know how to extract the links for the big page, and extract the text from the individual page. Really what I need to find out is how to run the script on each individual page automatically, and get the output in comm

Re: Parsing html :: output to comma delimited

2005-07-16 Thread Paul McGuire
Pyparsing includes a sample program for extracting URLs from web pages. You should be able to adapt it to this problem. Download pyparsing at http://pyparsing.sourceforge.net -- Paul

Re: Parsing html :: output to comma delimited

2005-07-16 Thread William Park
samuels <[EMAIL PROTECTED]> wrote: > Hello All, > > I am a total python newbie, and I need help writing a script. > > This is what I want to do: > > There is a list of links at http://www.rentalhq.com/fulllist.asp. Each > link goes to a page like, > http://www.rentalhq.com/store.asp?id=907%2F27

Parsing html :: output to comma delimited

2005-07-16 Thread samuels
Hello All, I am a total python newbie, and I need help writing a script. This is what I want to do: There is a list of links at http://www.rentalhq.com/fulllist.asp. Each link goes to a page like, http://www.rentalhq.com/store.asp?id=907%2F272%2D4425, that contains a company name, address, phon

Re: Parsing HTML with JavaScript

2005-05-13 Thread John J. Lee
[EMAIL PROTECTED] writes: > I am trying to extract some information from a few web pages, and I was > using the HTMLParser module. It worked fine until it got to the > javascript, at which it gave a parse error. Is there a good way to work > around this or should I just preparse the file to remove

Re: Parsing HTML with JavaScript

2005-05-13 Thread Richard Brodie
<[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > I am trying to extract some information from a few web pages, and I was > using the HTMLParser module. It worked fine until it got to the > javascript, at which it gave a parse error. It's fairly common for pages with Javascript to al

Parsing HTML with JavaScript

2005-05-13 Thread mtfulmer
I am trying to extract some information from a few web pages, and I was using the HTMLParser module. It worked fine until it got to the javascript, at which it gave a parse error. Is there a good way to work around this or should I just preparse the file to remove the javascript manually? This is m