Re: HTML Parser

2013-07-02 Thread Joshua Landau
On 2 July 2013 18:43, wrote: > I could not use BeautifulSoup as I did not find an .exe file. Were you perhaps looking for a .exe file to install BeautifulSoup? It's quite plausible that a windows user like you might be dazzled at the idea of a .tar.gz. I suggest just using "pip install beautifu

Re: HTML Parser

2013-07-02 Thread Steven D'Aprano
On Tue, 02 Jul 2013 10:43:03 -0700, subhabangalore wrote: > I could not use BeautifulSoup as I did not find an .exe file. I believe that BeautifulSoup is a pure-Python module, and so does not have a .exe file. However, it does have good tutorials: https://duckduckgo.com/html/?q=beautifulsoup+tu

Re: HTML Parser

2013-07-02 Thread Neil Cerutti
On 2013-07-02, subhabangal...@gmail.com wrote: > Dear Group, > > I was looking for a good tutorial for a "HTML Parser". My > intention was to extract tables from web pages or information > from tables in web pages. > > I tried to make a search, I got HTMLParser, B

HTML Parser

2013-07-02 Thread subhabangalore
Dear Group, I was looking for a good tutorial for a "HTML Parser". My intention was to extract tables from web pages or information from tables in web pages. I tried to make a search, I got HTMLParser, BeautifulSoup, etc. HTMLParser works fine for me, but I am looking for a good t
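For the table-extraction goal described above, a minimal sketch using only the standard library might look like the following. It uses Python 3's `html.parser` (the successor of the `HTMLParser` module named in the post); the class name `TableExtractor` and the helper `extract_tables` are hypothetical names chosen for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows; each row is a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []    # one list of rows per <table> seen so far
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in an HTML string (hypothetical helper)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Calling `extract_tables(page_source)` returns a list of tables, so the first row of the first table is `tables[0][0]`. Pages with nested tables or unclosed cells would need extra bookkeeping, which is where BeautifulSoup tends to be more convenient.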

Re: intolerant HTML parser

2010-02-12 Thread Jim
I want to thank everyone for the help, which I found very useful (the parts that I understood :-) ). Since I think there was some question, it happens that I am working under django and submitting a certain form triggers an html mail. I wanted to validate the html in some of my unit tests. It is

Re: intolerant HTML parser

2010-02-10 Thread Lawrence D'Oliveiro
In message <4b712919$0$6584$9b4e6...@newsspool3.arcor-online.net>, Stefan Behnel wrote: > Usually PyPI. Where do you think these tools come from? They don’t write themselves, you know. -- http://mail.python.org/mailman/listinfo/python-list

Re: intolerant HTML parser

2010-02-09 Thread Stefan Behnel
Lawrence D'Oliveiro, 08.02.2010 22:39: > In message <4b6fe93d$0$6724$9b4e6...@newsspool2.arcor-online.net>, Stefan > Behnel wrote: > >> - generating HTML using a tool that guarantees correct HTML output > > Where do you think these tools come from? Usually PyPI. Stefan

Re: intolerant HTML parser

2010-02-08 Thread Phlip
and the tweak is: parser = etree.HTMLParser(recover=False) return etree.HTML(xml, parser) That reduces tolerance. The entire assert_xml() is (apologies for wrapping lines!): def _xml_to_tree(self, xml): from lxml import etree self._xml = xml

Re: intolerant HTML parser

2010-02-08 Thread Phlip
Stefan Behnel wrote: > I don't read it that way. There's a huge difference between > > - generating HTML manually and validating (some of) it in a unit test > > and > > - generating HTML using a tool that guarantees correct HTML output > > the advantage of the second approach being that others hav

Re: intolerant HTML parser

2010-02-08 Thread Stefan Behnel
Lawrence D'Oliveiro, 08.02.2010 11:19: > In message <4b6fd672$0$6734$9b4e6...@newsspool2.arcor-online.net>, Stefan > Behnel wrote: > >> Jim, 06.02.2010 20:09: >> >>> I generate some HTML and I want to include in my unit tests a check >>> for syntax. So I am looking for a program that will compla

Re: intolerant HTML parser

2010-02-08 Thread Lawrence D'Oliveiro
In message <4b6fd672$0$6734$9b4e6...@newsspool2.arcor-online.net>, Stefan Behnel wrote: > Jim, 06.02.2010 20:09: > >> I generate some HTML and I want to include in my unit tests a check >> for syntax. So I am looking for a program that will complain at any >> syntax irregularities. > > First th

Re: intolerant HTML parser

2010-02-08 Thread Stefan Behnel
Jim, 06.02.2010 20:09: > I generate some HTML and I want to include in my unit tests a check > for syntax. So I am looking for a program that will complain at any > syntax irregularities. First thing to note here is that you should consider switching to an HTML generation tool that does this auto

Re: intolerant HTML parser

2010-02-06 Thread Nobody
On Sat, 06 Feb 2010 11:09:31 -0800, Jim wrote: > I generate some HTML and I want to include in my unit tests a check > for syntax. So I am looking for a program that will complain at any > syntax irregularities. > > I am familiar with Beautiful Soup (use it all the time) but it is > intended to

Re: intolerant HTML parser

2010-02-06 Thread Jim
Thank you, John. I did not find that by looking around; I must not have used the right words. The speed of the unit tests is not critical so this seems like the solution for me. Jim

Re: intolerant HTML parser

2010-02-06 Thread John Nagle
Jim wrote: I generate some HTML and I want to include in my unit tests a check for syntax. So I am looking for a program that will complain at any syntax irregularities. I am familiar with Beautiful Soup (use it all the time) but it is intended to cope with bad syntax. I just tried feeding HTM

intolerant HTML parser

2010-02-06 Thread Jim
I generate some HTML and I want to include in my unit tests a check for syntax. So I am looking for a program that will complain at any syntax irregularities. I am familiar with Beautiful Soup (use it all the time) but it is intended to cope with bad syntax. I just tried feeding HTMLParser.HTMLP
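One way to get a parser that complains in a unit test is to layer a strict check on top of a tolerant one. The sketch below is not the lxml `recover=False` approach mentioned elsewhere in this thread; it is a swapped-in, stdlib-only technique that checks tag balance with Python 3's `html.parser`. The names `StrictTagChecker`, `HTMLSyntaxError` and `check_html` are hypothetical, and the check is deliberately stricter than HTML itself (it rejects legally omitted end tags such as `</p>` or `</li>`), so it is no substitute for a real validator.

```python
from html.parser import HTMLParser

# Elements that never take an end tag in HTML.
VOID_TAGS = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "source", "track", "wbr"})


class HTMLSyntaxError(ValueError):
    """Raised when tags are unbalanced (hypothetical exception type)."""


class StrictTagChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self._open = []  # stack of (tag, (line, column)) for open elements

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._open.append((tag, self.getpos()))

    def handle_startendtag(self, tag, attrs):
        # <br/> style tags open and close at once; nothing to track.
        pass

    def handle_endtag(self, tag):
        line, col = self.getpos()
        if not self._open:
            raise HTMLSyntaxError(f"line {line}, col {col}: stray </{tag}>")
        expected, _ = self._open.pop()
        if expected != tag:
            raise HTMLSyntaxError(
                f"line {line}, col {col}: expected </{expected}>, got </{tag}>")

    def close(self):
        super().close()
        if self._open:
            tag, (line, col) = self._open[-1]
            raise HTMLSyntaxError(f"line {line}, col {col}: <{tag}> never closed")


def check_html(html):
    """Raise HTMLSyntaxError if the tags in `html` are not properly nested."""
    checker = StrictTagChecker()
    checker.feed(html)
    checker.close()
```

In a unit test, `check_html(rendered_mail_body)` passes silently on balanced markup and raises on the first mismatch, which `assertRaises` can also exercise for negative cases.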

Re: HTML Parser which allows low-keyed local changes (upon serialization)

2010-02-01 Thread Tim Arnold
"Robert" wrote in message news:hk729b$na...@news.albasani.net... > Stefan Behnel wrote: >> Robert, 01.02.2010 14:36: >>> Stefan Behnel wrote: Robert, 31.01.2010 20:57: > I tried lxml, but after walking and making changes in the element > tree, > I'm forced to do a full serializ

Re: HTML Parser which allows low-keyed local changes?

2010-02-01 Thread Nobody
ginal HTML code. > makes it rather unreadable. > > is there an existing HTML parser which supports tracking/writing > back particular changes in a cautious way by just making local > changes? or a least tracks the tag start/end positions in the file? HTMLParser, sgmllib.SGMLPars

Re: HTML Parser which allows low-keyed local changes (upon serialization)

2010-02-01 Thread M.-A. Lemburg
Robert wrote: > I think you confused the logical level of what I meant with "file > position": > Of course it's not about (necessarily) writing back to the same open file > (OS-level), but regarding the whole serialization string (wherever it > is finally written to - I typically write the auto-con

Re: HTML Parser which allows low-keyed local changes (upon serialization)

2010-02-01 Thread Robert
Stefan Behnel wrote: Robert, 01.02.2010 14:36: Stefan Behnel wrote: Robert, 31.01.2010 20:57: I tried lxml, but after walking and making changes in the element tree, I'm forced to do a full serialization of the whole document (etree.tostring(tree)) - which destroys the "human edited" format of

Re: HTML Parser which allows low-keyed local changes (upon serialization)

2010-02-01 Thread Stefan Behnel
Robert, 01.02.2010 14:36: > Stefan Behnel wrote: >> Robert, 31.01.2010 20:57: >>> I tried lxml, but after walking and making changes in the element tree, >>> I'm forced to do a full serialization of the whole document >>> (etree.tostring(tree)) - which destroys the "human edited" format of the >>>

Re: HTML Parser which allows low-keyed local changes (upon serialization)

2010-02-01 Thread Robert
Robert wrote: Stefan Behnel wrote: Robert, 31.01.2010 20:57: I tried lxml, but after walking and making changes in the element tree, I'm forced to do a full serialization of the whole document (etree.tostring(tree)) - which destroys the "human edited" format of the original HTML code. makes it

Re: HTML Parser which allows low-keyed local changes (upon serialization)

2010-02-01 Thread Robert
Stefan Behnel wrote: Robert, 31.01.2010 20:57: I tried lxml, but after walking and making changes in the element tree, I'm forced to do a full serialization of the whole document (etree.tostring(tree)) - which destroys the "human edited" format of the original HTML code. makes it rather unreadab

Re: HTML Parser which allows low-keyed local changes?

2010-02-01 Thread Stefan Behnel
Robert, 31.01.2010 20:57: > I tried lxml, but after walking and making changes in the element tree, > I'm forced to do a full serialization of the whole document > (etree.tostring(tree)) - which destroys the "human edited" format of the > original HTML code. makes it rather unreadable. What do you

HTML Parser which allows low-keyed local changes?

2010-01-31 Thread Robert
I tried lxml, but after walking and making changes in the element tree, I'm forced to do a full serialization of the whole document (etree.tostring(tree)) - which destroys the "human edited" format of the original HTML code. makes it rather unreadable. is there an existing HT
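The "track the tag start positions and make local changes" idea can be sketched with the standard library alone. The following assumes Python 3's `html.parser`, whose `getpos()` and `get_starttag_text()` methods report where a start tag begins and what its raw text was. The class `StartTagLocator` and the function `replace_start_tags` are hypothetical names, and only start tags are rewritten here; everything else in the source string is left byte-for-byte untouched.

```python
from html.parser import HTMLParser


class StartTagLocator(HTMLParser):
    """Record absolute (start, end) offsets of every start tag named `wanted`."""

    def __init__(self, source, wanted):
        super().__init__()
        self.wanted = wanted
        self.spans = []
        # Offset of the first character of each line. The parser counts
        # only "\n" as a line break, so the same rule is used here.
        self._line_starts = [0] + [i + 1 for i, ch in enumerate(source)
                                   if ch == "\n"]
        self.feed(source)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted:
            line, col = self.getpos()  # line is 1-based, column is 0-based
            start = self._line_starts[line - 1] + col
            end = start + len(self.get_starttag_text())
            self.spans.append((start, end))


def replace_start_tags(source, wanted, new_text):
    """Swap each `wanted` start tag for `new_text`, leaving the rest as written."""
    spans = StartTagLocator(source, wanted).spans
    # Splice from the end so that earlier offsets stay valid.
    for start, end in reversed(spans):
        source = source[:start] + new_text + source[end:]
    return source
```

Because the edit is a string splice rather than a re-serialization, indentation, attribute quoting and comments elsewhere in the hand-edited file are preserved.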

Re: Python HTML parser chokes on UTF-8 input

2008-10-17 Thread John Nagle
Johannes Bauer wrote: Hello group, I'm trying to use a htmllib.HTMLParser derivate class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: As soon as I pass the htmllib.HTMLParser UTF-8 code, it chokes. The code is something like

Re: Python HTML parser chokes on UTF-8 input

2008-10-10 Thread Marc 'BlackJack' Rintsch
On Fri, 10 Oct 2008 00:13:36 +0200, Johannes Bauer wrote: > Terry Reedy wrote: >> I believe you are confusing unicode with unicode encoded into bytes >> with the UTF-8 encoding. Having a problem feeding a unicode string, >> not 'UTF-8 code', which in Python can only mean a UTF-8 encoded byte >>

Re: Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Terry Reedy
Johannes Bauer wrote: Terry Reedy wrote: Johannes Bauer wrote: Hello group, I'm trying to use a htmllib.HTMLParser derivate class to parse a website which I fetched via httplib.HTTPConnection().request().getresponse().read(). Now the problem is: As soon as I pass the htmllib.HTMLParser UTF-8

Re: Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Jerry Hill
On Thu, Oct 9, 2008 at 4:54 PM, Johannes Bauer <[EMAIL PROTECTED]> wrote: > Hello group, > > Now when I take "website" directly from the parser, everything is fine. > However I want to do some modifications before I parse it, namely UTF-8 > modifications in the style: > > website = website.replace(

Re: Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Johannes Bauer
al not in range(128) > > When you do not bother to specify some other encoding in an encoding > operation, sgmllib or something deeper in Python tries the default > encoding, which does not work. Stop being annoyed and tell the > interpreter what you want. It is not a mind-read

Re: Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Terry Reedy
0xfc in position 0: ordinal not in range(128) When you do not bother to specify some other encoding in an encoding operation, sgmllib or something deeper in Python tries the default encoding, which does not work. Stop being annoyed and tell the interpreter what you want. It is not a mind-read

Python HTML parser chokes on UTF-8 input

2008-10-09 Thread Johannes Bauer
ttag self._convert_ref, attrvalue) UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 0: ordinal not in range(128) Annoying, IMHO, that the internal html Parser cannot cope with UTF-8 input - which should (again, IMHO) be the absolute standard for s
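The advice in the replies (tell the interpreter which encoding you mean) can be illustrated with a short sketch. It uses Python 3's `html.parser` rather than the Python 2 `htmllib` from the thread, since in Python 3 the parser accepts only `str` and the decode step has to be explicit. `TextCollector` and `text_from_bytes` are hypothetical names, and the default of UTF-8 is an assumption; a real fetcher would take the charset from the HTTP response headers.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gather all character data found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_from_bytes(raw, encoding="utf-8"):
    """Decode fetched bytes explicitly, then parse the resulting str."""
    html = raw.decode(encoding)  # raises UnicodeDecodeError if the guess is wrong
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

Any string replacements on the page are then done on the decoded `str`, so bytes and text are never mixed in one operation.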

Re: Good HTML Parser

2008-07-17 Thread Stefan Behnel
Chris wrote: > Can anyone recommend a good HTML/XHTML parser, similar to > HTMLParser.HTMLParser or htmllib.HTMLParser, but able to intelligently > know that certain tags, like , are implicitly closed? I need to > iterate through the entire DOM, building up a DOM path, but the stdlib > parsers aren

Re: Good HTML Parser

2008-07-17 Thread Diez B. Roggisch
Chris wrote: > Can anyone recommend a good HTML/XHTML parser, similar to > HTMLParser.HTMLParser or htmllib.HTMLParser, but able to intelligently > know that certain tags, like , are implicitly closed? I need to > iterate through the entire DOM, building up a DOM path, but the stdlib > parsers are

Good HTML Parser

2008-07-17 Thread Chris
Can anyone recommend a good HTML/XHTML parser, similar to HTMLParser.HTMLParser or htmllib.HTMLParser, but able to intelligently know that certain tags, like , are implicitly closed? I need to iterate through the entire DOM, building up a DOM path, but the stdlib parsers aren't calling handle_endta
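A rough sketch of building a DOM path while treating some tags as implicitly closed is shown below. It uses Python 3's `html.parser`; the archive does not preserve which tag the poster meant, so the set of void elements (`br`, `img` and so on) is an assumption, and `PathTracker` is a hypothetical name. End tags that match nothing on the stack are simply ignored.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so no end-tag callback is expected.
VOID_TAGS = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "source", "track", "wbr"})


class PathTracker(HTMLParser):
    """Record a slash-separated path for every element encountered."""

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open elements, outermost first
        self.paths = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Implicitly closed: report the path but do not push the tag.
            self.paths.append("/".join(self._stack + [tag]))
        else:
            self._stack.append(tag)
            self.paths.append("/".join(self._stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; skip end tags with no match.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass
```

After `tracker = PathTracker(); tracker.feed(page_source)`, `tracker.paths` lists elements in document order. Tags with optional end tags (such as `p` or `li`) would need their own rules on top of this.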

Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-23 Thread Fuzzyman
[EMAIL PROTECTED] wrote: > Hi, I am looking for a HTML parser who can parse a given page into > a DOM tree, and can reconstruct the exact original html sources. > Strictly speaking, I should be allowed to retrieve the original > sources at each internal nodes of the DOM tree. >

Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-23 Thread A.T.Hofkamp
On 2008-01-23, kliu <[EMAIL PROTECTED]> wrote: > On Jan 23, 7:39 pm, "A.T.Hofkamp" <[EMAIL PROTECTED]> wrote: >> On 2008-01-23, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: >> >> > Hi, I am looking for a HTML parser who can parse a given

Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-23 Thread Stefan Behnel
Hi, kliu wrote: > what I really need is the mapping between each DOM nodes and > the corresponding original source segment. I don't think that will be easy to achieve. You could get away with a parser that provides access to the position of an element in the source, and then map changes back into

Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-23 Thread Paul Boddie
On 23 Jan, 14:20, kliu <[EMAIL PROTECTED]> wrote: > > Thank u for your reply. but what I really need is the mapping between > each DOM nodes and the corresponding original source segment. At the risk of promoting unfashionable DOM technologies, you can at least serialise fragments of the DOM in li

Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-23 Thread kliu
On Jan 23, 7:39 pm, "A.T.Hofkamp" <[EMAIL PROTECTED]> wrote: > On 2008-01-23, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > > > Hi, I am looking for a HTML parser who can parse a given page into > > a DOM tree, and can reconstruct the exact original

Re: Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-23 Thread A.T.Hofkamp
On 2008-01-23, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > Hi, I am looking for a HTML parser who can parse a given page into > a DOM tree, and can reconstruct the exact original html sources. Why not keep a copy of the original data instead? That would be VERY MUCH SIMPLER

Is there a HTML parser who can reconstruct the original html EXACTLY?

2008-01-22 Thread ioscas
Hi, I am looking for a HTML parser who can parse a given page into a DOM tree, and can reconstruct the exact original html sources. Strictly speaking, I should be allowed to retrieve the original sources at each internal nodes of the DOM tree. I have tried Beautiful Soup who is really

Re: HTML Parser for Jython

2007-10-17 Thread Falcolas
On Oct 17, 9:50 am, Carsten Haese <[EMAIL PROTECTED]> wrote: > Recent releases of BeautifulSoup need Python 2.3+, so they won't work on > current Jython, but BeautifulSoup 1.x will work. Thank you.

Re: HTML Parser for Jython

2007-10-17 Thread Tim Chase
> Does anybody know of a decent HTML parser for Jython? I have to do > some screen scraping, and would rather use a tested module instead of > rolling my own. GIYF[0][1] There are the batteries-included HTMLParser[2] and htmllib[3] modules, and the ever-popular (and more developer

Re: HTML Parser for Jython

2007-10-17 Thread Carsten Haese
On Wed, 2007-10-17 at 17:36 +0200, Stefan Behnel wrote: > Falcolas wrote: > > Does anybody know of a decent HTML parser for Jython? I have to do > > some screen scraping, and would rather use a tested module instead of > > rolling my own. > > Not sure if it works, but h

Re: HTML Parser for Jython

2007-10-17 Thread Stefan Behnel
Falcolas wrote: > Does anybody know of a decent HTML parser for Jython? I have to do > some screen scraping, and would rather use a tested module instead of > rolling my own. Not sure if it works, but have you tried BeautifulSoup? Or maybe an older version of it? Stefan

HTML Parser for Jython

2007-10-17 Thread Falcolas
Does anybody know of a decent HTML parser for Jython? I have to do some screen scraping, and would rather use a tested module instead of rolling my own. Thanks! GP

Re: Html parser

2007-06-15 Thread Nikita the Spider
In article <[EMAIL PROTECTED]>, Stephen R Laniel <[EMAIL PROTECTED]> wrote: > On Fri, Jun 15, 2007 at 07:11:56AM -0700, HMS Surprise wrote: > > Could you recommend an html parser that works with python (jython > > 2.2)? > > I'm new here, but I believe B

Re: Html parser

2007-06-15 Thread Lee Hinde
On Jun 15, 7:11 am, HMS Surprise <[EMAIL PROTECTED]> wrote: > Could you recommend an html parser that works with python (jython > 2.2)? HTMLParser does not seem to be in this library. To test some > of our browser based (mainly php) code I seek for field names and > values as

Re: Html parser

2007-06-15 Thread HMS Surprise
Thanks, jh

Re: Html parser

2007-06-15 Thread Stephen R Laniel
On Fri, Jun 15, 2007 at 07:11:56AM -0700, HMS Surprise wrote: > Could you recommend an html parser that works with python (jython > 2.2)? I'm new here, but I believe BeautifulSoup is the canonical answer: http://www.crummy.com/software/BeautifulSoup/ -- Stephen R. Laniel [EMAIL PROT

Html parser

2007-06-15 Thread HMS Surprise
Could you recommend an html parser that works with python (jython 2.2)? HTMLParser does not seem to be in this library. To test some of our browser based (mainly php) code I seek for field names and values associated with them. Thanks, jh

Re: HTML Parser in python

2007-04-06 Thread eknowles
Beautiful Soup. http://www.crummy.com/software/BeautifulSoup/ Works, well...beautifully.

Re: HTML Parser in python

2007-04-06 Thread kyosohma
On Apr 6, 1:05 pm, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote: > Hi, > > Is there a HTML parser (not xml) in python? > I need a html parser which has the ability to handle mal-format html > pages. > > Thank you. Yeah...it's called Beautiful Soup. ht

HTML Parser in python

2007-04-06 Thread [EMAIL PROTECTED]
Hi, Is there a HTML parser (not xml) in python? I need a html parser which has the ability to handle mal-format html pages. Thank you.

Re: Using sax libxml2 html parser

2007-01-06 Thread Stefan Behnel
[EMAIL PROTECTED] wrote: > I have created an example using libxml2 based on the code that appears > in http://xmlsoft.org/python.html. > My example processes enough html files to see that the > memory consumption rises till the process ends (I check it with the > 'top' command). Try

Using sax libxml2 html parser

2007-01-05 Thread cesar . ortiz
Hi all, I have created an example using libxml2 based on the code that appears in http://xmlsoft.org/python.html. My example processes enough html files to see that the memory consumption rises till the process ends (I check it with the 'top' command). I don't know if I am forgetting

Re: Looking for a decent HTML parser for Python...

2006-12-06 Thread hubritic
Agreed that the web sites are probably broken. Try running the HTML though HTMLTidy (http://tidy.sourceforge.net/). Doing that has allowed me to parse where I had problem such as yours. I have also had luck with BeautifulSoup, which also includes a tidy function in it. Just Another Victim of t

Re: Looking for a decent HTML parser for Python...

2006-12-06 Thread Stephen Eilert
Fredrik Lundh wrote: > > Except it appears to be buggy or, at least, not very robust. There are > > websites for which it falsely terminates early in the parsing. > > which probably means that the sites are broken. the amount of broken > HTML on the net is staggering, as is the amount of

Re: Looking for a decent HTML parser for Python...

2006-12-05 Thread Fredrik Lundh
> Except it appears to be buggy or, at least, not very robust. There are > websites for which it falsely terminates early in the parsing. which probably means that the sites are broken. the amount of broken HTML on the net is staggering, as is the amount of code in a typical web browser

Re: Looking for a decent HTML parser for Python...

2006-12-05 Thread Just Another Victim of the Ambient Morality
"Just Another Victim of the Ambient Morality" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > >Okay, I think I found what I'm looking for in HTMLParser in the > HTMLParser module. Except it appears to be buggy or, at least, not very robust. There are websites for which i

Re: Looking for a decent HTML parser for Python...

2006-12-05 Thread Just Another Victim of the Ambient Morality
"Just Another Victim of the Ambient Morality" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] >I'm trying to parse HTML in a very generic way. >So far, I'm using SGMLParser in the sgmllib module. The problem is > that it forces you to parse very specific tags through object

Looking for a decent HTML parser for Python...

2006-12-05 Thread Just Another Victim of the Ambient Morality
I'm trying to parse HTML in a very generic way. So far, I'm using SGMLParser in the sgmllib module. The problem is that it forces you to parse very specific tags through object methods like start_a(), start_p() and the like, forcing you to know exactly which tags you want to handle. I

Re: html parser , unexpected '<' char in declaration

2006-02-21 Thread Jesus Rivero (Neurogeek)
Oopss! You are totally right guys, i did miss the closing '>' thinking about maybe errors in the use of ' or ". Jesus Tim Roberts wrote: >"Jesus Rivero - (Neurogeek)" <[EMAIL PROTECTED]> wrote: > > >>hmmm, that's kind of different issue then. >> >>I can guess, from the error you pasted earlie

Re: html parser , unexpected '<' char in declaration

2006-02-21 Thread Sakcee
thanks for the suggestions, this is not happening frequently, actually this is the first time I have seen this exception in the system, which means that some spam message was generated with ill-formated html. i guess the best way would be to check using regular expression and delete the unclosed t

Re: html parser , unexpected '<' char in declaration

2006-02-21 Thread Tim Roberts
"Jesus Rivero - (Neurogeek)" <[EMAIL PROTECTED]> wrote: > >hmmm, that's kind of different issue then. > >I can guess, from the error you pasted earlier, that the problem shown >is due to the fact Python is interpreting a "<" as an expression and not >as a char. review your code or try to figure out

Re: html parser , unexpected '<' char in declaration

2006-02-20 Thread Jesus Rivero - (Neurogeek)
hmmm, that's kind of different issue then. I can guess, from the error you pasted earlier, that the problem shown is due to the fact Python is interpreting a "<" as an expression and not as a char. review your code or try to figure out the exact input

Re: html parser , unexpected '<' char in declaration

2006-02-20 Thread Sakcee
Thanks for the reply. Well, probably I should explain more. This is part of an email. After the MTA delivers the email, it is stored in a local dir. After that the email is being parsed by the parser inside a web-based IMAP client at display time. I don't think I have the choice of rewriting the

Re: html parser , unexpected '<' char in declaration

2006-02-20 Thread Jesus Rivero - (Neurogeek)
Sakcee wrote: > html = > ' \r\n Foo foo , blah blah > ' > > html = """ Foo foo , blah blah """ Try checking your html code. It looks really messy. '

html parser , unexpected '<' char in declaration

2006-02-20 Thread Sakcee
html = ' \r\n Foo foo , blah blah ' >>> import htmllib >>> import formatter >>> parser=htmllib.HTMLParser(formatter.NullFormatter()) >>> parser.feed(html) Traceback (most recent call last): File "", line 1, in ? File "/usr/lib/python2.4/sgmllib.py", line 95, in feed self.goahead(0) File

Encoding detection in the html parser from libxml2

2006-02-07 Thread icoba
Hi, I am parsing html documents using the html parser from libxml2, and if the encoding is included in the document it works perfectly but if it is not, I think it does not work well (probably because I am doing something wrong). As it is said in http://xmlsoft.org/encoding.html the parser

Re: html parser?

2005-10-19 Thread leonardr
To extract links without the overhead of Beautiful Soup, one option is to copy what Beautiful Soup does, and write a SGMLParser subclass that only looks at 'a' tags. In general I think writing SGMLParser subclasses is a big pain (which is why I wrote Beautiful Soup), but since you only care about o

Re: html parser?

2005-10-19 Thread Laszlo Zsolt Nagy
Thorsten Kampe wrote: >* Christoph Söllner (2005-10-18 12:20 +0100) > > >>right, that's what I was looking for. Thanks very much. >> >> > >For simple things like that "BeautifulSoup" might be overkill. > >import formatter, \ > htmllib, \ > urllib > >url = 'http://python.org'

Re: html parser?

2005-10-18 Thread Paul Boddie
Thorsten Kampe wrote: > For simple things like that "BeautifulSoup" might be overkill. [HTMLParser example] I've used SGMLParser with some success before, although the SAX-style processing is objectionable to many people. One alternative is to use libxml2dom [1] and to parse documents as HTML: i

Re: html parser?

2005-10-18 Thread Thorsten Kampe
* Christoph Söllner (2005-10-18 12:20 +0100) > right, that's what I was looking for. Thanks very much. For simple things like that "BeautifulSoup" might be overkill. import formatter, \ htmllib, \ urllib url = 'http://python.org' htmlp = htmllib.HTMLParser(formatter.NullForm

Re: html parser?

2005-10-18 Thread Christoph Söllner
right, that's what I was looking for. Thanks very much.

Re: html parser?

2005-10-18 Thread Laszlo Zsolt Nagy
Christoph Söllner wrote: >Hi *, > >is there a html parser available, which could i.e. extract all links from a >given text like that: >""" >BAR >BAR2 >""" > >and return a set of dicts like that: >""" >{ > ['

html parser?

2005-10-18 Thread Christoph Söllner
Hi *, is there a html parser available, which could i.e. extract all links from a given text like that: """ BAR BAR2 """ and return a set of dicts like that: """ { ['foo.php','BAR','param1','test'],
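The link-extraction request above can be sketched with the standard library. The original markup was stripped from this archive, so the `<a href="...">` form assumed here is a guess at what the poster had in mind. The code uses Python 3's `html.parser` and `urllib.parse`; `LinkExtractor` and `extract_links` are hypothetical names, and each link is returned as a dict holding the path, the link text and the query parameters.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlsplit


class LinkExtractor(HTMLParser):
    """Collect the target, text and query parameters of every <a href=...>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the link being read, or None
        self._text = []    # text fragments inside the current link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            parts = urlsplit(self._href)
            self.links.append({
                "path": parts.path,
                "text": "".join(self._text).strip(),
                "params": dict(parse_qsl(parts.query)),
            })
            self._href = None


def extract_links(html):
    """Return one dict per link found in an HTML string (hypothetical helper)."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Note that the parser decodes character references inside attribute values, so ampersands in a query string should be written as `&amp;` in the markup for the parameters to come through unchanged.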

Re: robust html parser

2005-08-15 Thread James Stroud
http://www.crummy.com/software/BeautifulSoup/ On Monday 15 August 2005 03:33 pm, BRA_MIK wrote: > I'm looking for a good and robust html parser that could parse complex > html/xhtml document without crashing (possibly free) > > Could you help me please ? > > TIA > MB

robust html parser

2005-08-15 Thread BRA_MIK
I'm looking for a good and robust html parser that could parse complex html/xhtml document without crashing (possibly free) Could you help me please ? TIA MB

Behaviour of htmllib's HTML parser and formatter

2005-03-10 Thread Morten W. Petersen
Hi, I have an HTML page that displays some content, and a part of that content is HTML changed into regular text. The encoding of the page is UTF-8. Here's the code that makes the change (the HTML in self.contents is UTF-8 encoded): file = cStringIO.StringIO() parser = htmllib.HTMLParser(format