Re: HTMLParser fragility

2006-04-10 Thread John J. Lee
"Lawrence D'Oliveiro" <[EMAIL PROTECTED]> writes:

> I've been using HTMLParser to scrape Web sites. The trouble with this 
> is, there's a lot of malformed HTML out there. Real browsers have to be 
> written to cope gracefully with this, but HTMLParser does not. Not only 
> does it raise an exception, but the parser object then gets into a 
> confused state after that so you cannot continue using it.
[...]

sgmllib.SGMLParser (or htmllib.HTMLParser) is more tolerant than
HTMLParser.HTMLParser.

BeautifulSoup derives from sgmllib.SGMLParser, and introduces extra
robustness, of a sort.


John

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: HTMLParser fragility

2006-04-07 Thread Richie Hindle

[Richie]
> But Tidy fails on huge numbers of real-world HTML pages.  [...]
> Is there a Python HTML tidier which will do as good a job as a browser?

[Walter]
> You can also use the HTML parser from libxml2

[Paul]
> libxml2 will attempt to parse HTML if asked to [...] See how it fixes
> up the mismatching tags.

Great!  Many thanks.

-- 
Richie Hindle
[EMAIL PROTECTED]


Re: HTMLParser fragility

2006-04-06 Thread Lawrence D'Oliveiro
In article <[EMAIL PROTECTED]>,
 Rene Pijlman <[EMAIL PROTECTED]> wrote:

>2. Use something more forgiving, like BeautifulSoup.
>http://www.crummy.com/software/BeautifulSoup/

That sounds like what I'm after!


Re: HTMLParser fragility

2006-04-06 Thread Paul Boddie
Richie Hindle wrote:
>
> But Tidy fails on huge numbers of real-world HTML pages.  Simple things like
> misspelled tags make it fail:
>
> >>> from mx.Tidy import tidy
> >>> results = tidy("Hello world!")

[Various error messages]

> Is there a Python HTML tidier which will do as good a job as a browser?

As pointed out elsewhere, libxml2 will attempt to parse HTML if asked
to:

>>> import libxml2dom
>>> d = libxml2dom.parseString("Hello world!", html=1)
>>> print d.toString()
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
Hello world!

See how it fixes up the mismatching tags. The libxml2dom package is
available in the usual place:

http://www.python.org/pypi/libxml2dom

Paul



Re: HTMLParser fragility

2006-04-06 Thread Walter Dörwald
Rene Pijlman wrote:
> Lawrence D'Oliveiro:
>> I've been using HTMLParser to scrape Web sites. The trouble with this 
>> is, there's a lot of malformed HTML out there. Real browsers have to be 
>> written to cope gracefully with this, but HTMLParser does not. 
> 
> There are two solutions to this:
> 
> 1. Tidy the source before parsing it.
> http://www.egenix.com/files/python/mxTidy.html
> 
> 2. Use something more forgiving, like BeautifulSoup.
> http://www.crummy.com/software/BeautifulSoup/

You can also use the HTML parser from libxml2 or any of the available
wrappers for it.

Bye,
   Walter Dörwald



Re: HTMLParser fragility

2006-04-05 Thread Richie Hindle

[Daniel]
> You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html) 
> as a first step to get well formed HTML.

But Tidy fails on huge numbers of real-world HTML pages.  Simple things like
misspelled tags make it fail:

>>> from mx.Tidy import tidy
>>> results = tidy("Hello world!")
>>> print results[3]
line 1 column 7 - Warning: inserting missing 'title' element
line 1 column 13 - Error:  is not recognized!
line 1 column 13 - Warning: discarding unexpected 
line 1 column 31 - Warning: discarding unexpected 
This document has errors that must be fixed before
using HTML Tidy to generate a tidied up version.

Is there a Python HTML tidier which will do as good a job as a browser?

-- 
Richie


Re: HTMLParser fragility

2006-04-05 Thread Daniel Dittmar
Lawrence D'Oliveiro wrote:
> I've been using HTMLParser to scrape Web sites. The trouble with this 
> is, there's a lot of malformed HTML out there. Real browsers have to be 
> written to cope gracefully with this, but HTMLParser does not. Not only 
> does it raise an exception, but the parser object then gets into a 
> confused state after that so you cannot continue using it.
> 
> The way I'm currently working around this is to do a dummy pre-parsing 
> run with a dummy (non-subclassed) HTMLParser object. Every time I hit 
> HTMLParseError, I note the line number in a set of lines to skip, then 
> create a new HTMLParser object and restart the scan from the beginning, 
> skipping all the lines I've noted so far. Only when I get to the end 
> without further errors do I do the proper parse with all my appropriate 
> actions.

You could try HTMLTidy (http://www.egenix.com/files/python/mxTidy.html) 
as a first step to get well formed HTML.

Daniel


Re: HTMLParser fragility

2006-04-05 Thread Rene Pijlman
Lawrence D'Oliveiro:
>I've been using HTMLParser to scrape Web sites. The trouble with this 
>is, there's a lot of malformed HTML out there. Real browsers have to be 
>written to cope gracefully with this, but HTMLParser does not. 

There are two solutions to this:

1. Tidy the source before parsing it.
http://www.egenix.com/files/python/mxTidy.html

2. Use something more forgiving, like BeautifulSoup.
http://www.crummy.com/software/BeautifulSoup/
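
A rough sketch of what option 2 looks like in practice. This is not
BeautifulSoup itself: it assumes Python 3, where the standard
html.parser.HTMLParser tolerates malformed markup instead of raising, and
uses it as a stand-in for a forgiving parser. The TagCollector class and
the sample input (a misspelled tag, an unclosed <b>, a missing </html>) are
made up for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the tags and text seen, without checking that they nest properly."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


# Made-up malformed input: unknown tag, unclosed <b>, no closing </html>.
malformed = "<html><body><bogus>Hello <b>world!</bogus></body>"

collector = TagCollector()
collector.feed(malformed)
collector.close()

print(collector.start_tags)
print(collector.end_tags)
print("".join(collector.text))
```

The parser reports whatever tags it sees and leaves it to the caller to
decide what to do about the mismatches, which is usually enough for
scraping.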

-- 
René Pijlman


HTMLParser fragility

2006-04-05 Thread Lawrence D'Oliveiro
I've been using HTMLParser to scrape Web sites. The trouble with this 
is, there's a lot of malformed HTML out there. Real browsers have to be 
written to cope gracefully with this, but HTMLParser does not. Not only 
does it raise an exception, but the parser object then gets into a 
confused state after that so you cannot continue using it.

The way I'm currently working around this is to do a dummy pre-parsing 
run with a dummy (non-subclassed) HTMLParser object. Every time I hit 
HTMLParseError, I note the line number in a set of lines to skip, then 
create a new HTMLParser object and restart the scan from the beginning, 
skipping all the lines I've noted so far. Only when I get to the end 
without further errors do I do the proper parse with all my appropriate 
actions.
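
The workaround described above might be sketched roughly as follows. This
is only an illustration under assumptions: StrictParser is a hypothetical
stand-in for a parser that raises on bad input (it rejects any line
containing "<<"), find_bad_lines is a made-up helper name, and the sketch
assumes the error surfaces while the offending line is being fed.

```python
class StrictParser:
    """Hypothetical strict parser: rejects any line containing '<<'."""

    def __init__(self):
        self.seen = []

    def feed(self, line):
        if "<<" in line:
            raise ValueError("malformed markup: %r" % line)
        self.seen.append(line)

    def close(self):
        pass


def find_bad_lines(lines, make_parser, error_type):
    """Dummy pre-parse: return the 1-based numbers of lines the parser rejects."""
    skip = set()
    while True:
        # Always start over with a fresh parser, since the old one may be
        # left in a confused state after an error.
        parser = make_parser()
        current = None
        try:
            for number, line in enumerate(lines, start=1):
                if number in skip:
                    continue
                current = number
                parser.feed(line)
            parser.close()
        except error_type:
            if current is None:
                raise  # failed without any line being fed; nothing to skip
            skip.add(current)
            continue  # restart the scan from the beginning
        return skip


page = ["<p>one</p>", "<<p>two</p>", "<p>three</p>", "<p>four<</p>"]

# Pass 1: dummy run to find the lines to skip.
bad = find_bad_lines(page, StrictParser, ValueError)

# Pass 2: the proper parse, skipping the lines noted above.
real = StrictParser()
for number, line in enumerate(page, start=1):
    if number not in bad:
        real.feed(line)
real.close()

print(sorted(bad))
print(real.seen)
```

Each restart rescans the page from the top, so the cost grows with the
number of bad lines. Dropping whole lines can also discard good markup
that happens to share a line with the bad part.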