Re: beautifulsoup .vs tidy

2006-07-02 Thread uche . ogbuji
bruce wrote:
> hi paddy...
>
> that's exactly what i'm trying to accomplish... i've used tidy, but it seems
> to still generate warnings...
>
>  initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)
>
> the xpath/libxml functions in the perl app complain regarding the file. my
> thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
> functions are too strict!
>
> which is why i decided to see if anyone on the python side has
> experienced/solved this problem..

FWIW here's my usual approach:

http://copia.ogbuji.net/blog/2005-07-22/Beyond_HTM

Personally, I avoid Tidy.  I've too often seen it crash or hang on
really bad HTML.  TagSoup seems to be built like a tank.  I've also
never seen BeautifulSoup choke, but I don't use it as much as TagSoup.
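The appeal of these soup-style parsers is that they recover from markup that would make a strict parser raise. A minimal modern-Python sketch of that behavior, using the stdlib `html.parser` as a stand-in for BeautifulSoup/TagSoup (the class name and sample markup are invented for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values without choking on malformed markup."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Deliberately broken: unclosed <b>, bare <br>, misnested <i>...</b>.
bad_html = '<p><b>hello <a href="/one">one</a><br><i></b>two</i>'
collector = LinkCollector()
collector.feed(bad_html)    # a forgiving parser never raises here
print(collector.links)      # ['/one']
```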

--
Uche Ogbuji   Fourthought, Inc.
http://uche.ogbuji.net    http://fourthought.com
http://copia.ogbuji.net   http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: beautifulsoup .vs tidy

2006-07-02 Thread Fredrik Lundh
Ravi Teja wrote:

>> Of course, lxml should be able to do this kind of thing as well. I'd be
>> interested to know why this "is not a good idea", though.
> 
> No reason that you don't know already.
> 
> http://www.boddie.org.uk/python/HTML.html
> 
> "If the document text is well-formed XML, we could omit the html
> parameter or set it to have a false value."
> 
> XML parsers are not required to be forgiving to be regarded compliant.
> And much HTML out there is not well formed.

so?  once you run it through an HTML-aware parser, the *resulting* 
structure is well formed.

a site generator->converter->xpath approach is no less reliable than any 
other HTML-scraping approach.
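A stdlib-only sketch of that generator->converter->xpath idea (the converter class and sample markup are invented; a real pipeline would use TagSoup, BeautifulSoup, or Tidy as the lenient front end):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "meta", "link", "input"}

class HtmlToTree(HTMLParser):
    """Toy generator->converter: lenient HTML in, well-formed tree out."""
    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, dict(attrs))
        if tag in VOID:
            self.builder.end(tag)      # void elements self-close
        else:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:          # ignore stray close tags
            while self.stack:          # close unclosed children first
                open_tag = self.stack.pop()
                self.builder.end(open_tag)
                if open_tag == tag:
                    break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        while self.stack:              # close anything left dangling
            self.builder.end(self.stack.pop())
        return self.builder.close()

soup = '<ul><li>one<li><a href="/two">two</a></ul>'  # unclosed <li>s
converter = HtmlToTree()
converter.feed(soup)
root = converter.finish()

# The *resulting* structure is well formed, so path queries work:
print([a.get("href") for a in root.findall(".//a")])   # ['/two']
```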





Re: beautifulsoup .vs tidy

2006-07-01 Thread Ravi Teja

Paul Boddie wrote:
> Ravi Teja wrote:
> >
> > 1.) XPath is not a good idea at all with "malformed" HTML or perhaps
> > web pages in general.
>
> import libxml2dom
> import urllib
> f = urllib.urlopen("http://wiki.python.org/moin/")
> s = f.read()
> f.close()
> # s contains HTML not XML text
> d = libxml2dom.parseString(s, html=1)
> # get the community-related links
> for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
>     print label.nodeValue

I wasn't aware that your module does html as well.

> Of course, lxml should be able to do this kind of thing as well. I'd be
> interested to know why this "is not a good idea", though.

No reason that you don't know already.

http://www.boddie.org.uk/python/HTML.html

"If the document text is well-formed XML, we could omit the html
parameter or set it to have a false value."

XML parsers are not required to be forgiving to be regarded compliant.
And much HTML out there is not well formed.
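That strictness is easy to demonstrate with a stdlib XML parser (the sample markup is invented for illustration):

```python
import xml.etree.ElementTree as ET

# Typical real-world tag soup: bare <br>, unclosed <p>.
tag_soup = '<html><body><p>hello<br>world</body></html>'

try:
    ET.fromstring(tag_soup)
    well_formed = True
except ET.ParseError as exc:
    well_formed = False
    print("compliant XML parser refused it:", exc)

print(well_formed)   # False
```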



Re: beautifulsoup .vs tidy

2006-07-01 Thread Matt Good
bruce wrote:
> that's exactly what i'm trying to accomplish... i've used tidy, but it seems
> to still generate warnings...
>
>  initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)
>
> the xpath/libxml functions in the perl app complain regarding the file. my
> thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
> functions are too strict!

Clean HTML is not valid XML.  If you want to process the output with an
XML library you'll need to tell Tidy to output XHTML.  Then it should
be valid for XML processing.
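A stdlib-only illustration of that point (the sample markup is invented; Tidy's actual XHTML output would include a full document skeleton):

```python
import xml.etree.ElementTree as ET

# The raw HTML form fails strict XML parsing...
try:
    ET.fromstring('<p>hello<br>world')
    raw_ok = True
except ET.ParseError:
    raw_ok = False

# ...but the XHTML that Tidy (with XHTML output enabled) would emit
# is well-formed XML, so ordinary XML tooling can process it.
xhtml = '<p>hello<br />world</p>'
root = ET.fromstring(xhtml)

print(raw_ok)                     # False
print(root.tag, root[0].tag)      # p br
```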

Of course BeautifulSoup is also a very nice library if you need to
extract some information, but don't necessarily require XML processing
to do it.

-- Matt Good



Re: beautifulsoup .vs tidy

2006-07-01 Thread Paul Boddie
Ravi Teja wrote:
>
> 1.) XPath is not a good idea at all with "malformed" HTML or perhaps
> web pages in general.

import libxml2dom
import urllib
f = urllib.urlopen("http://wiki.python.org/moin/")
s = f.read()
f.close()
# s contains HTML not XML text
d = libxml2dom.parseString(s, html=1)
# get the community-related links
for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
    print label.nodeValue

Of course, lxml should be able to do this kind of thing as well. I'd be
interested to know why this "is not a good idea", though.

Paul



Re: beautifulsoup .vs tidy

2006-07-01 Thread Fredrik Lundh
bruce wrote:

> that's exactly what i'm trying to accomplish... i've used tidy, but it seems
> to still generate warnings...
> 
>  initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)
> 
> the xpath/libxml functions in the perl app complain regarding the file.

what exactly do they complain about ?





RE: beautifulsoup .vs tidy

2006-07-01 Thread bruce
hi paddy...

that's exactly what i'm trying to accomplish... i've used tidy, but it seems
to still generate warnings...

 initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)

the xpath/libxml functions in the perl app complain regarding the file. my
thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
functions are too strict!

which is why i decided to see if anyone on the python side has
experienced/solved this problem..

-bruce


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] Behalf
Of Paddy
Sent: Saturday, July 01, 2006 1:09 AM
To: python-list@python.org
Subject: Re: beautifulsoup .vs tidy



bruce wrote:
> hi...
>
> never used perl, but i have an issue trying to resolve some html that
> appears to be "dirty/malformed" regarding the overall structure. in
> researching validators, i came across the beautifulsoup app and wanted to
> know if anybody could give me pros/cons of the app as it relates to any of
> the other validation apps...
>
I'm not too sure of what you are after. You mention tidy in the subject
which made me think that maybe you were trying to generate well-formed
HTML from malformed web pages that nonetheless browsers can interpret.
If that is the case then try HTML tidy:
  http://www.w3.org/People/Raggett/tidy/

- Pad.




Re: beautifulsoup .vs tidy

2006-07-01 Thread Paddy

bruce wrote:
> hi...
>
> never used perl, but i have an issue trying to resolve some html that
> appears to be "dirty/malformed" regarding the overall structure. in
> researching validators, i came across the beautifulsoup app and wanted to
> know if anybody could give me pros/cons of the app as it relates to any of
> the other validation apps...
>
I'm not too sure of what you are after. You mention tidy in the subject
which made me think that maybe you were trying to generate well-formed
HTML from malformed web pages that nonetheless browsers can interpret.
If that is the case then try HTML tidy:
  http://www.w3.org/People/Raggett/tidy/

- Pad.



Re: beautifulsoup .vs tidy

2006-06-30 Thread Ravi Teja
bruce wrote:
> hi...
>
> never used perl, but i have an issue trying to resolve some html that
> appears to be "dirty/malformed" regarding the overall structure. in
> researching validators, i came across the beautifulsoup app and wanted to
> know if anybody could give me pros/cons of the app as it relates to any of
> the other validation apps...
>
> the issue i'm facing involves parsing some websites, so i'm trying to
> extract information based on the DOM/XPath functions.. i'm using perl to
> handle the extraction

1.) XPath is not a good idea at all with "malformed" HTML or perhaps
web pages in general.
2.) BeautifulSoup is not a validator but works well with bad HTML. Also
look at Mechanize and ClientForm.
3.) XMLStarlet is a good XML validator
(http://xmlstar.sourceforge.net/). It's not Python but you don't need
to care about the language it is written in.
4.) For a simple HTML validator, just use http://validator.w3.org/



beautifulsoup .vs tidy

2006-06-30 Thread bruce
hi...

never used perl, but i have an issue trying to resolve some html that
appears to be "dirty/malformed" regarding the overall structure. in
researching validators, i came across the beautifulsoup app and wanted to
know if anybody could give me pros/cons of the app as it relates to any of
the other validation apps...

the issue i'm facing involves parsing some websites, so i'm trying to
extract information based on the DOM/XPath functions.. i'm using perl to
handle the extraction

thanks

-bruce
[EMAIL PROTECTED]
