Re: beautifulsoup .vs tidy
bruce wrote:
> hi paddy...
>
> that's exactly what i'm trying to accomplish... i've used tidy, but it seems
> to still generate warnings...
>
> initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)
>
> the xpath/libxml functions in the perl app complain regarding the file. my
> thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
> functions are too strict!
>
> which is why i decided to see if anyone on the python side has
> experienced/solved this problem..

FWIW here's my usual approach:

http://copia.ogbuji.net/blog/2005-07-22/Beyond_HTM

Personally, I avoid Tidy. I've too often seen it crash or hang on really
bad HTML. TagSoup seems to be built like a tank. I've also never seen
BeautifulSoup choke, but I don't use it as much as TagSoup.

--
Uche Ogbuji                  Fourthought, Inc.
http://uche.ogbuji.net       http://fourthought.com
http://copia.ogbuji.net      http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/

--
http://mail.python.org/mailman/listinfo/python-list
Re: beautifulsoup .vs tidy
Ravi Teja wrote:
>> Of course, lxml should be able to do this kind of thing as well. I'd be
>> interested to know why this "is not a good idea", though.
>
> No reason that you don't know already.
>
> http://www.boddie.org.uk/python/HTML.html
>
> "If the document text is well-formed XML, we could omit the html
> parameter or set it to have a false value."
>
> XML parsers are not required to be forgiving to be regarded compliant.
> And much HTML out there is not well formed.

so? once you run it through an HTML-aware parser, the *resulting*
structure is well formed. a site generator -> converter -> xpath approach
is no less reliable than any other HTML-scraping approach.
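A minimal stdlib sketch of that pipeline, assuming the HTML-aware
parser/converter stage has already emitted well-formed XHTML (the markup
and link labels below are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Stand-in for the output of the HTML-aware parsing/cleanup stage:
# every tag is closed, so the document is well-formed XML.
cleaned = """<html><body>
<ul>
  <li><a href="/community">Community</a></li>
  <li><a href="/docs">Documentation</a></li>
</ul>
</body></html>"""

root = ET.fromstring(cleaned)

# ElementTree supports a limited XPath subset via findall()
labels = [a.text for a in root.findall(".//li/a")]
print(labels)  # ['Community', 'Documentation']
```

Once the tree is well formed, the XPath stage never sees the original
mess, which is the point being made above.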
Re: beautifulsoup .vs tidy
Paul Boddie wrote:
> Ravi Teja wrote:
>>
>> 1.) XPath is not a good idea at all with "malformed" HTML or perhaps
>> web pages in general.
>
> import libxml2dom
> import urllib
>
> f = urllib.urlopen("http://wiki.python.org/moin/")
> s = f.read()
> f.close()
>
> # s contains HTML not XML text
> d = libxml2dom.parseString(s, html=1)
>
> # get the community-related links
> for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
>     print label.nodeValue

I wasn't aware that your module does html as well.

> Of course, lxml should be able to do this kind of thing as well. I'd be
> interested to know why this "is not a good idea", though.

No reason that you don't know already.

http://www.boddie.org.uk/python/HTML.html

"If the document text is well-formed XML, we could omit the html
parameter or set it to have a false value."

XML parsers are not required to be forgiving to be regarded compliant.
And much HTML out there is not well formed.
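To make that strictness concrete, here is a quick stdlib illustration
(my own example, not from the thread) of a compliant XML parser rejecting
perfectly ordinary tag-soup HTML:

```python
import xml.etree.ElementTree as ET

# Browsers accept this, but the <li> tags are never closed,
# so it is not well-formed XML.
tag_soup = "<ul><li>one<li>two</ul>"

try:
    ET.fromstring(tag_soup)
    parsed = True
except ET.ParseError as err:
    # A compliant XML parser must reject this rather than guess.
    parsed = False
    print("rejected:", err)
```

This is exactly why raw web pages need an HTML-aware front end before
any XML/XPath tooling gets involved.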
Re: beautifulsoup .vs tidy
bruce wrote:
> that's exactly what i'm trying to accomplish... i've used tidy, but it seems
> to still generate warnings...
>
> initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)
>
> the xpath/libxml functions in the perl app complain regarding the file. my
> thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
> functions are too strict!

Clean HTML is not valid XML. If you want to process the output with an
XML library you'll need to tell Tidy to output XHTML. Then it should be
valid for XML processing.

Of course BeautifulSoup is also a very nice library if you need to
extract some information, but don't necessarily require XML processing
to do it.

-- Matt Good
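The HTML/XHTML difference is easy to see with the stdlib's strict XML
parser (my own illustration; on the command line, Tidy's flag for XHTML
output is `-asxhtml`):

```python
import xml.etree.ElementTree as ET

html_fragment = "<p>line one<br>line two</p>"    # valid HTML: <br> is a void tag
xhtml_fragment = "<p>line one<br/>line two</p>"  # XHTML: every element is closed

try:
    ET.fromstring(html_fragment)
    html_ok = True
except ET.ParseError:
    # the bare <br> is a well-formedness error in XML
    html_ok = False

xhtml_ok = ET.fromstring(xhtml_fragment) is not None
print(html_ok, xhtml_ok)  # False True
```

So the same document that browsers render happily will blow up an XML
toolchain unless Tidy is told to emit XHTML.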
Re: beautifulsoup .vs tidy
Ravi Teja wrote:
>
> 1.) XPath is not a good idea at all with "malformed" HTML or perhaps
> web pages in general.

import libxml2dom
import urllib

f = urllib.urlopen("http://wiki.python.org/moin/")
s = f.read()
f.close()

# s contains HTML not XML text
d = libxml2dom.parseString(s, html=1)

# get the community-related links
for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
    print label.nodeValue

Of course, lxml should be able to do this kind of thing as well. I'd be
interested to know why this "is not a good idea", though.

Paul
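For comparison, the lxml version would look something like this (a
sketch assuming a reasonably modern lxml; the markup is invented, and
libxml2's forgiving HTML parser closes the unclosed `<li>` tags itself):

```python
from lxml import html

# Tag soup: none of the <li> elements are closed.
messy = "<ul><li><a href='/c'>Community</a><li><a href='/d'>Documentation</a></ul>"

# html.fromstring() uses libxml2's recovering HTML parser,
# not the strict XML one, so this never raises on bad markup.
doc = html.fromstring(messy)

labels = doc.xpath("//li/a/text()")
print(labels)  # ['Community', 'Documentation']
```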
Re: beautifulsoup .vs tidy
bruce wrote:
> that's exactly what i'm trying to accomplish... i've used tidy, but it seems
> to still generate warnings...
>
> initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)
>
> the xpath/libxml functions in the perl app complain regarding the file.

what exactly do they complain about ?
RE: beautifulsoup .vs tidy
hi paddy...

that's exactly what i'm trying to accomplish... i've used tidy, but it seems
to still generate warnings...

initFile -> tidy -> cleanFile -> perl app (using xpath/libxml)

the xpath/libxml functions in the perl app complain regarding the file. my
thought is that tidy isn't cleaning enough, or that the perl xpath/libxml
functions are too strict!

which is why i decided to see if anyone on the python side has
experienced/solved this problem..

-bruce

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Paddy
Sent: Saturday, July 01, 2006 1:09 AM
To: python-list@python.org
Subject: Re: beautifulsoup .vs tidy

bruce wrote:
> hi...
>
> never used perl, but i have an issue trying to resolve some html that
> appears to be "dirty/malformed" regarding the overall structure. in
> researching validators, i came across the beautifulsoup app and wanted to
> know if anybody could give me pros/cons of the app as it relates to any of
> the other validation apps...

I'm not too sure of what you are after. You mention tidy in the subject
which made me think that maybe you were trying to generate well-formed
HTML from malformed webpages that nonetheless browsers can interpret.

If that is the case then try HTML tidy:
http://www.w3.org/People/Raggett/tidy/

- Pad.
Re: beautifulsoup .vs tidy
bruce wrote:
> hi...
>
> never used perl, but i have an issue trying to resolve some html that
> appears to be "dirty/malformed" regarding the overall structure. in
> researching validators, i came across the beautifulsoup app and wanted to
> know if anybody could give me pros/cons of the app as it relates to any of
> the other validation apps...

I'm not too sure of what you are after. You mention tidy in the subject
which made me think that maybe you were trying to generate well-formed
HTML from malformed webpages that nonetheless browsers can interpret.

If that is the case then try HTML tidy:
http://www.w3.org/People/Raggett/tidy/

- Pad.
Re: beautifulsoup .vs tidy
bruce wrote:
> hi...
>
> never used perl, but i have an issue trying to resolve some html that
> appears to be "dirty/malformed" regarding the overall structure. in
> researching validators, i came across the beautifulsoup app and wanted to
> know if anybody could give me pros/cons of the app as it relates to any of
> the other validation apps...
>
> the issue i'm facing involves parsing some websites, so i'm trying to
> extract information based on the DOM/XPath functions.. i'm using perl to
> handle the extraction

1.) XPath is not a good idea at all with "malformed" HTML or perhaps
web pages in general.

2.) BeautifulSoup is not a validator but works well with bad HTML. Also
look at Mechanize and ClientForm.

3.) XMLStarlet is a good XML validator
(http://xmlstar.sourceforge.net/). It's not Python but you don't need
to care about the language it is written in.

4.) For a simple HTML validator, just use http://validator.w3.org/
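As a point of reference for 2.), the BeautifulSoup side looks something
like this (sketched against the modern `bs4` package; at the time of
this thread the import was `from BeautifulSoup import BeautifulSoup`,
and the markup here is invented):

```python
from bs4 import BeautifulSoup

# Bad HTML: unclosed <li> tags and no closing </ul>.
messy = "<ul><li><a href='/c'>Community</a><li><a href='/d'>Documentation</a>"

# BeautifulSoup never raises on tag soup; it builds the best tree it can.
soup = BeautifulSoup(messy, "html.parser")

labels = [a.get_text() for a in soup.find_all("a")]
print(labels)  # ['Community', 'Documentation']
```

Note there is no validation step anywhere: BeautifulSoup's whole design
is to accept bad markup, which is the opposite job from a validator.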
beautifulsoup .vs tidy
hi...

never used perl, but i have an issue trying to resolve some html that
appears to be "dirty/malformed" regarding the overall structure. in
researching validators, i came across the beautifulsoup app and wanted to
know if anybody could give me pros/cons of the app as it relates to any of
the other validation apps...

the issue i'm facing involves parsing some websites, so i'm trying to
extract information based on the DOM/XPath functions.. i'm using perl to
handle the extraction

thanks

-bruce
[EMAIL PROTECTED]