Re: XML::LibXML and HTML (in >=v1.67)

2009-04-04 Thread Simon Cozens
Tatsuhiko Miyagawa wrote: > [Web::Scraper] > The github master version just took out the big fat warning a few > days ago, ready to be shipped to CPAN soon with more POD docs :) I also just finished writing an application which uses it a lot, so I'm planning to write an article about how I did

Re: XML::LibXML and HTML (in >=v1.67)

2009-04-03 Thread Robin Berjon
On Apr 1, 2009, at 09:11 , mirod wrote: The only problem I found was with tags like '' which gets output by the as_XML method as '', which is not quite well-formed XML. This doesn't prevent you from using XPath on it with HTML::TreeBuilder::XPath though. It's more than "not quite well-form

Re: XML::LibXML and HTML (in >=v1.67)

2009-04-01 Thread Tatsuhiko Miyagawa
On Wed, Apr 1, 2009 at 6:21 PM, Toby Wintermute wrote: > Thanks, Web::Scraper looks quite neat. > However I want to avoid applications breaking on random CPAN module > upgrades (as just happened with the XML::LibXML upgrade yesterday), so > I might steer clear of it until it loses the big, bold w

Re: XML::LibXML and HTML (in >=v1.67)

2009-04-01 Thread Toby Wintermute
2009/4/2 mirod : > Tatsuhiko Miyagawa wrote: >> >> On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute >> wrote: >>> >>> The problem occurs when the html contains (the commonly used) & symbol >>> within attributes, such as: >>> [snip] > > Indeed when I tested the various ways to get XML from HTML,

Re: XML::LibXML and HTML (in >=v1.67)

2009-04-01 Thread Toby Wintermute
2009/4/1 Tatsuhiko Miyagawa : > On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute wrote: >> The problem occurs when the html contains (the commonly used) & symbol >> within attributes, such as: >> >> >> I know that really one should escape the ampersand in those >> circumstances, however real-wor

Re: XML::LibXML and HTML (in >=v1.67)

2009-04-01 Thread Toby Wintermute
2009/4/1 Tatsuhiko Miyagawa : > On Wed, Apr 1, 2009 at 2:53 AM, Dave Cross wrote: >>> I know that really one should escape the ampersand in those >>> circumstances, however real-world web-pages rarely do this.. And this >>> behaviour was tolerated in XML::LibXML 1.66, just not subsequent >>> versi

Re: XML::LibXML and HTML (in >=v1.67)

2009-04-01 Thread mirod
Tatsuhiko Miyagawa wrote: On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute wrote: The problem occurs when the html contains (the commonly used) & symbol within attributes, such as: I know that really one should escape the ampersand in those circumstances, however real-world web-pages rarely

Re: XML::LibXML and HTML (in >=v1.67)

2009-04-01 Thread Pedro Figueiredo
On 1 Apr 2009, at 06:45, Toby Wintermute wrote:
> Alternatively.. what do YOU use to parse real-world websites that are often not totally valid?

If it's a quick hack I'll use HTML::Tidy like so:

    my $tidy = HTML::Tidy->new({
        output_xhtml     => 1,
        numeric_entities => 1,
    });
    $tidy->ignore(

Re: XML::LibXML and HTML (in >=v1.67)

2009-04-01 Thread peter
Quoting Dave Cross : Toby Wintermute wrote: What you're trying to parse isn't XML. Therefore you shouldn't expect to be able to parse it with an XML parser. Alternatively.. what do YOU use to parse real-world websites that are often not totally valid? A similar problem is when writing an XML e

Re: XML::LibXML and HTML (in >=v1.67)

2009-04-01 Thread Paul Makepeace
On Wed, Apr 1, 2009 at 10:53 AM, Dave Cross wrote: > Or, alternatively, you could try the (badly named) XML::Liberal which parses > stuff that isn't really XML. This module has just been re-released as XML::Liberal::Neo with the explicit design goal of fostering the development of XML-like forma

Re: XML::LibXML and HTML (in >=v1.67)

2009-04-01 Thread Tatsuhiko Miyagawa
On Wed, Apr 1, 2009 at 2:53 AM, Dave Cross wrote: >> I know that really one should escape the ampersand in those >> circumstances, however real-world web-pages rarely do this.. And this >> behaviour was tolerated in XML::LibXML 1.66, just not subsequent >> versions.. but eh, maybe it's just the wa

Re: XML::LibXML and HTML (in >=v1.67)

2009-04-01 Thread Tatsuhiko Miyagawa
On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute wrote: > The problem occurs when the html contains (the commonly used) & symbol > within attributes, such as: > > > I know that really one should escape the ampersand in those > circumstances, however real-world web-pages rarely do this.. And this

Re: XML::LibXML and HTML (in >=v1.67)

2009-04-01 Thread Dave Cross
Toby Wintermute wrote: I know that really one should escape the ampersand in those circumstances, however real-world web-pages rarely do this.. And this behaviour was tolerated in XML::LibXML 1.66, just not subsequent versions.. but eh, maybe it's just the way I'm calling the parser? Sounds li

Re: XML::LibXML and HTML (in >=v1.67)

2009-04-01 Thread Peter Corlett
On Wed, Apr 01, 2009 at 04:45:28PM +1100, Toby Wintermute wrote: [...] > I know that really one should escape the ampersand in those circumstances, > however real-world web-pages rarely do this.. And this behaviour was > tolerated in XML::LibXML 1.66, just not subsequent versions.. but eh, > maybe

XML::LibXML and HTML (in >=v1.67)

2009-03-31 Thread Toby Wintermute
Hi, I've been using XML::LibXML in the back-end of rea-toys[1] to scrape a certain website for a while now, but noticed it all broke down when I upgraded XML::LibXML from 1.66 to 1.69, and after some quick testing I narrowed the change down to being between version 1.66 and 1.67. My first instinct
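For readers following the thread, here is a minimal sketch of the kind of parse being discussed: HTML with an unescaped & inside an attribute, handed to XML::LibXML's HTML parser with recovery switched on. The helper name href_values and the sample markup are made up for illustration, and the thread excerpts do not confirm whether recovery mode avoids the breakage reported for 1.67 and later, so treat this as a starting point for testing rather than a known fix.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not from the thread): return the href values
# of all <a> elements found in an HTML string.
sub href_values {
    my ($html) = @_;

    my $parser = XML::LibXML->new;

    # Ask libxml2 to recover from markup errors instead of dying,
    # and to do so without printing warnings.
    $parser->recover_silently(1);

    my $doc = $parser->parse_html_string($html);

    # Collect the attribute nodes and return their string values.
    return map { $_->value } $doc->findnodes('//a/@href');
}

# Made-up sample: the "&" in the query string is not escaped as "&amp;".
my $html = '<html><body>'
         . '<a href="http://example.com/?a=1&b=2">link</a>'
         . '</body></html>';

print "$_\n" for href_values($html);
```

Running the same script under the old and the new XML::LibXML version is a quick way to see whether the difference shows up with a given page.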