Tatsuhiko Miyagawa wrote:
> [Web::Scraper]
> The github master version just took out the big fat warning a few
> days ago, ready to be shipped to CPAN soon with more POD docs :)
I also just finished writing an application which uses it a lot, so I'm
planning to write an article about how I did
On Apr 1, 2009, at 09:11 , mirod wrote:
The only problem I found was with tags like '' which gets
output by the as_XML method as '', which is not quite
well-formed XML. This doesn't prevent you from using XPath on it
with HTML::TreeBuilder::XPath though.
It's more than "not quite well-form
On Wed, Apr 1, 2009 at 6:21 PM, Toby Wintermute wrote:
> Thanks, Web::Scraper looks quite neat.
> However I want to avoid applications breaking on random CPAN module
> upgrades (as just happened with the XML::LibXML upgrade yesterday), so
> I might steer clear of it until it loses the big, bold w
2009/4/2 mirod :
> Tatsuhiko Miyagawa wrote:
>>
>> On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute
>> wrote:
>>>
>>> The problem occurs when the html contains (the commonly used) & symbol
>>> within attributes, such as:
>>>
[snip]
>
> Indeed when I tested the various ways to get XML from HTML,
2009/4/1 Tatsuhiko Miyagawa :
> On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute wrote:
>> The problem occurs when the html contains (the commonly used) & symbol
>> within attributes, such as:
>>
>>
>> I know that really one should escape the ampersand in those
>> circumstances, however real-wor
2009/4/1 Tatsuhiko Miyagawa :
> On Wed, Apr 1, 2009 at 2:53 AM, Dave Cross wrote:
>>> I know that really one should escape the ampersand in those
>>> circumstances, however real-world web-pages rarely do this.. And this
>>> behaviour was tolerated in XML::LibXML 1.66, just not subsequent
>>> versi
Tatsuhiko Miyagawa wrote:
On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute wrote:
The problem occurs when the html contains (the commonly used) & symbol
within attributes, such as:
I know that really one should escape the ampersand in those
circumstances, however real-world web-pages rarely
On 1 Apr 2009, at 06:45, Toby Wintermute wrote:
Alternatively.. what do YOU use to parse real-world websites that are
often not totally valid?
If it's a quick hack I'll use HTML::Tidy like so:
my $tidy = HTML::Tidy->new({
    output_xhtml     => 1,
    numeric_entities => 1,
});
$tidy->ignore(
Quoting Dave Cross :
Toby Wintermute wrote:
What you're trying to parse isn't XML. Therefore you shouldn't expect
to be able to parse it with an XML parser.
Alternatively.. what do YOU use to parse real-world websites that are
often not totally valid?
A similar problem is when writing an XML e
On Wed, Apr 1, 2009 at 10:53 AM, Dave Cross wrote:
> Or, alternatively, you could try the (badly named) XML::Liberal which parses
> stuff that isn't really XML.
This module has just been re-released as XML::Liberal::Neo with the
explicit design goal of fostering the development of XML-like forma
On Wed, Apr 1, 2009 at 2:53 AM, Dave Cross wrote:
>> I know that really one should escape the ampersand in those
>> circumstances, however real-world web-pages rarely do this.. And this
>> behaviour was tolerated in XML::LibXML 1.66, just not subsequent
>> versions.. but eh, maybe it's just the wa
On Tue, Mar 31, 2009 at 10:45 PM, Toby Wintermute wrote:
> The problem occurs when the html contains (the commonly used) & symbol
> within attributes, such as:
>
>
> I know that really one should escape the ampersand in those
> circumstances, however real-world web-pages rarely do this.. And this
Toby Wintermute wrote:
I know that really one should escape the ampersand in those
circumstances, however real-world web-pages rarely do this.. And this
behaviour was tolerated in XML::LibXML 1.66, just not subsequent
versions.. but eh, maybe it's just the way I'm calling the parser?
Sounds li
On Wed, Apr 01, 2009 at 04:45:28PM +1100, Toby Wintermute wrote:
[...]
> I know that really one should escape the ampersand in those circumstances,
> however real-world web-pages rarely do this.. And this behaviour was
> tolerated in XML::LibXML 1.66, just not subsequent versions.. but eh,
> maybe
Hi,
I've been using XML::LibXML in the back-end of rea-toys[1] to scrape a
certain website for a while now, but noticed it all broke down when I
upgraded XML::LibXML from 1.66 to 1.69, and after some quick testing I
narrowed the change down to being between version 1.66 and 1.67.
My first instinct