Re: CrawlSpider fails to follow rule for some websites

Michele Coscia Wed, 05 Nov 2014 06:51:31 -0800

Bingo, that's it, you are great.
So the problem is what comes out of Selector(response), because response 
itself contains the entire malformed html (as it should).


I tried a little test, feeding the malformed html to BeautifulSoup: the lxml 
parser still fails, whereas html5lib parses it correctly. So, the question is: 
how do I use html5lib's parser instead of lxml in Scrapy? The documentation 
<http://doc.scrapy.org/en/latest/faq.html#how-does-scrapy-compare-to-beautifulsoup-or-lxml> 
tells me that "you can easily use BeautifulSoup 
<http://www.crummy.com/software/BeautifulSoup/> (or lxml <http://lxml.de/>) 
instead", but it doesn't say how :-)
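
One way to sidestep lxml for link extraction is to run a more forgiving 
parser over the raw response body. The following is only a minimal sketch 
under stated assumptions: it uses the standard library's html.parser as a 
stand-in for html5lib (with BeautifulSoup and html5lib installed, 
BeautifulSoup(html, "html5lib") would play the same role), the names 
LinkCollector and extract_links are hypothetical helpers and not part of 
Scrapy, and the sample input is made up to imitate a page with more than one 
html definition.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag in the input."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser is a tokenizer: it reports tags as it meets them and
        # does not stop at the first </html>.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute URLs for all links found in a (possibly malformed) page."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up input: two complete <html> blocks in a single document.
MALFORMED = (
    '<html><body><a id="top"></a><a href="page1.html">One</a></body></html>\n'
    '<html><body><a href="/other/page2.html">Two</a></body></html>'
)

links = extract_links(MALFORMED, "http://example.com/section/")
```

In a spider callback, the same function could presumably be fed the decoded 
response body, with one Request yielded per returned URL; that wiring is an 
assumption and is not shown here.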

Finally: I'd dare to say that this is a bug and that it should be reported as 
such. If any browser and html5lib can parse the page, then so should 
Scrapy. Do you think I should submit it on the GitHub page?

Thanks, you have already been very helpful!
Michele C




On Wednesday, 5 November 2014 06:20:26 UTC-5, Rocío Aramberri wrote:
>
> Hi Michele,
>
> I've been investigating your problem further, and it looks like the html at 
> http://www.mass.gov/eea/agencies/dfg/der/ is malformed. You can see here 
> which part of the html is actually reaching extract_links: 
> http://pastebin.com/6kTT5Amt (there is an </html> at the end of it). The 
> page has 4 html definitions.
>
> Hope this helps,
> Kind Regards,
> Rocio
>
> On Tue Nov 04 2014 at 8:53:36 PM Michele Coscia <[email protected]> wrote:
>
>>
>> By doing some debugging in ipdb I found out that the extract_links 
>> function in the class LxmlLinkExtractor is not getting the same data I 
>> see in the scrapy shell. While in the scrapy shell I see the correct data 
>> inside the <body> tag, when I look at the html variable in extract_links 
>> I see:
>>
>> \r\n\t\t<a id="top"></a>\r\n\t\t\t<!-- alert content here -->\t\t\t
>>
>> I *know* that both the scrapy shell and my script are getting the very 
>> same data from the server (checked with wireshark). So somewhere in between 
>> the fetching of the data and the extract_links function, the content of the 
>> body disappears.
>>
>> Can someone who knows the source code tell me which function 
>> calls LxmlLinkExtractor's extract_links?
>>
>> Thanks!
>> Michele C
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
>>
>

