Re: CrawlSpider fails to follow rule for some websites

Michele Coscia Tue, 04 Nov 2014 14:54:07 -0800

After some debugging in ipdb I found that the extract_links function in the 
LxmlLinkExtractor class is not getting the same data I see in the scrapy 
shell. In the scrapy shell I see the correct data inside the <body> tag, but 
when I look at the html variable in extract_links I see only:


\r\n\t\t<a id="top"></a>\r\n\t\t\t<!-- alert content here -->\t\t\t

I *know* that both the scrapy shell and my script are getting the very same 
data from the server (checked with wireshark). So somewhere in between the 
fetching of the data and the extract_links function, the content of the 
body disappears.
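
One way to narrow down where the body gets lost might be to log a short 
fingerprint of the HTML at each stage (in the scrapy shell, in the spider 
callback, and just inside extract_links) and see where the values first 
differ. A minimal sketch, assuming only the standard library; 
body_fingerprint is a hypothetical helper, not part of Scrapy:

```python
import hashlib


def body_fingerprint(html):
    """Return (length in bytes, short SHA-1 prefix) for a chunk of HTML.

    Hypothetical debugging helper, not a Scrapy API. Accepts str or bytes
    so it can be called on whatever each stage happens to hold.
    """
    if isinstance(html, str):
        html = html.encode("utf-8")
    return len(html), hashlib.sha1(html).hexdigest()[:12]


if __name__ == "__main__":
    # Illustrative inputs only: a fuller body versus a truncated fragment.
    full = '<body><a id="top"></a><p>content</p></body>'
    truncated = '\r\n\t\t<a id="top"></a>'
    print(body_fingerprint(full))
    print(body_fingerprint(truncated))
```

Printing the same fingerprint from each stage should show the first point 
at which the content differs from what the server sent.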

Can someone who knows the source code tell me which function calls 
LxmlLinkExtractor's extract_links?
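
For reference, the caller can also be read off the call stack while stopped 
inside extract_links: in ipdb the "where" (or "w") command prints it. The 
same idea as a minimal standard-library sketch; inner_stub and outer_stub 
are hypothetical stand-ins, not Scrapy functions:

```python
import traceback


def caller_names(limit=5):
    """Return function names on the current call stack, innermost first."""
    # extract_stack() lists frames oldest to newest; the last entry is
    # caller_names itself, so drop it before reversing.
    stack = traceback.extract_stack()[:-1]
    return [frame.name for frame in reversed(stack)][:limit]


def inner_stub(html):
    # Hypothetical stand-in for the function being debugged.
    return caller_names()


def outer_stub(html):
    # Hypothetical stand-in for whatever calls it.
    return inner_stub(html)


if __name__ == "__main__":
    print(outer_stub("<html></html>"))
```

The second entry in the returned list is the immediate caller, which is the 
function I am trying to identify here.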

Thanks!
Michele C

