Re: CrawlSpider fails to follow rule for some websites

Rocío Aramberri Wed, 05 Nov 2014 03:20:41 -0800

Hi Michele,

I looked further into your problem, and it looks like the HTML at
http://www.mass.gov/eea/agencies/dfg/der/ is malformed. You can see here
which part of the HTML actually reaches extract_links:
http://pastebin.com/6kTT5Amt (note the </html> at the end of it). The
page has 4 html definitions, which would explain why extract_links only
sees this first fragment.
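
A quick way to check a page for this kind of problem is to count the
<html> tags in the raw body. This is only a minimal sketch: the helper
count_html_tags is a made-up name (it is not part of Scrapy), the sample
markup is invented for illustration, and it uses plain regular expressions
rather than a real parser, so it gives a rough count only.

```python
import re

# Hypothetical helper, not part of Scrapy: counts opening and closing
# <html> tags in a raw page body given as text.
_HTML_OPEN = re.compile(r"<html[\s>]", re.IGNORECASE)
_HTML_CLOSE = re.compile(r"</html\s*>", re.IGNORECASE)


def count_html_tags(markup):
    """Return (number of opening <html> tags, number of closing </html> tags)."""
    return len(_HTML_OPEN.findall(markup)), len(_HTML_CLOSE.findall(markup))


# Invented sample: two complete documents concatenated into one body.
sample = (
    '<html><body><a id="top"></a></body></html>'
    '<html><body><a href="/next">next</a></body></html>'
)

opens, closes = count_html_tags(sample)
print(opens, closes)
```

A well-formed page should give at most one of each. Anything higher
suggests the parser may stop at the first closing </html> and drop the
rest of the body.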


Hope this helps,
Kind Regards,
Rocio

On Tue Nov 04 2014 at 8:53:36 PM Michele Coscia <[email protected]>
wrote:

>
> By doing some debugging in ipdb I found out that the extract_links
> function in the class LxmlLinkExtractor is not getting the same data I
> see in the scrapy shell. While in the scrapy shell I see the correct data
> inside the <body> tag, when I look at the html variable in extract_links I
> see:
>
> \r\n\t\t<a id="top"></a>\r\n\t\t\t<!-- alert content here -->\t\t\t
>
> I *know* that both the scrapy shell and my script are getting the very
> same data from the server (checked with wireshark). So somewhere in between
> the fetching of the data and the extract_links function, the content of the
> body disappears.
>
> Can someone who knows the source code tell me which function
> calls LxmlLinkExtractor's extract_links?
>
> Thanks!
> Michele C
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
>

