By debugging in ipdb, I found that the extract_links method of the LxmlLinkExtractor class is not receiving the same data I see in the scrapy shell. In the scrapy shell I see the correct data inside the <body> tag, but when I look at the html variable in extract_links I see:
\r\n\t\t<a id="top"></a>\r\n\t\t\t<!-- alert content here -->\t\t\t

I *know* that both the scrapy shell and my script are getting the very same data from the server (I checked with Wireshark). So somewhere between fetching the data and the call to extract_links, the content of the body disappears.

Can someone who knows the source code tell me which function calls LxmlLinkExtractor's extract_links?

Thanks!
Michele C
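
One way to find out which function calls extract_links, without reading the source, is to record the call stack each time the method runs. Below is a minimal sketch of that idea using only the standard library. FakeLinkExtractor, fake_spider_callback and log_callers are hypothetical stand-ins made up for this illustration; in a real project the target would presumably be LxmlLinkExtractor.extract_links, and the recorded names would be Scrapy's own functions. Inside ipdb, the `where` (or `w`) command prints the same kind of stack interactively.

```python
import functools
import traceback


def log_callers(cls, method_name, log):
    """Wrap cls.method_name so that each call appends the caller chain to log."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        # extract_stack() ends with this wrapper's own frame; drop it so the
        # last entry is the function that called the wrapped method.
        stack = traceback.extract_stack()[:-1]
        log.append([frame.name for frame in stack])
        return original(*args, **kwargs)

    setattr(cls, method_name, wrapper)


# Hypothetical stand-in for the real link extractor class (assumption).
class FakeLinkExtractor:
    def extract_links(self, response):
        return []


# Hypothetical stand-in for whatever code calls extract_links (assumption).
def fake_spider_callback(response):
    return FakeLinkExtractor().extract_links(response)


calls = []
log_callers(FakeLinkExtractor, "extract_links", calls)
result = fake_spider_callback("<html></html>")
print(calls[0][-1])  # prints: fake_spider_callback
```

Each entry in calls is the list of function names on the stack, outermost first, so the last name is the direct caller and the ones before it show how the data travelled to that point.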
