Hi Michele,

I've been looking further into your problem, and it looks like the HTML at http://www.mass.gov/eea/agencies/dfg/der/ is malformed. You can see here which part of the HTML is actually reaching extract_links: http://pastebin.com/6kTT5Amt (there is an </html> at the end of it). The page contains 4 html definitions.
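If you want to check this yourself on a saved copy of the page, a rough sketch like the one below counts the opening and closing html tags in the markup. The function name count_html_tags and the sample string are made up for illustration (the sample is not the real page), and it assumes you have the response body as a decoded string.

```python
import re


def count_html_tags(markup):
    """Return (number of <html ...> openings, number of </html> closings)."""
    # \b keeps "<html" from matching longer tag names; "</html" never
    # matches this pattern because of the slash.
    openings = len(re.findall(r"<html\b", markup, flags=re.IGNORECASE))
    closings = len(re.findall(r"</html\s*>", markup, flags=re.IGNORECASE))
    return openings, closings


# Hypothetical, simplified markup with two html definitions in one document.
sample = (
    "<html><head><title>a</title></head><body>"
    '<a id="top"></a>'
    "</html>"
    "<html><body><p>real content</p></body></html>"
)

print(count_html_tags(sample))
```

A well-formed page should give one opening and one closing tag; anything higher suggests several documents were concatenated, which may explain why only a fragment shows up in extract_links.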
Hope this helps,
Kind Regards,
Rocio

On Tue Nov 04 2014 at 8:53:36 PM Michele Coscia <[email protected]> wrote:
>
> By doing some debugging in ipdb I found out that the extract_links
> function in the class LxmlLinkExtractor is not getting the same data I
> see in the scrapy shell. While in the scrapy shell I see the correct data
> inside the <body> tag, when I look at the html variable in extract_links I
> see:
>
> \r\n\t\t<a id="top"></a>\r\n\t\t\t<!-- alert content here -->\t\t\t
>
> I *know* that both the scrapy shell and my script are getting the very
> same data from the server (checked with wireshark). So somewhere in between
> the fetching of the data and the extract_links function, the content of the
> body disappears.
>
> Can someone with knowledge of the source code tell me which function
> calls LxmlLinkExtractor's extract_links?
>
> Thanks!
> Michele C
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
