OK, but where? In the CrawlSpider? Should I basically override the parse() function? Can I still use my rule in there, and if so, how?

Thanks!
Michele C
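
P.S. To make the question concrete, this is roughly what I have in mind (a rough sketch, assuming a recent Scrapy and the html5lib package installed; the spider name, allowed domain, and rule pattern are placeholders). If I read the docs right, parse() should not be overridden in a CrawlSpider, because CrawlSpider uses it internally, so the extraction would go in a named Rule callback:

from bs4 import BeautifulSoup
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DerSpider(CrawlSpider):
    # Placeholder names: adapt to the real project.
    name = "der"
    allowed_domains = ["mass.gov"]
    start_urls = ["http://www.mass.gov/eea/agencies/dfg/der/"]

    # The Rule still drives the crawl; extraction happens in the callback,
    # so parse() itself is left alone (CrawlSpider needs it internally).
    rules = (
        Rule(LinkExtractor(allow=r"/eea/"), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Re-parse the raw body with html5lib, which tolerates the
        # malformed markup that trips up lxml.
        soup = BeautifulSoup(response.body, "html5lib")
        yield {
            "url": response.url,
            "title": soup.title.string if soup.title else None,
        }

One thing I am not sure about: the Rule's LinkExtractor presumably still parses the page with lxml internally, so links sitting in the part of the body that lxml drops might still be missed; BeautifulSoup would only help for the extraction done in the callback.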
On Wednesday, November 5, 2014 at 09:55:28 UTC-5, Aru Sahni wrote:

> You can just invoke BeautifulSoup as one normally would and not use
> Scrapy's built-in functionality.
>
> ~A
>
> On Wed, Nov 5, 2014 at 9:51 AM, Michele Coscia <[email protected]> wrote:
>
>> Bingo, that's it, you are great.
>> So it is what comes out of Selector(response) that is the problem,
>> because response contains the entire malformed html (as it should).
>>
>> I tried a little test, feeding the malformed html to BeautifulSoup: the
>> lxml parser still fails, while html5lib parses it correctly (a rough
>> reproduction of this test is sketched below, after the quoted thread).
>> So, the question is: how do I use html5lib's parser instead of lxml in
>> Scrapy? The documentation
>> <http://doc.scrapy.org/en/latest/faq.html#how-does-scrapy-compare-to-beautifulsoup-or-lxml>
>> tells me that "you can easily use BeautifulSoup
>> <http://www.crummy.com/software/BeautifulSoup/> (or lxml
>> <http://lxml.de/>) instead", but it doesn't say how :-)
>>
>> Finally: I'd dare say that this is a bug and it should be reported as
>> such. If any browser and html5lib can parse the page, then so should
>> Scrapy. Do you think I should submit it on the GitHub page?
>>
>> Thanks, you have already been very helpful!
>> Michele C
>>
>> On Wednesday, November 5, 2014 at 06:20:26 UTC-5, Rocío Aramberri wrote:
>>>
>>> Hi Michele,
>>>
>>> I've been investigating your problem further, and it looks like the
>>> html in http://www.mass.gov/eea/agencies/dfg/der/ is malformed. You can
>>> see here what part of the html is actually reaching extract_links:
>>> http://pastebin.com/6kTT5Amt (there is an </html> at the end of it).
>>> The page contains four <html> definitions.
>>>
>>> Hope this helps,
>>> Kind regards,
>>> Rocio
>>>
>>> On Tue Nov 04 2014 at 8:53:36 PM Michele Coscia <[email protected]> wrote:
>>>
>>>> By doing some debugging in ipdb I found out that the extract_links
>>>> function in the class LxmlLinkExtractor is not getting the same data
>>>> I see in the scrapy shell. While in the scrapy shell I see the correct
>>>> data inside the <body> tag, when I look at the html variable in
>>>> extract_links I see:
>>>>
>>>> \r\n\t\t<a id="top"></a>\r\n\t\t\t<!-- alert content here -->\t\t\t
>>>>
>>>> I *know* that both the scrapy shell and my script are getting the very
>>>> same data from the server (checked with Wireshark). So somewhere
>>>> between the fetching of the data and the extract_links function, the
>>>> content of the body disappears.
>>>>
>>>> Can someone with knowledge of the source code tell me which function
>>>> calls LxmlLinkExtractor's extract_links?
>>>>
>>>> Thanks!
>>>> Michele C
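
P.P.S. For the record, the little parser test quoted above could be reproduced along these lines. The malformed sample below is invented for illustration (the real page nests four <html> documents, per the pastebin dump):

from bs4 import BeautifulSoup

# Invented sample mimicking the structure described above: a premature
# </html> followed by a second document holding the useful content.
malformed = b"""
<html><head><title>first</title></head>
<body><a id="top"></a></body>
</html>
<html>
<body><a href="/eea/agencies/dfg/der/">the link we actually want</a></body>
</html>
"""

# "lxml" needs the lxml package, "html5lib" the html5lib package.
for parser in ("lxml", "html5lib"):
    soup = BeautifulSoup(malformed, parser)
    print(parser, "->", [a.get("href") for a in soup.find_all("a")])

# On the real page, lxml dropped everything after the first </html> while
# html5lib recovered the full body; how closely this toy sample reproduces
# that may depend on the lxml/libxml2 version.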
