What if I do need to use CrawlSpider? After all, I *was* using the Rule
functionality, plus several other things that my script needs and that I
cannot pass as simple arguments to "scrapy crawl xxx"?

Thanks!
Michele C
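One way to keep CrawlSpider despite the parser problem discussed below: Rule
only requires an object with an extract_links(response) method that returns
scrapy.link.Link objects, so a custom extractor that re-parses the body with
html5lib can be dropped in. A minimal sketch, assuming beautifulsoup4 and
html5lib are installed; the Html5libLinkExtractor class, the allow regex, and
the spider around it are hypothetical, and the import paths match Scrapy as of
late 2014 (newer releases moved the spider classes to scrapy.spiders):

    import re
    import urlparse  # Python 2, as Scrapy was Python-2-only at the time

    from bs4 import BeautifulSoup  # assumes beautifulsoup4 + html5lib installed
    from scrapy.link import Link
    from scrapy.contrib.spiders import CrawlSpider, Rule  # scrapy.spiders in newer releases

    class Html5libLinkExtractor(object):
        # Hypothetical drop-in for LxmlLinkExtractor: Rule only needs an
        # object with extract_links(response) returning Link objects.
        def __init__(self, allow=None):
            self.allow = re.compile(allow) if allow else None

        def extract_links(self, response):
            soup = BeautifulSoup(response.body, "html5lib")
            links = []
            for a in soup.find_all("a", href=True):
                url = urlparse.urljoin(response.url, a["href"])
                if self.allow is None or self.allow.search(url):
                    links.append(Link(url=url, text=a.get_text().strip()))
            return links

    class DerCrawlSpider(CrawlSpider):
        name = "der_crawl"  # hypothetical; start URL taken from the thread below
        start_urls = ["http://www.mass.gov/eea/agencies/dfg/der/"]
        rules = (
            Rule(Html5libLinkExtractor(allow=r"/eea/"), callback="parse_page", follow=True),
        )

        def parse_page(self, response):
            pass  # item extraction goes here

The trade-off is losing LxmlLinkExtractor's other options (deny,
restrict_xpaths, canonicalization and so on) unless you reimplement them.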
On Wednesday, 5 November 2014 at 10:15:31 UTC-5, Travis Leleu wrote:
>
> I would recommend not using the CrawlSpider class if you're not using the
> Rule functionality. Just use a normal scrapy.Spider and override the
> parse() method like you said (then you obviously have to build the logic
> to identify the links to follow).
>
> When you're writing your parse function, one neat thing: you can yield
> items, and they get processed through the item pipeline, or you can yield
> requests, and they get added to the request queue.
>
> On Wed, Nov 5, 2014 at 7:02 AM, Michele Coscia <[email protected]> wrote:
>
>> Ok, but where? In the CrawlSpider? Should I basically override the
>> parse() function? Can I still use my rule in there, and if so, how?
>> Thanks!
>> Michele C
>>
>> On Wednesday, 5 November 2014 at 09:55:28 UTC-5, Aru Sahni wrote:
>>>
>>> You can just invoke BeautifulSoup as one normally would and not use
>>> Scrapy's built-in functionality.
>>>
>>> ~A
>>>
>>> On Wed, Nov 5, 2014 at 9:51 AM, Michele Coscia <[email protected]> wrote:
>>>
>>>> Bingo, that's it, you are great.
>>>> So it is what comes out of Selector(response) that is the problem,
>>>> because response contains the entire malformed html (as it should).
>>>>
>>>> I tried a little test, feeding the malformed html to BeautifulSoup: the
>>>> lxml parser still fails, while html5lib parses it correctly. So the
>>>> question is: how do I use html5lib's parser instead of lxml in Scrapy?
>>>> The documentation
>>>> <http://doc.scrapy.org/en/latest/faq.html#how-does-scrapy-compare-to-beautifulsoup-or-lxml>
>>>> tells me that "you can easily use BeautifulSoup
>>>> <http://www.crummy.com/software/BeautifulSoup/> (or lxml
>>>> <http://lxml.de/>) instead", but it doesn't say how :-)
>>>>
>>>> Finally: I'd dare say that this is a bug and it should be reported as
>>>> such. If every browser and html5lib can parse the page, then so should
>>>> Scrapy. Do you think I should submit it on the Github page?
>>>>
>>>> Thanks, you have been already very helpful!
>>>> Michele C
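For reference, the little test Michele describes above can be reproduced
outside Scrapy. A sketch, assuming the raw response body was saved to a local
file (page.html is a hypothetical name), that simply reports what each
BeautifulSoup backend recovers without presuming either succeeds:

    from bs4 import BeautifulSoup  # requires beautifulsoup4, plus lxml and html5lib

    # page.html: hypothetical local copy of the raw body, e.g. saved in the
    # scrapy shell with open("page.html", "wb").write(response.body)
    with open("page.html", "rb") as f:
        html = f.read()

    for parser in ("lxml", "html5lib"):
        soup = BeautifulSoup(html, parser)
        hrefs = [a["href"] for a in soup.find_all("a", href=True)]
        print("%s recovered %d links" % (parser, len(hrefs)))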
>>>> On Wednesday, 5 November 2014 at 06:20:26 UTC-5, Rocío Aramberri wrote:
>>>>>
>>>>> Hi Michele,
>>>>>
>>>>> I've been investigating your problem further, and it looks like the
>>>>> html at http://www.mass.gov/eea/agencies/dfg/der/ is malformed. You
>>>>> can see here what part of the html actually reaches extract_links:
>>>>> http://pastebin.com/6kTT5Amt (there is an </html> at the end of it).
>>>>> This page contains 4 html documents.
>>>>>
>>>>> Hope this helps,
>>>>> Kind Regards,
>>>>> Rocio
>>>>>
>>>>> On Tue Nov 04 2014 at 8:53:36 PM Michele Coscia <[email protected]> wrote:
>>>>>
>>>>>> By doing some debugging in ipdb I found out that the extract_links
>>>>>> function in the LxmlLinkExtractor class is not getting the same data
>>>>>> I see in the scrapy shell. While in the scrapy shell I see the
>>>>>> correct data inside the <body> tag, when I look at the html variable
>>>>>> in extract_links I see:
>>>>>>
>>>>>> \r\n\t\t<a id="top"></a>\r\n\t\t\t<!-- alert content here -->\t\t\t
>>>>>>
>>>>>> I *know* that both the scrapy shell and my script are getting the
>>>>>> very same data from the server (checked with wireshark). So somewhere
>>>>>> between the fetching of the data and the extract_links function, the
>>>>>> content of the body disappears.
>>>>>>
>>>>>> Can someone with knowledge of the source code tell me which function
>>>>>> calls LxmlLinkExtractor's extract_links?
>>>>>>
>>>>>> Thanks!
>>>>>> Michele C
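To answer the quoted question directly: in Scrapy's CrawlSpider the private
method _requests_to_follow is what calls rule.link_extractor.extract_links(response)
for each rule. A sketch of a breakpoint hook there, using ipdb as in the
thread; DebugSpider is hypothetical and the import paths are the late-2014
ones (newer releases use scrapy.spiders and scrapy.linkextractors):

    import ipdb  # pip install ipdb

    from scrapy.contrib.linkextractors import LinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule

    class DebugSpider(CrawlSpider):
        name = "debug"  # hypothetical
        start_urls = ["http://www.mass.gov/eea/agencies/dfg/der/"]
        rules = (Rule(LinkExtractor(), follow=True),)

        def _requests_to_follow(self, response):
            # CrawlSpider._requests_to_follow is the caller of
            # extract_links; break here to inspect response.body just
            # before link extraction happens.
            ipdb.set_trace()
            return super(DebugSpider, self)._requests_to_follow(response)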

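Putting the thread's two suggestions together (invoke BeautifulSoup as one
normally would, inside a plain scrapy.Spider whose overridden parse() yields
both items and requests), a minimal end-to-end sketch; PageItem, the spider
name, and the "/eea/" filter are hypothetical stand-ins for the original Rule:

    import urlparse  # Python 2, matching Scrapy at the time of this thread

    import scrapy
    from bs4 import BeautifulSoup  # assumes beautifulsoup4 + html5lib installed

    class PageItem(scrapy.Item):
        url = scrapy.Field()
        title = scrapy.Field()

    class DerSpider(scrapy.Spider):
        name = "der"  # hypothetical; start URL taken from the thread
        start_urls = ["http://www.mass.gov/eea/agencies/dfg/der/"]

        def parse(self, response):
            # Re-parse the raw body with html5lib, bypassing Scrapy's
            # lxml-based selectors entirely.
            soup = BeautifulSoup(response.body, "html5lib")

            # Yielded items go through the item pipeline...
            yield PageItem(url=response.url,
                           title=soup.title.get_text() if soup.title else None)

            # ...and yielded requests are added to the request queue.
            for a in soup.find_all("a", href=True):
                url = urlparse.urljoin(response.url, a["href"])
                if "/eea/" in url:  # hypothetical stand-in for the Rule's allow pattern
                    yield scrapy.Request(url, callback=self.parse)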