Hi, I recently posted an answer on Stack Overflow with a way to combine login and CrawlSpider: http://stackoverflow.com/a/22569515/2572383
Feedback is welcome.

/Paul.

On Wednesday, March 26, 2014 10:37:33 PM UTC+1, Karen Oganesyan wrote:
> Hi everyone,
> I ran into the same situation (logging in and then crawling with
> CrawlSpider) and solved it by overriding parse_start_url() (all of the
> code below goes in your-spider-file.py):
>
> # Set the start page where your spider can log in
> start_urls = ["http://forums.website.com"]
>
> # Override parse_start_url() to log in to the website, with a callback
> # to check that everything went fine after the login
> def parse_start_url(self, response):
>     return [FormRequest.from_response(
>         response,
>         formdata={'login': 'myUsername', 'password': 'myPassword'},
>         callback=self.after_login)]
>
> # Check whether the login succeeded; if it did, return a Request for
> # the real start page, from which the spider can start to crawl and
> # parse
> def after_login(self, response):
>     if "Incorrect login or password" in response.body:
>         self.log("### Login failed ###", level=log.ERROR)
>         return
>     else:
>         self.log("### Successfully logged in! ###")
>         return Request('http://website.com/realstartpage.php')
>
> To make this work, don't forget the imports at the beginning of your
> spider file:
>
> from scrapy.http import Request, FormRequest
> from scrapy import log  # needed for log.ERROR
>
> Hope it helps someone.
>
> On Wednesday, July 17, 2013 3:12:07 AM UTC+4, Capi Etheriel wrote:
>> it's documented in 0.17:
>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.CrawlSpider.parse_start_url
>>
>> On Thursday, July 11, 2013 5:02:31 PM UTC-3, Paul Tremberth wrote:
>>> Hi,
>>> CrawlSpider has an overridable method parse_start_url() that could be
>>> used in your case (I think):
>>>
>>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.CrawlSpider.parse_start_url
>>>
>>> It's not mentioned in the docs for 0.16 (the links you provided), but
>>> it's in the code for both 0.16 and 0.17:
>>>
>>> https://github.com/scrapy/scrapy/blob/0.16/scrapy/contrib/spiders/crawl.py
>>>
>>> It's called from CrawlSpider's parse() method when the first URL is
>>> fetched and processed (in particular, the start_urls you define for
>>> your LoginSpider).
>>>
>>> So I would try defining parse_start_url() just as in the LoginSpider
>>> example:
>>>
>>> def parse_start_url(self, response):
>>>     return [FormRequest.from_response(
>>>         response,
>>>         formdata={'username': 'john', 'password': 'secret'},
>>>         callback=self.after_login)]
>>>
>>> *Note: another user in the group recently had issues with this
>>> parse_start_url() method being called several times,*
>>> *so be sure the callback you define in your Rules is NOT parse().*
>>>
>>> Tell us how it goes.
>>>
>>> Paul.
>>>
>>> On Thursday, July 11, 2013 7:48:57 PM UTC+2, Fer wrote:
>>>> Hi everyone!
>>>> I'm trying to mix the LoginSpider example
>>>> <http://doc.scrapy.org/en/0.16/topics/request-response.html#topics-request-response-ref-request-userlogin>
>>>> with CrawlSpider
>>>> <http://doc.scrapy.org/en/0.16/topics/spiders.html#crawlspider-example>,
>>>> but I can't find a way to do it. The idea is to log in first and then
>>>> parse using the rules, but in the LoginSpider example the parse()
>>>> method is overridden, and the CrawlSpider docs say "if you override
>>>> the parse method, the crawl spider will no longer work". I would be
>>>> grateful if you could help me.
