Hi, I recently posted an answer on Stack Overflow with a way to combine login and CrawlSpider: http://stackoverflow.com/a/22569515/2572383
Feedback is welcome.

/Paul.

On Wednesday, March 26, 2014 10:37:33 PM UTC+1, Karen Oganesyan wrote:
> Hi everyone,
> I ran into the same situation (logging in and then crawling with
> CrawlSpider) and solved it by overriding parse_start_url() (all of the
> code below goes in your-spider-file.py):
>
> # Set the start page where your spider can log in
> start_urls = ["http://forums.website.com"]
>
> # Override parse_start_url() to log in to the website, with a callback
> # to check that everything went fine after the login
> def parse_start_url(self, response):
>     return [FormRequest.from_response(
>         response,
>         formdata={'login': 'myUsername', 'password': 'myPassword'},
>         callback=self.after_login)]
>
> # Check whether the login succeeded; if it did, return a Request for
> # the real start page, from which the spider can start to crawl and
> # parse
> def after_login(self, response):
>     if "Incorrect login or password" in response.body:
>         self.log("### Login failed ###", level=log.ERROR)
>         return
>     else:
>         self.log("### Successfully logged in! ###")
>         return Request('http://website.com/realstartpage.php')
>
> To make this work, don't forget the imports at the beginning of your
> spider file:
>
> from scrapy.http import Request, FormRequest
> from scrapy import log  # needed for log.ERROR
>
> Hope it helps someone.
>
> On Wednesday, July 17, 2013 3:12:07 AM UTC+4, Capi Etheriel wrote:
>> it's documented in 0.17:
>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.CrawlSpider.parse_start_url
>>
>> On Thursday, July 11, 2013 5:02:31 PM UTC-3, Paul Tremberth wrote:
>>> Hi,
>>> CrawlSpider has an overridable method parse_start_url() that could be
>>> used in your case (I think):
>>>
>>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.CrawlSpider.parse_start_url
>>>
>>> It's not mentioned in the docs for 0.16 (the links you provided), but
>>> it's in the code for both 0.16 and 0.17:
>>>
>>> https://github.com/scrapy/scrapy/blob/0.16/scrapy/contrib/spiders/crawl.py
>>>
>>> It's called from CrawlSpider's parse() method when the first URL is
>>> fetched and processed (in particular, the start_urls you define for
>>> your LoginSpider).
>>>
>>> So I would try defining parse_start_url() just as in the LoginSpider
>>> example:
>>>
>>> def parse_start_url(self, response):
>>>     return [FormRequest.from_response(
>>>         response,
>>>         formdata={'username': 'john', 'password': 'secret'},
>>>         callback=self.after_login)]
>>>
>>> *Note: another user in the group recently had issues with this
>>> parse_start_url() method being called several times,*
>>> *so be sure the callback you define in your Rules is NOT parse().*
>>>
>>> Tell us how it goes.
>>>
>>> Paul.
>>>
>>> On Thursday, July 11, 2013 7:48:57 PM UTC+2, Fer wrote:
>>>> Hi everyone!
>>>> I'm trying to mix the LoginSpider example
>>>> <http://doc.scrapy.org/en/0.16/topics/request-response.html#topics-request-response-ref-request-userlogin>
>>>> with CrawlSpider
>>>> <http://doc.scrapy.org/en/0.16/topics/spiders.html#crawlspider-example>,
>>>> but I can't find a way to do it. The idea is to log in first and then
>>>> parse using the rules, but in the LoginSpider example the parse()
>>>> method is overridden, and the CrawlSpider docs say "if you override
>>>> the parse method, the crawl spider will no longer work". I would be
>>>> grateful if you could help me.
