I'm trying to scrape some data from Amazon Mechanical Turk. My parser can successfully crawl the first few pages of results, but viewing the rest requires authentication. Amazon seems to use session cookies to identify users. I've searched around and tried various ways to log in, but all of them fail: every time I get an `ap_error_page_cookieless_message` saying 'To continue shopping at Amazon.com, please enable cookies in your Web browser.'
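To tell this particular failure apart from other login problems, the response body could be checked for that marker. This is only a minimal sketch: it assumes the marker string appears verbatim in the returned HTML, and the names `COOKIELESS_MARKER` and `is_cookieless_error` are hypothetical, not part of Scrapy.

```python
# Hypothetical helper: detect the "please enable cookies" error page.
# Assumption: the marker string appears verbatim in the page HTML.
COOKIELESS_MARKER = "ap_error_page_cookieless_message"


def is_cookieless_error(body):
    """Return True if the page body contains the cookieless-error marker."""
    # The response body may be bytes, so decode it before searching.
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    return COOKIELESS_MARKER in body
```

Called on the response body inside the login callback, it would show whether the login failed because of missing cookies or for some other reason.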
I've enabled cookies in `settings.py`, and after setting `COOKIES_DEBUG = True` I can see that cookies are indeed being passed around.

```
DEBUG: Received cookies from: <302 https://www.mturk.com/mturk/beginsignin>
Set-Cookie: mtuid=52890b4443158d8321eaaa29cb43b1; Domain=www.mturk.com; Path=/
Set-Cookie: worker_state=VFoyMHdlVTBGMWVMWVczZVZLaTRqMVpTN2Z3PTIwMTQwNzAzMDAzM1VzZXIudHVya1NlY3VyZX50cnVlJQ--; Expires=Fri, 03-Jul-2015 07:13:00 GMT; Secure
```

Here's what I have so far. For convenience, I've modified InitSpider to inherit from CrawlSpider instead of BaseSpider. The problem is in the login step, so I don't think it matters which spider I'm using.

```python
# Other imports omitted (Rule, LinkExtractor, and my modified InitSpider).
from scrapy.http import FormRequest, Request


class MySpider(InitSpider):  # copy code from CrawlSpider to inherit from it
    name = "AMT"
    allowed_domains = ["mturk.com", "amazon.com"]
    start_urls = ['https://www.mturk.com/mturk/viewhits?searchWords=&selectedSearchType=hitgroups&sortType=Title%3A0&pageNumber=1&searchSpec=HITGroupSearch%23T%232%2310%23-1%23T%23%21%23%21Title%210%21%23%21']
    login_page = "https://www.mturk.com/mturk/beginsignin"
    formdata = {'email': '123atfoo.com', 'create': '0', 'password': 'bar'}
    rules = (
        Rule(LinkExtractor(allow=('pageNumber=', 'findhits?')), callback='parse_page'),
        Rule(LinkExtractor(allow=('checkregistration',))),
        # Rule(LinkExtractor(allow=('signin?', ),deny=('FORGOT','forgot', )), callback='login',follow=True),
    )

    # override initial behavior
    def init_request(self):
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # send login request
        return FormRequest.from_response(response, formdata=self.formdata, callback=self.check_login_response)

    def check_login_response(self, response):
        # check login status
        if "Welcome" in response.body:
            self.log("\n\n\nSuccessfully logged in.\n\n\n")
        else:
            self.log("\n\n\nLogin failed\n\n\n")
        # start crawling anyway, which leads to start_urls
        return self.initialized()

    def parse_page(self, response):
        # my parse
        pass
```

Any suggestions on how to log in to Amazon successfully?
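For reference, the cookie configuration described above amounts to roughly the following excerpt of `settings.py`. This is a sketch showing only the two cookie-related Scrapy settings; the rest of the file is assumed to be unchanged.

```python
# settings.py (excerpt): cookie-related settings only
COOKIES_ENABLED = True  # enables Scrapy's cookie middleware (this is the default)
COOKIES_DEBUG = True    # log the cookies sent in requests and received in responses
```

With `COOKIES_DEBUG` on, the `Received cookies from` lines appear in the crawl log at DEBUG level.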
Thanks
