I'm trying to scrape some data from Amazon Mechanical Turk. My parser can 
successfully crawl the first few pages of results, but the rest require 
authentication. It seems Amazon uses session cookies to identify users. 
I've searched around and tried various ways to log in, but they all failed. 
Every time, I get an `ap_error_page_cookieless_message` saying 'To continue 
shopping at Amazon.com, please enable cookies in your Web browser.'

I've enabled cookies in `settings.py`, and after setting `COOKIES_DEBUG = 
True` I can see that cookies are indeed being passed around:
```
DEBUG: Received cookies from: <302 https://www.mturk.com/mturk/beginsignin>
Set-Cookie: mtuid=52890b4443158d8321eaaa29cb43b1; Domain=www.mturk.com; Path=/
Set-Cookie: worker_state=VFoyMHdlVTBGMWVMWVczZVZLaTRqMVpTN2Z3PTIwMTQwNzAzMDAzM1VzZXIudHVya1NlY3VyZX50cnVlJQ--; Expires=Fri, 03-Jul-2015 07:13:00 GMT; Secure
```
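
The relevant part of `settings.py` would look roughly like the sketch below. 
`COOKIES_DEBUG` is the setting mentioned above; `COOKIES_ENABLED` is assumed 
to be what "enabled cookies" refers to, since that is Scrapy's standard 
switch for its cookies middleware.

```python
# settings.py (sketch; only the cookie-related settings are shown)

# Assumption: this is the setting used to "enable cookies".
# It turns Scrapy's cookies middleware on.
COOKIES_ENABLED = True

# Log every Cookie header sent and every Set-Cookie header received,
# which produces "Received cookies from: ..." lines like the ones above.
COOKIES_DEBUG = True
```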

Here's what I have so far. I've modified InitSpider to inherit from 
CrawlSpider instead of BaseSpider for convenience. The problem still lies in 
the login step, so I don't think it matters which spider I'm using.

```python
# bunch of imports (omitted)
class MySpider(InitSpider):
    #copy code from CrawlSpider to inherit from it
    name = "AMT"
    allowed_domains = ["mturk.com","amazon.com"]
    start_urls = [
        'https://www.mturk.com/mturk/viewhits?searchWords=&selectedSearchType=hitgroups&sortType=Title%3A0&pageNumber=1&searchSpec=HITGroupSearch%23T%232%2310%23-1%23T%23%21%23%21Title%210%21%23%21'
    ]
    login_page = "https://www.mturk.com/mturk/beginsignin"
    formdata={'email': '123atfoo.com', 'create':'0','password': 'bar'}
    rules = (
        Rule(LinkExtractor(allow=('pageNumber=', 'findhits?')),
             callback='parse_page'),
        Rule(LinkExtractor(allow=('checkregistration'))),
        # Rule(LinkExtractor(allow=('signin?',), deny=('FORGOT', 'forgot')),
        #      callback='login', follow=True),
    )

    #override initial behavior
    def init_request(self):
        return Request(url=self.login_page,
            callback=self.login)

    def login(self, response):
        # send login request
        return FormRequest.from_response(response,
                    formdata=self.formdata,
                    callback=self.check_login_response)

    def check_login_response(self, response):
        # check login status
        if "Welcome" in response.body:
            self.log("\n\n\nSuccessfully logged in.\n\n\n")
        else:
            self.log("\n\n\nLogin failed\n\n\n")
        # start crawling anyway, which leads to start_urls
        return self.initialized()

    def parse_page(self, response):
        # my parsing logic goes here (omitted)
        pass
```

Any suggestions on how to successfully log in to Amazon?
Thanks
