Hi,

I tested the URL with a different User-Agent and it returns 200. Run this in a terminal:
1 - scrapy shell http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1 -s USER_AGENT='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'
2 - sel.xpath('//title//text()').extract()

So to resolve the problem, put this line in settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'

Regards.

---------
Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
Email/Gtalk: [email protected] - Skype: baazzilhassan
Blog: http://blog.jbinfo.io/

2014-06-24 16:02 GMT+01:00 Lhassan Baazzi <[email protected]>:

> Hi,
>
> The Amazon server detects that the request comes from a bot, so it
> returns 403 (Forbidden) as the HTTP status code.
> Try changing the User-Agent.
>
> Regards.
> ---------
> Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
> Email/Gtalk: [email protected] - Skype: baazzilhassan
> Blog: http://blog.jbinfo.io/
>
> 2014-06-24 15:13 GMT+01:00 Abhijeet Raj <[email protected]>:
>
>> I have the following code to crawl some data, but when I run the spider it
>> never enters the parse function. The code is as follows:
>>
>> from scrapy.item import Item, Field
>> from scrapy.selector import Selector
>> from scrapy.spider import BaseSpider
>> from scrapy.selector import HtmlXPathSelector
>>
>>
>> class MyItem(Item):
>>     reviewer_ranking = Field()
>>     print "asdadsa"
>>
>>
>> class MySpider(BaseSpider):
>>     name = 'myspider'
>>     domain_name = ["amazon.com/gp/pdp/profile"]
>>     start_urls = ["http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1"]
>>     print "*****"
>>
>>     def parse(self, response):
>>         print "fggfggftgtr"
>>         sel = Selector(response)
>>         hxs = HtmlXPathSelector(response)
>>         item = MyItem()
>>         item["reviewer_ranking"] = hxs.select('//span[@class="a-size-small a-color-secondary"]/text()').extract()
>>         return item
>>
>> The output screen looks like this:
>>
>> asdadsa
>> *****
>> /home/raj/Documents/IIM A/Daily sales rank/Daily reviews/Reviews_scripts/Scripts_review/Reviews/Reviewer/crawler_reviewers_data.py:18: ScrapyDeprecationWarning: crawler_reviewers_data.MySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider.
>> (warning only on first subclass, there may be others)
>>   class MySpider(BaseSpider):
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Optional features available: ssl, http11
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Overridden settings: {}
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Enabled item pipelines:
>> 2014-06-24 19:41:38+0530 [myspider] INFO: Spider opened
>> 2014-06-24 19:41:38+0530 [myspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>> 2014-06-24 19:41:38+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6027
>> 2014-06-24 19:41:38+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6084
>> 2014-06-24 19:41:38+0530 [myspider] DEBUG: Crawled (403) <GET http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1> (referer: None) ['partial']
>> 2014-06-24 19:41:38+0530 [myspider] INFO: Closing spider (finished)
>> 2014-06-24 19:41:38+0530 [myspider] INFO: Dumping Scrapy stats:
>>   {'downloader/request_bytes': 242,
>>    'downloader/request_count': 1,
>>    'downloader/request_method_count/GET': 1,
>>    'downloader/response_bytes': 28486,
>>    'downloader/response_count': 1,
>>    'downloader/response_status_count/403': 1,
>>    'finish_reason': 'finished',
>>    'finish_time': datetime.datetime(2014, 6, 24, 14, 11, 38, 696574),
>>    'log_count/DEBUG': 3,
>>    'log_count/INFO': 7,
>>    'response_received_count': 1,
>>    'scheduler/dequeued': 1,
>>    'scheduler/dequeued/memory': 1,
>>    'scheduler/enqueued': 1,
>>    'scheduler/enqueued/memory': 1,
>>    'start_time': datetime.datetime(2014, 6, 24, 14, 11, 38, 513615)}
>> 2014-06-24 19:41:38+0530 [myspider] INFO: Spider closed (finished)
>>
>> Please help me out, I am stuck.
>>
>> --
>> You received this message because you are subscribed to the Google Groups "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
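P.S. Putting the advice together, a minimal sketch of the relevant settings.py line (only the USER_AGENT string is taken from the shell test above; splitting it across adjacent string literals is just for readability):

```python
# settings.py (sketch): identify the crawler as a regular browser so the
# server does not answer with 403 Forbidden. This is the exact UA string
# that returned 200 in the scrapy shell test above.
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 '
              '(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1')
```

Separately from the 403, the crawl log also shows a ScrapyDeprecationWarning: the spider inherits from the deprecated scrapy.spider.BaseSpider and should inherit from scrapy.spider.Spider instead.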
