Hi,

I tested the URL with a different User-Agent and it returns 200. Run this in a terminal:
1 - scrapy shell http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1 -s USER_AGENT='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'
2 - sel.xpath('//title//text()').extract()

So to resolve the problem, put this line in settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1'

Regards.

---------
Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
Email/Gtalk: [email protected] - Skype: baazzilhassan
Blog: http://blog.jbinfo.io/

2014-06-24 16:02 GMT+01:00 Lhassan Baazzi <[email protected]>:

> Hi,
>
> The Amazon server detects that the request comes from a bot, so it
> returns 403 (Forbidden) as the HTTP status code.
> Try changing the User-Agent.
>
> Regards.
> ---------
> Lhassan Baazzi | Web Developer PHP - Symfony - JS - Scrapy
> Email/Gtalk: [email protected] - Skype: baazzilhassan
> Blog: http://blog.jbinfo.io/
>
> 2014-06-24 15:13 GMT+01:00 Abhijeet Raj <[email protected]>:
>
>> I have the following code to crawl some data, but when I run the spider it
>> never enters the parse function. The code is as follows:
>>
>> from scrapy.item import Item, Field
>> from scrapy.selector import Selector
>> from scrapy.spider import BaseSpider
>> from scrapy.selector import HtmlXPathSelector
>>
>>
>> class MyItem(Item):
>>     reviewer_ranking = Field()
>>     print "asdadsa"
>>
>>
>> class MySpider(BaseSpider):
>>     name = 'myspider'
>>     domain_name = ["amazon.com/gp/pdp/profile"]
>>     start_urls = ["http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1"]
>>     print "*****"
>>
>>     def parse(self, response):
>>         print "fggfggftgtr"
>>         sel = Selector(response)
>>         hxs = HtmlXPathSelector(response)
>>         item = MyItem()
>>         item["reviewer_ranking"] = hxs.select('//span[@class="a-size-small a-color-secondary"]/text()').extract()
>>         return item
>>
>> The output screen looks like this:
>>
>> asdadsa
>> *****
>> /home/raj/Documents/IIM A/Daily sales rank/Daily reviews/Reviews_scripts/Scripts_review/Reviews/Reviewer/crawler_reviewers_data.py:18: ScrapyDeprecationWarning: crawler_reviewers_data.MySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider.
>> (warning only on first subclass, there may be others)
>>   class MySpider(BaseSpider):
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Optional features available: ssl, http11
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Overridden settings: {}
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>> 2014-06-24 19:41:38+0530 [scrapy] INFO: Enabled item pipelines:
>> 2014-06-24 19:41:38+0530 [myspider] INFO: Spider opened
>> 2014-06-24 19:41:38+0530 [myspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>> 2014-06-24 19:41:38+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6027
>> 2014-06-24 19:41:38+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6084
>> 2014-06-24 19:41:38+0530 [myspider] DEBUG: Crawled (403) <GET http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1> (referer: None) ['partial']
>> 2014-06-24 19:41:38+0530 [myspider] INFO: Closing spider (finished)
>> 2014-06-24 19:41:38+0530 [myspider] INFO: Dumping Scrapy stats:
>>   {'downloader/request_bytes': 242,
>>    'downloader/request_count': 1,
>>    'downloader/request_method_count/GET': 1,
>>    'downloader/response_bytes': 28486,
>>    'downloader/response_count': 1,
>>    'downloader/response_status_count/403': 1,
>>    'finish_reason': 'finished',
>>    'finish_time': datetime.datetime(2014, 6, 24, 14, 11, 38, 696574),
>>    'log_count/DEBUG': 3,
>>    'log_count/INFO': 7,
>>    'response_received_count': 1,
>>    'scheduler/dequeued': 1,
>>    'scheduler/dequeued/memory': 1,
>>    'scheduler/enqueued': 1,
>>    'scheduler/enqueued/memory': 1,
>>    'start_time': datetime.datetime(2014, 6, 24, 14, 11, 38, 513615)}
>> 2014-06-24 19:41:38+0530 [myspider] INFO: Spider closed (finished)
>>
>> Please help me out, I am stuck.
>>
>> --
>> You received this message because you are subscribed to the Google Groups "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
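P.S. Putting the advice together, a minimal sketch of the relevant settings.py line (only the USER_AGENT string is taken from the shell test above; splitting it across adjacent string literals is just for readability):

```python
# settings.py (sketch): identify the crawler as a regular browser so the
# server does not answer with 403 Forbidden. This is the exact UA string
# that returned 200 in the scrapy shell test above.
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 '
              '(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1')
```

Separately from the 403, the crawl log also shows a ScrapyDeprecationWarning: the spider inherits from the deprecated scrapy.spider.BaseSpider and should inherit from scrapy.spider.Spider instead.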
