You're welcome!
Happy scraping :)

On Friday, February 7, 2014 5:23:50 PM UTC+1, Marco Ippolito wrote:
>
> Hi Paul,
>
> thank you very much for your kind, prompt and helpful hint! It works.
>
> I learned one more thing today: "the devil is in the details".
>
> Kind regards.
> Marco
>
>
> On Friday, 7 February 2014 17:13:07 UTC+1, Paul Tremberth wrote:
>>
>> ...meaning that if you rename your "parse_item" method to "parse", you
>> should be good.
>>
>> On Friday, February 7, 2014 5:12:01 PM UTC+1, Paul Tremberth wrote:
>>>
>>> Hi Marco,
>>>
>>> when you subclass BaseSpider, you should define the parse callback to
>>> process the responses for the URLs in start_urls.
>>> Otherwise you get this NotImplementedError:
>>> https://github.com/scrapy/scrapy/blob/master/scrapy/spider.py#L55
>>>
>>> /Paul.
>>>
>>> On Friday, February 7, 2014 5:07:31 PM UTC+1, Marco Ippolito wrote:
>>>>
>>>> Hi everybody,
>>>>
>>>> this works through scrapy shell:
>>>>
>>>> scrapy shell http://www.ilsole24ore.com/
>>>> hxs.select('//a[contains(@href, "http")]/@href').extract()
>>>> Out[1]:
>>>> [u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2014/crisi_impresa/index.shtml',
>>>>  u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2014/crisi_impresa/index.shtml',
>>>>  u'http://www.ilsole24ore.com/cultura.shtml',
>>>>  u'http://www.casa24.ilsole24ore.com/',
>>>>  u'http://www.moda24.ilsole24ore.com/',
>>>>  u'http://food24.ilsole24ore.com/',
>>>>  u'http://www.motori24.ilsole24ore.com/',
>>>>  u'http://job24.ilsole24ore.com/',
>>>>  u'http://stream24.ilsole24ore.com/',
>>>>  u'http://www.viaggi24.ilsole24ore.com/',
>>>>  u'http://www.salute24.ilsole24ore.com/',
>>>>  u'http://www.shopping24.ilsole24ore.com/',
>>>>  .....
>>>>
>>>> but it doesn't work outside scrapy shell:
>>>>
>>>> items.py:
>>>>
>>>> from scrapy.item import Item, Field
>>>>
>>>> class Sole24OreItem(Item):
>>>>     url = Field()
>>>>     pass
>>>>
>>>> sole.py:
>>>>
>>>> from scrapy.spider import BaseSpider
>>>> from scrapy.selector import HtmlXPathSelector
>>>> from sole24ore.items import Sole24OreItem
>>>>
>>>> class SoleSpider(BaseSpider):
>>>>     name = 'sole'
>>>>     allowed_domains = ['sole24ore.com']
>>>>     start_urls = ['http://www.sole24ore.com/']
>>>>
>>>>     def parse_item(self, response):
>>>>
>>>>         hxs = HtmlXPathSelector(response)
>>>>         item = Sole24OreItem()
>>>>         url = hxs.select('//a[contains(@href, "http")]/@href').extract()
>>>>         item['url'] = url
>>>>
>>>>         return item
>>>>
>>>> SPIDER = SoleSpider()
>>>>
>>>> sole24ore]$scrapy crawl sole
>>>> 2014-02-07 16:04:41+0000 [scrapy] INFO: Scrapy 0.18.4 started (bot: sole24ore)
>>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Optional features available: ssl, http11, boto
>>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'sole24ore.spiders', 'SPIDER_MODULES': ['sole24ore.spiders'], 'BOT_NAME': 'sole24ore'}
>>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled item pipelines:
>>>> 2014-02-07 16:04:41+0000 [sole] INFO: Spider opened
>>>> 2014-02-07 16:04:41+0000 [sole] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6024
>>>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
>>>> 2014-02-07 16:04:41+0000 [sole] DEBUG: Redirecting (301) to <GET http://www.ilsole24ore.com/> from <GET http://www.sole24ore.com/>
>>>> 2014-02-07 16:04:41+0000 [sole] DEBUG: Crawled (200) <GET http://www.ilsole24ore.com/> (referer: None)
>>>> 2014-02-07 16:04:41+0000 [sole] ERROR: Spider error processing <GET http://www.ilsole24ore.com/>
>>>> Traceback (most recent call last):
>>>>   File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
>>>>     self.runUntilCurrent()
>>>>   File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
>>>>     call.func(*call.args, **call.kw)
>>>>   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
>>>>     self._startRunCallbacks(result)
>>>>   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
>>>>     self._runCallbacks()
>>>> --- <exception caught here> ---
>>>>   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
>>>>     current.result = callback(current.result, *args, **kw)
>>>>   File "/usr/lib/pymodules/python2.7/scrapy/spider.py", line 57, in parse
>>>>     raise NotImplementedError
>>>> exceptions.NotImplementedError:
>>>>
>>>> 2014-02-07 16:04:41+0000 [sole] INFO: Closing spider (finished)
>>>> 2014-02-07 16:04:41+0000 [sole] INFO: Dumping Scrapy stats:
>>>> {'downloader/request_bytes': 448,
>>>>  'downloader/request_count': 2,
>>>>  'downloader/request_method_count/GET': 2,
>>>>  'downloader/response_bytes': 47635,
>>>>  'downloader/response_count': 2,
>>>>  'downloader/response_status_count/200': 1,
>>>>  'downloader/response_status_count/301': 1,
>>>>  'finish_reason': 'finished',
>>>>  'finish_time': datetime.datetime(2014, 2, 7, 16, 4, 41, 585750),
>>>>  'log_count/DEBUG': 8,
>>>>  'log_count/ERROR': 1,
>>>>  'log_count/INFO': 3,
>>>>  'response_received_count': 1,
>>>>  'scheduler/dequeued': 2,
>>>>  'scheduler/dequeued/memory': 2,
>>>>  'scheduler/enqueued': 2,
>>>>  'scheduler/enqueued/memory': 2,
>>>>  'spider_exceptions/NotImplementedError': 1,
>>>>  'start_time': datetime.datetime(2014, 2, 7, 16, 4, 41, 240417)}
>>>> 2014-02-07 16:04:41+0000 [sole] INFO: Spider closed (finished)
>>>>
>>>>
>>>> Any hints to help me?
>>>>
>>>> Thank you very much.
>>>> Kind regards.
>>>> Marco
>>>>
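
For anyone finding this thread later: the reason renaming "parse_item" to "parse" helps can be illustrated with a minimal sketch in plain Python. This is not Scrapy's actual code. BaseSpiderSketch, crawl, run, BrokenSpider and FixedSpider are hypothetical names made up for the illustration, and the "response" is just a string. The only thing taken from the thread is the behaviour Paul describes: the base class has a default parse callback that raises NotImplementedError, and it is the callback used for the URLs in start_urls.

```python
class BaseSpiderSketch(object):
    """Simplified stand-in for a base spider (NOT the real Scrapy class)."""
    start_urls = []

    def parse(self, response):
        # Default callback: subclasses are expected to override it.
        raise NotImplementedError


def crawl(spider):
    """Hypothetical mini-engine: call spider.parse once per start URL."""
    results = []
    for url in spider.start_urls:
        fake_response = "<response for %s>" % url  # placeholder, no network
        results.append(spider.parse(fake_response))
    return results


class BrokenSpider(BaseSpiderSketch):
    start_urls = ['http://www.sole24ore.com/']

    def parse_item(self, response):
        # Never called: nothing in the engine refers to this name.
        return {'url': response}


class FixedSpider(BaseSpiderSketch):
    start_urls = ['http://www.sole24ore.com/']

    def parse(self, response):
        # Overrides the default callback, so the engine reaches this code.
        return {'url': response}


def run(spider):
    """Run the sketch engine and report the error instead of crashing."""
    try:
        return crawl(spider)
    except NotImplementedError:
        return 'NotImplementedError'


print(run(BrokenSpider()))
print(run(FixedSpider()))
```

The first call falls through to the base class default and reports the error, while the second returns one item per start URL. In a real spider the same idea applies: either name the method "parse", or make sure something explicitly passes your differently named method as the callback.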
--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
Visit this group at http://groups.google.com/group/scrapy-users.
