...meaning that if you rename your "parse_item" method to "parse", you should be good.
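To see why renaming works, here is a minimal, Scrapy-free sketch of the dispatch pattern involved. This is illustrative only, not Scrapy's actual code: the class and method names mimic `scrapy.spider.BaseSpider` from the 0.18 log above, but everything else is made up for the example.

```python
# Illustrative sketch of why a BaseSpider subclass without parse() fails.
# NOT Scrapy's real implementation -- just the same default-callback pattern.

class BaseSpider(object):
    """Mimics BaseSpider: responses for start_urls go to parse() by default."""

    def parse(self, response):
        # This mirrors what scrapy/spider.py#L55 does when you never
        # override parse(): the default callback simply raises.
        raise NotImplementedError

class BrokenSpider(BaseSpider):
    # parse_item() is never looked up for start_urls responses,
    # so the inherited parse() above runs and raises.
    def parse_item(self, response):
        return {'url': response}

class FixedSpider(BaseSpider):
    # Renaming the method to parse() overrides the default callback.
    def parse(self, response):
        return {'url': response}

if __name__ == '__main__':
    try:
        # What the engine effectively does with a start_urls response:
        BrokenSpider().parse('<html>')
    except NotImplementedError:
        print('BrokenSpider: NotImplementedError')
    print(FixedSpider().parse('<html>'))
```

The other way out, if you want to keep the `parse_item` name, is to build the `Request` objects yourself and pass `callback=self.parse_item` explicitly, so the default `parse` callback is never consulted.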
On Friday, February 7, 2014 5:12:01 PM UTC+1, Paul Tremberth wrote:
>
> Hi Marco,
>
> when you use BaseSpider, you should define the parse callback to process
> the responses for the URLs in start_urls.
> Otherwise you get this NotImplementedError:
> https://github.com/scrapy/scrapy/blob/master/scrapy/spider.py#L55
>
> /Paul.
>
> On Friday, February 7, 2014 5:07:31 PM UTC+1, Marco Ippolito wrote:
>>
>> Hi everybody,
>>
>> through the scrapy shell:
>>
>> scrapy shell http://www.ilsole24ore.com/
>> hxs.select('//a[contains(@href, "http")]/@href').extract()
>> Out[1]:
>> [u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2014/crisi_impresa/index.shtml',
>>  u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2014/crisi_impresa/index.shtml',
>>  u'http://www.ilsole24ore.com/cultura.shtml',
>>  u'http://www.casa24.ilsole24ore.com/',
>>  u'http://www.moda24.ilsole24ore.com/',
>>  u'http://food24.ilsole24ore.com/',
>>  u'http://www.motori24.ilsole24ore.com/',
>>  u'http://job24.ilsole24ore.com/',
>>  u'http://stream24.ilsole24ore.com/',
>>  u'http://www.viaggi24.ilsole24ore.com/',
>>  u'http://www.salute24.ilsole24ore.com/',
>>  u'http://www.shopping24.ilsole24ore.com/',
>>  .....
>>
>> but it doesn't work outside the scrapy shell:
>>
>> items.py:
>>
>> from scrapy.item import Item, Field
>>
>> class Sole24OreItem(Item):
>>     url = Field()
>>     pass
>>
>> sole.py:
>>
>> from scrapy.spider import BaseSpider
>> from scrapy.selector import HtmlXPathSelector
>> from sole24ore.items import Sole24OreItem
>>
>> class SoleSpider(BaseSpider):
>>     name = 'sole'
>>     allowed_domains = ['sole24ore.com']
>>     start_urls = ['http://www.sole24ore.com/']
>>
>>     def parse_item(self, response):
>>         hxs = HtmlXPathSelector(response)
>>         item = Sole24OreItem()
>>         url = hxs.select('//a[contains(@href, "http")]/@href').extract()
>>         item['url'] = url
>>         return item
>>
>> SPIDER = SoleSpider()
>>
>> sole24ore]$ scrapy crawl sole
>> 2014-02-07 16:04:41+0000 [scrapy] INFO: Scrapy 0.18.4 started (bot: sole24ore)
>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Optional features available: ssl, http11, boto
>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'sole24ore.spiders', 'SPIDER_MODULES': ['sole24ore.spiders'], 'BOT_NAME': 'sole24ore'}
>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Enabled item pipelines:
>> 2014-02-07 16:04:41+0000 [sole] INFO: Spider opened
>> 2014-02-07 16:04:41+0000 [sole] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6024
>> 2014-02-07 16:04:41+0000 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
>> 2014-02-07 16:04:41+0000 [sole] DEBUG: Redirecting (301) to <GET http://www.ilsole24ore.com/> from <GET http://www.sole24ore.com/>
>> 2014-02-07 16:04:41+0000 [sole] DEBUG: Crawled (200) <GET http://www.ilsole24ore.com/> (referer: None)
>> 2014-02-07 16:04:41+0000 [sole] ERROR: Spider error processing <GET http://www.ilsole24ore.com/>
>>     Traceback (most recent call last):
>>       File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
>>         self.runUntilCurrent()
>>       File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
>>         call.func(*call.args, **call.kw)
>>       File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
>>         self._startRunCallbacks(result)
>>       File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
>>         self._runCallbacks()
>>     --- <exception caught here> ---
>>       File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
>>         current.result = callback(current.result, *args, **kw)
>>       File "/usr/lib/pymodules/python2.7/scrapy/spider.py", line 57, in parse
>>         raise NotImplementedError
>>     exceptions.NotImplementedError:
>>
>> 2014-02-07 16:04:41+0000 [sole] INFO: Closing spider (finished)
>> 2014-02-07 16:04:41+0000 [sole] INFO: Dumping Scrapy stats:
>>     {'downloader/request_bytes': 448,
>>      'downloader/request_count': 2,
>>      'downloader/request_method_count/GET': 2,
>>      'downloader/response_bytes': 47635,
>>      'downloader/response_count': 2,
>>      'downloader/response_status_count/200': 1,
>>      'downloader/response_status_count/301': 1,
>>      'finish_reason': 'finished',
>>      'finish_time': datetime.datetime(2014, 2, 7, 16, 4, 41, 585750),
>>      'log_count/DEBUG': 8,
>>      'log_count/ERROR': 1,
>>      'log_count/INFO': 3,
>>      'response_received_count': 1,
>>      'scheduler/dequeued': 2,
>>      'scheduler/dequeued/memory': 2,
>>      'scheduler/enqueued': 2,
>>      'scheduler/enqueued/memory': 2,
>>      'spider_exceptions/NotImplementedError': 1,
>>      'start_time': datetime.datetime(2014, 2, 7, 16, 4, 41, 240417)}
>> 2014-02-07 16:04:41+0000 [sole] INFO: Spider closed (finished)
>>
>> Any hints to help me?
>>
>> Thank you very much.
>> Kind regards.
>> Marco
