I'd still like an answer to the previous questions, but as an update: I
deleted dmoz_debug2 and then ran runspider on tutfollinksc again. It ran...
but then I got a NotImplementedError, so I am right back where I started.
malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$ scrapy runspider
tutfollinksc.py -o tutfollinks_dmoz_c.json
2015-10-15 21:27:42 [scrapy] INFO: Scrapy 1.0.3.post6+g2d688cd started
(bot: tutorial)
2015-10-15 21:27:42 [scrapy] INFO: Optional features available: ssl, http11
2015-10-15 21:27:42 [scrapy] INFO: Overridden settings:
{'NEWSPIDER_MODULE': 'tutorial.spiders', 'FEED_FORMAT': 'json',
'SPIDER_MODULES': ['tutorial.spiders'], 'FEED_URI':
'tutfollinks_dmoz_c.json', 'BOT_NAME': 'tutorial'}
2015-10-15 21:27:42 [scrapy] INFO: Enabled extensions: CloseSpider,
FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-15 21:27:42 [scrapy] INFO: Enabled downloader middlewares:
HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware,
RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware,
HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware,
ChunkedTransferMiddleware, DownloaderStats
2015-10-15 21:27:42 [scrapy] INFO: Enabled spider middlewares:
HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware,
UrlLengthMiddleware, DepthMiddleware
2015-10-15 21:27:42 [scrapy] INFO: Enabled item pipelines:
2015-10-15 21:27:42 [scrapy] INFO: Spider opened
2015-10-15 21:27:42 [scrapy] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min)
2015-10-15 21:27:42 [scrapy] DEBUG: Telnet console listening on
127.0.0.1:6023
2015-10-15 21:27:43 [scrapy] DEBUG: Crawled (200) <GET
http://www.dmoz.org/Computers/Programming/Languages/Python/> (referer: None)
2015-10-15 21:27:43 [scrapy] ERROR: Spider error processing <GET
http://www.dmoz.org/Computers/Programming/Languages/Python/> (referer: None)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line
577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/lib/pymodules/python2.7/scrapy/spiders/__init__.py", line 76,
in parse
raise NotImplementedError
NotImplementedError
2015-10-15 21:27:43 [scrapy] INFO: Closing spider (finished)
2015-10-15 21:27:43 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 264,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 7386,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 10, 16, 2, 27, 43, 759336),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/NotImplementedError': 1,
'start_time': datetime.datetime(2015, 10, 16, 2, 27, 42, 970895)}
2015-10-15 21:27:43 [scrapy] INFO: Spider closed (finished)
malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$
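For reference, the frame the traceback points at (scrapy/spiders/__init__.py,
line 76, in parse) is the base Spider class's default parse(), which does
nothing but raise NotImplementedError. A minimal sketch of the situation as I
understand it (hypothetical spider name, not my actual file):

import scrapy

class BareSpider(scrapy.Spider):
    # Hypothetical name, for illustration only.
    name = "bare"
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/"]
    # No parse() is defined here, so every crawled response falls through to
    # scrapy.Spider.parse, which raises NotImplementedError and produces the
    # same traceback shown above.

In other words, it looks as if the spider Scrapy actually loads has no parse()
of its own, even though the file I think I am running clearly defines one.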
Some enlightenment would be greatly appreciated.
On Monday, October 12, 2015 at 8:52:13 PM UTC-5, Malik Rumi wrote:
>
> I posted a related question to Stack Overflow at
> http://stackoverflow.com/questions/33084480/scrapy-error-can-t-find-callback,
> but so far it has no answers.
>
> I am not able to get a spider to crawl past the first page of any site I
> have tried, despite many iterations and many re-reads of the docs. I
> decided to test it against the example code from the docs.
> The only change I made was to the name, so I could tell it apart.
>
> '''
> Copied from the Scrapy 1.0.3 docs at PDF page 15, section 2.3, Scrapy Tutorial.
> Run this, as is, on Dmoz.
> '''
>
> import scrapy
> from tutorial.items import DmozItem
>
> class DmozSpider(scrapy.Spider):
>     name = "tutfollinks"
>     allowed_domains = ["dmoz.org"]
>     start_urls = [
>         "http://www.dmoz.org/Computers/Programming/Languages/Python/",
>     ]
>
>     def parse(self, response):
>         for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
>             url = response.urljoin(href.extract())
>             yield scrapy.Request(url, callback=self.parse_dir_contents)
>
>     def parse_dir_contents(self, response):
>         for sel in response.xpath('//ul/li'):
>             item = DmozItem()
>             item['title'] = sel.xpath('a/text()').extract()
>             item['link'] = sel.xpath('a/@href').extract()
>             item['desc'] = sel.xpath('text()').extract()
>             yield item
>
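> For completeness, the items.py this imports is the one from the same tutorial
> section; from memory (not pasted from my actual file) it looks like this:
>
> import scrapy
>
> class DmozItem(scrapy.Item):
>     title = scrapy.Field()
>     link = scrapy.Field()
>     desc = scrapy.Field()
>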
> And here is what I got:
>
> Traceback (most recent call last):
> File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line
> 577, in _runCallbacks
> current.result = callback(current.result, *args, **kw)
> File "/usr/lib/pymodules/python2.7/scrapy/spiders/__init__.py", line 76,
> in parse
> raise NotImplementedError
> NotImplementedError
> 2015-10-12 19:31:21 [scrapy] INFO: Closing spider (finished)
>
> When I googled the error, my first hit was:
>
> http://stackoverflow.com/questions/5264829/why-does-scrapy-throw-an-error-for-me-when-trying-to-spider-and-parse-a-site
>
> The answer, according to the OP, was to change from BaseSpider to
> CrawlSpider. But, I repeat, this is copied verbatim from the example in
> the docs. Then how can it throw an error? In fact, the whole point of the
> example in the docs is to show how to crawl a site WITHOUT CrawlSpider,
> which is introduced for the first time in a note at the end of section 2.3.4.
>
> Another SO post had a similar issue, but in that case the original code
> was subclassed from CrawlSpider, and the OP was told he had accidentally
> overridden parse(). But I see parse() being used in various examples in
> the docs, including this one. What, exactly, constitutes 'overriding
> parse()'? Is it adding variables, as the example in the docs does? How can
> that be?
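> To make sure I understand the distinction that answer is drawing, here is my
> own rough sketch of the CrawlSpider case (the names are made up, and this is
> my reading rather than anything taken from the docs verbatim). As I understand
> it, CrawlSpider implements parse() itself to drive its rules, so a subclass
> has to point rule callbacks at a differently named method; redefining parse()
> in the subclass is what 'overriding parse()' means there.
>
> from scrapy.spiders import CrawlSpider, Rule
> from scrapy.linkextractors import LinkExtractor
>
> class ExampleCrawlSpider(CrawlSpider):
>     name = "crawl_example"  # hypothetical name, for illustration only
>     allowed_domains = ["dmoz.org"]
>     start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/"]
>
>     rules = (
>         # The callback is deliberately not named "parse"; calling it "parse"
>         # would replace CrawlSpider's own parse() and break rule handling.
>         Rule(LinkExtractor(), callback="parse_item"),
>     )
>
>     def parse_item(self, response):
>         # Scrapy 1.0 accepts plain dicts as items.
>         yield {"url": response.url, "title": response.css("title::text").extract()}
>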
>
> Furthermore, the callback in this case is explicitly not parse, but
> parse_dir_contents.
>
> What is going on here? Please, I'd like an explanation of why this happens
> as well as the (hopefully simple) answer. Thanks.
>