I'd still like an answer to the previous questions, but as an update: I 
deleted dmoz_debug2 and then ran runspider on tutfollinksc again. It ran, 
but then I got a NotImplementedError, so I am right back where I started.

malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$ scrapy runspider 
tutfollinksc.py -o tutfollinks_dmoz_c.json
2015-10-15 21:27:42 [scrapy] INFO: Scrapy 1.0.3.post6+g2d688cd started 
(bot: tutorial)
2015-10-15 21:27:42 [scrapy] INFO: Optional features available: ssl, http11
2015-10-15 21:27:42 [scrapy] INFO: Overridden settings: 
{'NEWSPIDER_MODULE': 'tutorial.spiders', 'FEED_FORMAT': 'json', 
'SPIDER_MODULES': ['tutorial.spiders'], 'FEED_URI': 
'tutfollinks_dmoz_c.json', 'BOT_NAME': 'tutorial'}
2015-10-15 21:27:42 [scrapy] INFO: Enabled extensions: CloseSpider, 
FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2015-10-15 21:27:42 [scrapy] INFO: Enabled downloader middlewares: 
HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, 
RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, 
HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, 
ChunkedTransferMiddleware, DownloaderStats
2015-10-15 21:27:42 [scrapy] INFO: Enabled spider middlewares: 
HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, 
UrlLengthMiddleware, DepthMiddleware
2015-10-15 21:27:42 [scrapy] INFO: Enabled item pipelines: 
2015-10-15 21:27:42 [scrapy] INFO: Spider opened
2015-10-15 21:27:42 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), 
scraped 0 items (at 0 items/min)
2015-10-15 21:27:42 [scrapy] DEBUG: Telnet console listening on 
127.0.0.1:6023
2015-10-15 21:27:43 [scrapy] DEBUG: Crawled (200) <GET 
http://www.dmoz.org/Computers/Programming/Languages/Python/> (referer: None)
2015-10-15 21:27:43 [scrapy] ERROR: Spider error processing <GET 
http://www.dmoz.org/Computers/Programming/Languages/Python/> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 
577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/pymodules/python2.7/scrapy/spiders/__init__.py", line 76, 
in parse
    raise NotImplementedError
NotImplementedError
2015-10-15 21:27:43 [scrapy] INFO: Closing spider (finished)
2015-10-15 21:27:43 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 264,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 7386,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 16, 2, 27, 43, 759336),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/NotImplementedError': 1,
 'start_time': datetime.datetime(2015, 10, 16, 2, 27, 42, 970895)}
2015-10-15 21:27:43 [scrapy] INFO: Spider closed (finished)
malikarumi@Tetuoan2:~/Projects/tutorial/tutorial/spiders$ 
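
For reference, unless I am misreading the traceback, line 76 of 
scrapy/spiders/__init__.py is the default parse() on the scrapy.Spider 
base class itself, which in the 1.0.x source is essentially this:

    def parse(self, response):
        # Default implementation on scrapy.Spider: subclasses are expected
        # to override this (or name a different callback on each Request),
        # so reaching it means the crawler never found a parse() defined on
        # the spider it actually loaded.
        raise NotImplementedError

So, as I read it, the error means the spider object the crawler ran did 
not have its own parse() method at all.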


Some enlightenment would be greatly appreciated.
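
In case it helps with a diagnosis, here is a minimal, self-contained 
sketch (the file name, spider name, and output field names are just 
placeholders I made up) that I would expect to run cleanly with 
"scrapy runspider minimal_check.py" from any directory, since it imports 
nothing from the project:

    import scrapy

    class MinimalSpider(scrapy.Spider):
        name = "minimal_check"
        start_urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/",
        ]

        def parse(self, response):
            # Yielding plain dicts works in Scrapy 1.0+, so no Item class
            # (and no tutorial.items import) is needed here.
            for sel in response.xpath('//ul/li'):
                yield {
                    'title': sel.xpath('a/text()').extract(),
                    'link': sel.xpath('a/@href').extract(),
                }

If that runs without the NotImplementedError, the problem is presumably 
in my file rather than in the install.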



On Monday, October 12, 2015 at 8:52:13 PM UTC-5, Malik Rumi wrote:
>
> I posted a related question to Stack Overflow at 
> http://stackoverflow.com/questions/33084480/scrapy-error-can-t-find-callback, 
> but so far it has no answers.
>
> I am not able to get a spider to crawl past the first page of any site I 
> have tried, despite many iterations and many re-reads of the docs. I 
> decided to test it against the example code from the docs.
> The only change I made was to the name, so I could tell it apart.
>
> '''
> Copied from the Scrapy 1.0.3 docs at pdf page 15, section 2.3, Scrapy Tutorial.
> Run this, as is, on Dmoz.
> '''
>
> import scrapy
> from tutorial.items import DmozItem
>
> class DmozSpider(scrapy.Spider):
>     name = "tutfollinks"
>     allowed_domains = ["dmoz.org"]
>     start_urls = [
>         "http://www.dmoz.org/Computers/Programming/Languages/Python/";,
>     ]
>
>     def parse(self, response):
>         for href in response.css("ul.directory.dir-col > li > 
> a::attr('href')"):
>             url = response.urljoin(href.extract())
>             yield scrapy.Request(url, callback=self.parse_dir_contents)
>
>     def parse_dir_contents(self, response):
>         for sel in response.xpath('//ul/li'):
>             item = DmozItem()
>             item['title'] = sel.xpath('a/text()').extract()
>             item['link'] = sel.xpath('a/@href').extract()
>             item['desc'] = sel.xpath('text()').extract()
>             yield item
>
> And here is what I got:
>
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 
> 577, in _runCallbacks
>     current.result = callback(current.result, *args, **kw)
>   File "/usr/lib/pymodules/python2.7/scrapy/spiders/__init__.py", line 76, 
> in parse
>     raise NotImplementedError
> NotImplementedError
> 2015-10-12 19:31:21 [scrapy] INFO: Closing spider (finished)
>
> When I googled the error, my first hit was:
>
> http://stackoverflow.com/questions/5264829/why-does-scrapy-throw-an-error-for-me-when-trying-to-spider-and-parse-a-site
>
> The answer, according to the OP, was to change from BaseSpider to 
> CrawlSpider. But, I repeat, this is copied verbatim from the example in 
> the docs. So how can it throw an error? In fact, the whole point of the 
> example in the docs is to show how to crawl a site WITHOUT CrawlSpider, 
> which is introduced for the first time in a note at the end of section 
> 2.3.4.
>
> Another SO post had a similar issue, but in that case the original code 
> was subclassed from CrawlSpider, and the OP was told he had accidentally 
> overridden parse(). But I see parse() being used in various examples in 
> the docs, including this one. What, exactly, constitutes 'overriding 
> parse()'? Is it adding variables, like the examples in the docs do? How 
> can that be?
>
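> As far as I can tell (this is my reading of the docs, not gospel), 
> overriding parse() is only a problem for CrawlSpider, because 
> CrawlSpider defines parse() internally to drive its rules. A sketch of 
> what the docs seem to want in that case, with a made-up spider name and 
> a renamed callback:
>
>     from scrapy.spiders import CrawlSpider, Rule
>     from scrapy.linkextractors import LinkExtractor
>
>     class ExampleCrawlSpider(CrawlSpider):
>         name = "crawl_example"
>         allowed_domains = ["dmoz.org"]
>         start_urls = ["http://www.dmoz.org/"]
>
>         rules = (
>             # The callback must NOT be named 'parse': CrawlSpider's own
>             # parse() dispatches these rules, so overriding it silently
>             # breaks the rule handling.
>             Rule(LinkExtractor(allow=r'/Computers/'),
>                  callback='parse_item', follow=True),
>         )
>
>         def parse_item(self, response):
>             yield {'url': response.url}
>
> A plain scrapy.Spider, by contrast, is required to define parse(), since 
> parse() is the default callback for the start_urls requests.
>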
> Furthermore, the callback in this case is explicitly not parse, but 
> parse_dir_contents.
>
> What is going on here? Please, I'd like an explanation of why, as well 
> as the hopefully simple answer. Thanks.
>
