Hi Capi, thank you very much. I was looking for the json file in /home/marco/crawlscrape/urls_listing, as explicitly declared in settings.py. You are right: the json file, contrary to what I expected, is in /var/lib/scrapyd/items/urls_listing/urls_grasping.
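In case it helps anyone else reading this, the items can be read straight back out of that .jl file; a minimal sketch in Python (the path is the one scrapyd printed in the crawl log below):

    import json

    # Path taken from the "Stored jsonlines feed" line of the crawl log.
    feed_path = ('/var/lib/scrapyd/items/urls_listing/urls_grasping/'
                 '0b4518bea58411e482bcc04a00090e80.jl')

    with open(feed_path) as f:
        for line in f:                # jsonlines: one JSON object per line
            item = json.loads(line)
            print(item['url'])        # 'url' is the field from UrlsListingItem

(If I understand scrapyd correctly, a default install also serves these files over HTTP, under http://localhost:6800/items/.)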
Thanks for helping.
Marco

2015-02-04 16:59 GMT+01:00 Capi Etheriel <[email protected]>:

    2015-01-26 18:52:08+0100 [urls_grasping] INFO: Stored jsonlines feed (1 items) in:
    /var/lib/scrapyd/items/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.jl

    right?

On Saturday, 31 January 2015 at 08:00:42 UTC-2, Marco Ippolito wrote:

    Any suggestions?

    Marco

On Monday, 26 January 2015 at 19:00:14 UTC+1, Marco Ippolito wrote:

    Hi,

    I'm trying to export scrapyd's output to a json file.

    marco@pc:~/crawlscrape/urls_listing$ curl http://localhost:6800/listversions.json?project="urls_listing"
    {"status": "ok", "versions": []}

    marco@pc:~/crawlscrape/urls_listing$ scrapyd-deploy urls_listing -p urls_listing
    Packing version 1422294714
    Deploying to project "urls_listing" in http://localhost:6800/addversion.json
    Server response (200):
    {"status": "ok", "project": "urls_listing", "version": "1422294714", "spiders": 1}

    marco@pc:~/crawlscrape/urls_listing$ curl http://localhost:6800/schedule.json -d project=urls_listing -d spider=urls_grasping
    {"status": "ok", "jobid": "0b4518bea58411e482bcc04a00090e80"}

    And this is the log file:

    2015-01-26 18:52:08+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: urls_listing)
    2015-01-26 18:52:08+0100 [scrapy] INFO: Optional features available: ssl, http11
    2015-01-26 18:52:08+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'urls_listing.spiders', 'SPIDER_MODULES': ['urls_listing.spiders'], 'FEED_URI': '/var/lib/scrapyd/items/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.jl', 'LOG_FILE': '/var/log/scrapyd/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.log', 'BOT_NAME': 'urls_listing'}
    2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-01-26 18:52:08+0100 [scrapy] INFO: Enabled item pipelines:
    2015-01-26 18:52:08+0100 [urls_grasping] INFO: Spider opened
    2015-01-26 18:52:08+0100 [urls_grasping] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-01-26 18:52:08+0100 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2015-01-26 18:52:08+0100 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
    2015-01-26 18:52:08+0100 [urls_grasping] DEBUG: Redirecting (301) to <GET http://www.ilsole24ore.com/> from <GET http://www.sole24ore.com/>
    2015-01-26 18:52:08+0100 [urls_grasping] DEBUG: Crawled (200) <GET http://www.ilsole24ore.com/> (referer: None)
    2015-01-26 18:52:08+0100 [urls_grasping] DEBUG: Scraped from <200 http://www.ilsole24ore.com/>
    {'url': [u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2015/ravvedimento/index.shtml',
             u'http://www.ilsole24ore.com/ebook/norme-e-tributi/2015/ravvedimento/index.shtml',
             u'http://www.ilsole24ore.com/cultura.shtml',
             u'http://www.casa24.ilsole24ore.com/',
             u'http://www.moda24.ilsole24ore.com/',
             u'http://food24.ilsole24ore.com/',
             u'http://www.motori24.ilsole24ore.com/',
             u'http://job24.ilsole24ore.com/',
             u'http://stream24.ilsole24ore.com/',
             u'http://www.viaggi24.ilsole24ore.com/',
             u'http://www.salute24.ilsole24ore.com/',
             u'http://www.shopping24.ilsole24ore.com/',
             u'http://www.radio24.ilsole24ore.com/',
             u'http://america24.com/',
             u'http://meteo24.ilsole24ore.com/',
             u'https://24orecloud.ilsole24ore.com/',
             u'http://www.ilsole24ore.com/feed/agora/agora.shtml',
             u'http://www.formazione.ilsole24ore.com/',
             u'http://nova.ilsole24ore.com/',
             ...(omitted)
             u'http://websystem.ilsole24ore.com/',
             u'http://www.omniture.com']}
    2015-01-26 18:52:08+0100 [urls_grasping] INFO: Closing spider (finished)
    2015-01-26 18:52:08+0100 [urls_grasping] INFO: Stored jsonlines feed (1 items) in: /var/lib/scrapyd/items/urls_listing/urls_grasping/0b4518bea58411e482bcc04a00090e80.jl
    2015-01-26 18:52:08+0100 [urls_grasping] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 434,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 51709,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/301': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 1, 26, 17, 52, 8, 820513),
     'item_scraped_count': 1,
     'log_count/DEBUG': 5,
     'log_count/INFO': 8,
     'response_received_count': 1,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2015, 1, 26, 17, 52, 8, 612923)}
    2015-01-26 18:52:08+0100 [urls_grasping] INFO: Spider closed (finished)

    But there is no output.json:

    marco@pc:~/crawlscrape/urls_listing$ ls -a
    .  ..  build  project.egg-info  scrapy.cfg  setup.py  urls_listing

    In ~/crawlscrape/urls_listing/urls_listing/, in items.py:

    class UrlsListingItem(scrapy.Item):
        # define the fields for your item here like:
        # url = scrapy.Field()
        # url = scrapy.Field(serializer=UrlsListingJsonExporter)
        url = scrapy.Field(serializer=serialize_url)

    In pipelines.py I put:

    # Imports the pipeline below needs (Scrapy 0.24-era module paths).
    from scrapy import signals
    from scrapy.xlib.pydispatch import dispatcher
    from scrapy.contrib.exporter import JsonLinesItemExporter

    class JsonExportPipeline(object):

        def __init__(self):
            dispatcher.connect(self.spider_opened, signals.spider_opened)
            dispatcher.connect(self.spider_closed, signals.spider_closed)
            self.files = {}

        def spider_opened(self, spider):
            file = open('%s_items.json' % spider.name, 'w+b')
            self.files[spider] = file
            self.exporter = JsonLinesItemExporter(file)
            self.exporter.start_exporting()

        def spider_closed(self, spider):
            self.exporter.finish_exporting()
            file = self.files.pop(spider)
            file.close()

        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item

    In settings.py I put:

    BOT_NAME = 'urls_listing'

    SPIDER_MODULES = ['urls_listing.spiders']
    NEWSPIDER_MODULE = 'urls_listing.spiders'

    FEED_URI = 'file://home/marco/crawlscrape/urls_listing/output.json'
    # FEED_URI = 'output.json'
    FEED_FORMAT = 'jsonlines'

    FEED_EXPORTERS = {
        'jsonlines': 'scrapy.contrib.exporter.JsonLinesItemExporter',
    }

    What am I doing wrong?
    Looking forward to your kind help.

    Marco
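Two notes for readers who land on this thread with the same problem. First, the reason the FEED_URI from settings.py is ignored here is visible in the "Overridden settings" line of the log: scrapyd schedules the crawl with its own FEED_URI, pointing under /var/lib/scrapyd/items/, which is exactly where the file turned up. Separately, even in a plain `scrapy crawl` run (no scrapyd) the FEED_URI above would misbehave, because with only two slashes "home" is parsed as the URI's host rather than part of the path; a sketch of the corrected setting:

    # In settings.py, for a direct `scrapy crawl` run (scrapyd overrides this):
    # a file URI needs three slashes; 'file://home/...' treats 'home' as a host.
    FEED_URI = 'file:///home/marco/crawlscrape/urls_listing/output.json'
    FEED_FORMAT = 'jsonlines'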
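Second, the log line "Enabled item pipelines:" is empty, so the JsonExportPipeline above was never run: a custom pipeline only takes effect once it is registered in settings.py. A minimal sketch (300 is just an arbitrary ordering value):

    # In settings.py: enable the pipeline defined in urls_listing/pipelines.py.
    ITEM_PIPELINES = {
        'urls_listing.pipelines.JsonExportPipeline': 300,
    }

With that in place, the pipeline writes urls_grasping_items.json in the process's working directory, independently of the feed export.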
