Hello, I've now spent two full days trying to find a solution to my problem: when I start my Scrapy spider from the terminal, I get my results as a CSV and I can see them in the terminal. I seem to be unable to do the same when I start the spider from within a script. What I want to do in the end is to call Scrapy from a script (passing a few parameters) and get the results as a list or a pandas DataFrame.
I have googled and viewed all the discussion posts I could find, including:

- http://stackoverflow.com/questions/13437402/how-to-run-scrapy-from-within-a-python-script
- http://doc.scrapy.org/en/latest/topics/practices.html
- http://stackoverflow.com/questions/16994768/how-to-call-particular-scrapy-spiders-from-another-python-script

Does anyone have a working set of scripts that allows calling Scrapy from within a script and collecting the results? Does anyone see a flaw in my approach? Thank you very much for your help! I am happy to help you in return with whatever I might be helpful at (probably not much...).

Pascal

*My callScrapy script:*

    import os
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'Scraper.settings')

    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy.utils.project import get_project_settings

    # Import the actual scraper
    from Scraper.spiders.GoogleScraper_v1 import GoogleScraper_v1

    def stop_reactor():
        reactor.stop()

    def start_scraper():
        # from: https://scrapy.readthedocs.org/en/latest/topics/practices.html
        spider = GoogleScraper_v1()
        settings = get_project_settings()
        print str(settings)
        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()
        log.start()
        log.msg('Running reactor...')
        reactor.run()  # the script will block here until the spider is closed
        log.msg('Reactor stopped.')

    if __name__ == "__main__":
        # Call method to start scraper
        start_scraper()

*My pipeline:*

    from scrapy import signals
    from scrapy.contrib.exporter import CsvItemExporter

    # Exports scraped results as a CSV file
    # Source:
    # http://stackoverflow.com/questions/20753358/how-can-i-use-the-fields-to-export-attribute-in-baseitemexporter-to-order-my-scr

    class CSVPipeline(object):

        def __init__(self):
            self.files = {}

        @classmethod
        def from_crawler(cls, crawler):
            pipeline = cls()
            crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
            crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
            return pipeline

        def spider_opened(self, spider):
            file = open('%s_items.csv' % spider.name, 'w+b')
            self.files[spider] = file
            self.exporter = CsvItemExporter(file)
            self.exporter.fields_to_export = ['desc', 'link']  # <--- what goes here instead of desc?
            self.exporter.start_exporting()

        def spider_closed(self, spider):
            self.exporter.finish_exporting()
            file = self.files.pop(spider)
            file.close()

        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item

    class NormalPipeline(object):

        def process_item(self, item, spider):
            return item

*My console output:*

    2015-03-13 12:41:06+0000 [scrapy] INFO: Running reactor...
    2015-03-13 12:41:06+0000 [GoogleScraper_v1] INFO: Closing spider (finished)
    2015-03-13 12:41:06+0000 [GoogleScraper_v1] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 259,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 113002,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 3, 13, 12, 41, 6, 628538),
         'item_scraped_count': 69,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2015, 3, 13, 12, 41, 6, 52977)}
    2015-03-13 12:41:06+0000 [GoogleScraper_v1] INFO: Spider closed (finished)
    2015-03-13 12:41:06+0000 [scrapy] INFO: Reactor stopped.
