Hello, I've now spent two full days trying to find a solution to my problem: when I start my Scrapy spider from the terminal, I get my results as a CSV and I can see them in the terminal. I seem to be unable to do the same when I start the spider from within a script. What I want to do in the end is to call Scrapy from a script (passing a few parameters) and get the results as a list or a pandas DataFrame.
I have googled and viewed all the discussion posts I could find, including:

- http://stackoverflow.com/questions/13437402/how-to-run-scrapy-from-within-a-python-script
- http://doc.scrapy.org/en/latest/topics/practices.html
- http://stackoverflow.com/questions/16994768/how-to-call-particular-scrapy-spiders-from-another-python-script

Does anyone have a working set of scripts that allows calling Scrapy from within a script and collecting the results? Does anyone see a flaw in my approach? Thank you very much for your help! I am happy to help you in return with whatever I might be helpful at (probably not much...).

Pascal

*My callScrapy script:*

    import os
    os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'Scraper.settings')

    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy.utils.project import get_project_settings

    # Import the actual scraper
    from Scraper.spiders.GoogleScraper_v1 import GoogleScraper_v1

    def stop_reactor():
        reactor.stop()

    def start_scraper():
        # from: https://scrapy.readthedocs.org/en/latest/topics/practices.html
        spider = GoogleScraper_v1()
        settings = get_project_settings()
        print str(settings)
        crawler = Crawler(settings)
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()
        log.start()
        log.msg('Running reactor...')
        reactor.run()  # the script will block here until the spider is closed
        log.msg('Reactor stopped.')

    if __name__ == "__main__":
        # Call method to start scraper
        start_scraper()

*My pipeline:*

    from scrapy import signals
    from scrapy.contrib.exporter import CsvItemExporter

    # Exports scraped results as a CSV file
    # Source:
    # http://stackoverflow.com/questions/20753358/how-can-i-use-the-fields-to-export-attribute-in-baseitemexporter-to-order-my-scr

    class CSVPipeline(object):

        def __init__(self):
            self.files = {}

        @classmethod
        def from_crawler(cls, crawler):
            pipeline = cls()
            crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
            crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
            return pipeline

        def spider_opened(self, spider):
            file = open('%s_items.csv' % spider.name, 'w+b')
            self.files[spider] = file
            self.exporter = CsvItemExporter(file)
            self.exporter.fields_to_export = ['desc', 'link']  # <--- what goes here instead of desc?
            self.exporter.start_exporting()

        def spider_closed(self, spider):
            self.exporter.finish_exporting()
            file = self.files.pop(spider)
            file.close()

        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item

    class NormalPipeline(object):

        def process_item(self, item, spider):
            return item

*My console output:*

    2015-03-13 12:41:06+0000 [scrapy] INFO: Running reactor...
    2015-03-13 12:41:06+0000 [GoogleScraper_v1] INFO: Closing spider (finished)
    2015-03-13 12:41:06+0000 [GoogleScraper_v1] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 259,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 113002,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 3, 13, 12, 41, 6, 628538),
         'item_scraped_count': 69,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2015, 3, 13, 12, 41, 6, 52977)}
    2015-03-13 12:41:06+0000 [GoogleScraper_v1] INFO: Spider closed (finished)
    2015-03-13 12:41:06+0000 [scrapy] INFO: Reactor stopped.
