For reference, this question was answered on Stack Overflow at https://stackoverflow.com/questions/35417865
and an issue has been opened to allow customizing the tempfile location: https://github.com/scrapy/scrapy/issues/1779

/Paul

On Thursday, February 11, 2016 at 2:31:18 PM UTC+1, [email protected] wrote:
> Hello,
>
> I'm running a long-running web crawl using scrapyd and scrapy 1.0.3 on an
> Amazon EC2 instance. I'm exporting jsonlines files to S3 using these
> parameters in my spider/settings.py file:
>
> FEED_FORMAT: jsonlines
> FEED_URI: s3://my-bucket-name
>
> I've done this a number of times successfully without any issues. But I'm
> running into a problem on one particularly large crawl: the local disk
> (which isn't particularly big) fills up with the in-progress crawl's data
> before the crawl can fully complete, and thus before the results can be
> uploaded to S3.
>
> I'm wondering if there is any way to configure where the "intermediate"
> results of this crawl are written prior to being uploaded to S3? I'm
> assuming that however Scrapy internally represents the in-progress crawl
> data, it is not held entirely in RAM but written to disk somewhere, and if
> that's the case, I'd like to set that location to an external mount with
> enough space to hold the results before shipping the completed .jl file
> to S3.
>
> Thanks, and apologies if this is in the documentation and I couldn't
> find it.
>
> Brian
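For anyone who hits the same disk-space problem before the linked issue is resolved: the S3 feed storage buffers the whole export in a temporary file created with Python's standard tempfile module, and tempfile falls back to the TMPDIR environment variable when choosing a directory. As a workaround sketch (under those assumptions, not an official Scrapy feature; the /mnt/bigdisk path is just an example), you can start scrapyd with TMPDIR pointing at a mount that has enough free space, e.g. TMPDIR=/mnt/bigdisk/scrapy-tmp scrapyd, and keep the feed settings from the question as ordinary Python assignments in settings.py:

    # settings.py -- sketch only; bucket name taken from the question above
    FEED_FORMAT = 'jsonlines'
    FEED_URI = 's3://my-bucket-name'

    # Credentials the S3 feed storage needs (values here are placeholders)
    AWS_ACCESS_KEY_ID = '...'
    AWS_SECRET_ACCESS_KEY = '...'

If issue 1779 is implemented, a dedicated setting for the temporary directory would replace the environment-variable trick.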
