For reference, this question was answered on Stack Overflow at
https://stackoverflow.com/questions/35417865

and an issue has been opened to make the tempfile location configurable:
https://github.com/scrapy/scrapy/issues/1779
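
Until that's implemented, a possible workaround (just a sketch, assuming the S3 feed storage buffers the in-progress feed through Python's tempfile module, which honours the TMPDIR environment variable) is to point TMPDIR at a volume with enough free space before starting scrapyd, e.g.:

    import os
    import tempfile

    # Illustrative path only -- use whatever mount has enough free space.
    big_tmp = "/mnt/bigvolume/scrapy-tmp"
    if not os.path.isdir(big_tmp):
        os.makedirs(big_tmp)
    os.environ["TMPDIR"] = big_tmp

    # tempfile caches the directory after the first lookup, so clear the
    # cache in case gettempdir() has already been called in this process.
    tempfile.tempdir = None

    print(tempfile.gettempdir())  # should now report /mnt/bigvolume/scrapy-tmp

The variable has to be visible to the scrapyd process itself, since the spider subprocesses inherit their environment from it.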

/Paul

On Thursday, February 11, 2016 at 2:31:18 PM UTC+1, [email protected] 
wrote:
>
> Hello,
>
> I'm running a long-running web crawl using scrapyd and scrapy 1.0.3 on an 
> Amazon EC2 instance. I'm exporting jsonlines files to S3 using these 
> parameters in my spider/settings.py file:
>
> FEED_FORMAT = 'jsonlines'
> FEED_URI = 's3://my-bucket-name'
>
> I've done this a number of times successfully without any issues. But I'm 
> running into a problem on one particularly large crawl: the local disk 
> (which isn't particularly big) fills up with the in-progress crawl's data 
> before it can fully complete, and thus before the results can be uploaded 
> to S3.
>
> Is there any way to configure where the "intermediate" results of this 
> crawl are written prior to being uploaded to S3? I'm assuming the 
> in-progress crawl data is not held entirely in RAM but written to disk 
> somewhere, and if that's the case, I'd like to point that location at an 
> external mount with enough space to hold the results before the completed 
> .jl file is shipped to S3.
>
> Thanks, and apologies if this is in the documentation and I couldn't find 
> it.
>
> Brian
>
>
