From: Steve Loughran [mailto:ste...@hortonworks.com]
Sent: Saturday, September 30, 2017 6:10 AM
To: JG Perrin <jper...@lumeris.com>
Cc: Alexander Czech <alexander.cz...@googlemail.com>; user@spark.apache.org
Subject: Re: HDFS or NFS as a cache?
On 29 Sep 2017, at 20:03, JG Perrin <jper...@lumeris.com> wrote:
You will collect in the driver (often the master) and it will save the data, so
for saving, you will not have to set up HDFS.
no, it doesn't work quite like that.
1. workers generate their data and
On 29 Sep 2017, at 15:59, Alexander Czech wrote:
Yes I have identified the rename as the problem, that is why I think the extra
bandwidth of the larger instances might not help. Also there is a consistency
issue with S3
From: Alexander Czech [mailto:alexander.cz...@googlemail.com]
Sent: Friday, September 29, 2017 8:15 AM
To: user@spark.apache.org
Subject: HDFS or NFS as a cache?
I have
Yes, I have identified the rename as the problem, which is why I think the
extra bandwidth of the larger instances might not help. There is also a
consistency issue with S3 because of how the rename works, so I will
probably lose data.
On Fri, Sep 29, 2017 at 4:42 PM, Vadim Semenov wrote:
How many files do you produce? I believe it spends a lot of time renaming
the files because of the output committer.
Also, instead of 5x c3.2xlarge, try using 2x c3.8xlarge because they
have 10GbE and you can get good throughput for S3.
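
To illustrate why the number of output files matters here, the following is a minimal sketch, not the actual committer code. It assumes an object store with no native rename, so a "directory rename" has to be emulated as one copy plus one delete per object. The ToyObjectStore class, the commit_job helper, and the "out/_temporary/" prefix are hypothetical names used only for this illustration.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store without a native rename (hypothetical)."""

    def __init__(self):
        self.objects = {}      # key -> bytes payload
        self.bytes_copied = 0  # payload bytes moved by emulated renames
        self.requests = 0      # number of store requests issued

    def put(self, key, data):
        self.objects[key] = data
        self.requests += 1

    def rename_prefix(self, src, dst):
        """Emulate a directory rename: copy each object, then delete the original."""
        # Snapshot the matching keys first so the dict is not mutated while iterating.
        for key in sorted(k for k in self.objects if k.startswith(src)):
            data = self.objects[key]
            self.objects[dst + key[len(src):]] = data  # copy request
            self.bytes_copied += len(data)
            del self.objects[key]                      # delete request
            self.requests += 2


def commit_job(store, num_files, file_size):
    """Write num_files under a temporary prefix, then 'rename' them into place.

    Returns (requests issued by the rename, bytes copied by the rename).
    """
    for i in range(num_files):
        store.put(f"out/_temporary/part-{i:05d}", b"x" * file_size)
    before = store.requests
    store.rename_prefix("out/_temporary/", "out/")
    return store.requests - before, store.bytes_copied
```

In this toy model the cost of the final rename grows with both the number of files and the total bytes written, which is why asking how many files the job produces is a useful first question.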
On Fri, Sep 29, 2017 at 9:15 AM, Alexander Czech <