You don't need to collect data in the driver to save it. The code in
the original question doesn't use "collect()", so it's actually doing
a distributed write.
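
For illustration, a minimal sketch of the difference (the SparkSession setup
and the bucket name are just placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000)   # stand-in DataFrame

    # Distributed write: each executor writes its own partition files
    # straight to the destination; nothing is funnelled through the driver.
    df.write.parquet('s3a://some-bucket/output')

    # Driver-side alternative: collect() pulls every row into the driver's
    # memory first, which is what can trigger the OutOfMemoryError quoted below.
    rows = df.collect()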


On Mon, Oct 2, 2017 at 11:26 AM, JG Perrin <jper...@lumeris.com> wrote:
> Steve,
>
>
>
> If I refer to the collect() API, it says “Running collect requires moving
> all the data into the application's driver process, and doing so on a very
> large dataset can crash the driver process with OutOfMemoryError.” So why
> would you need a distributed FS?
>
>
>
> jg
>
>
>
> From: Steve Loughran [mailto:ste...@hortonworks.com]
> Sent: Saturday, September 30, 2017 6:10 AM
> To: JG Perrin <jper...@lumeris.com>
> Cc: Alexander Czech <alexander.cz...@googlemail.com>; user@spark.apache.org
> Subject: Re: HDFS or NFS as a cache?
>
>
>
>
>
> On 29 Sep 2017, at 20:03, JG Perrin <jper...@lumeris.com> wrote:
>
>
>
> You will collect in the driver (often the master) and it will save the data,
> so for saving, you will not have to set up HDFS.
>
>
>
> no, it doesn't work quite like that.
>
>
>
> 1. workers generate their data and save it somewhere
>
> 2. on "task commit" they move their data to some location where it will be
> visible for "job commit" (rename, upload, whatever)
>
> 3. job commit, which is done in the driver, takes all the committed task
> data and makes it visible in the destination directory.
>
> 4. Then the job committer creates a _SUCCESS file to say "done!"
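>
> As a rough sketch of what those four steps leave behind (the HDFS path is a
> placeholder, the part-file names are schematic, and df is the DataFrame from
> the question below):
>
>     df.write.parquet('hdfs:///data/output')
>
>     # after job commit the destination looks roughly like:
>     #   /data/output/_SUCCESS
>     #   /data/output/part-00000-<uuid>.snappy.parquet
>     #   /data/output/part-00001-<uuid>.snappy.parquet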
>
>
>
>
>
> This is coordinated by Spark talking between the workers and the driver, to
> guarantee that only one task attempt working on a specific part of the data
> commits its work, and that the job is only committed once all the tasks have
> finished.
>
>
>
> The v1 mapreduce committer implements (2) by moving files under a job
> attempt dir, and (3) by moving them from the job attempt dir to the
> destination: one rename per task commit, another rename of every file on job
> commit. In HDFS, Azure wasb and other stores with an O(1) atomic rename,
> this isn't *too* expensive, though that final job-commit rename still takes
> time to list and move lots of files.
>
>
>
> The v2 committer implements (2) by renaming to the destination directory and
> (3) as a no-op. Rename in the tasks then, but not that second, serialized
> one at the end.
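>
> For reference, a minimal sketch of how the algorithm is usually chosen from
> Spark (the spark.hadoop. prefix just forwards the property to Hadoop's
> FileOutputCommitter; the output path is a placeholder):
>
>     from pyspark.sql import SparkSession
>
>     # property forwarded to Hadoop's FileOutputCommitter
>     key = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"
>
>     spark = (SparkSession.builder
>              .config(key, "2")      # "1" = v1, "2" = v2, as described above
>              .getOrCreate())
>
>     spark.range(1000).write.parquet('hdfs:///data/output')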
>
>
>
> There's no copy of data from workers to the driver; instead you need a shared
> output filesystem so that the job committer can do its work alongside the
> tasks.
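>
> In other words, sketched with placeholder paths (df as in the question below):
>
>     # fine on a multi-node cluster: driver and executors see the same paths
>     df.write.parquet('hdfs:///data/output')
>     df.write.parquet('s3a://bucket/output')
>
>     # risky on a multi-node cluster: file:// is each machine's local disk,
>     # so the job committer on the driver can't see the task attempts' output
>     df.write.parquet('file:///tmp/output')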
>
>
>
> There are alternative committer algorithms:
>
>
>
> 1. look at Ryan Blue's talk: https://www.youtube.com/watch?v=BgHrff5yAQo
>
> 2. IBM Stocator paper (https://arxiv.org/abs/1709.01812) and code
> (https://github.com/SparkTC/stocator/)
>
> 3. Ongoing work in Hadoop itself for better committers. Goal: year end &
> Hadoop 3.1 https://issues.apache.org/jira/browse/HADOOP-13786 . The code is
> all there (see the sketch after this list); Parquet is a troublespot, and
> more testing is welcome from anyone who wants to help.
>
> 4. Databricks have "something"; specifics aren't covered, but I assume it's
> DynamoDB-based.
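>
> Once (3) lands, selecting one of its committers would look something like the
> sketch below. This assumes a Hadoop 3.1+ build where the S3A committers and
> the fs.s3a.committer.name property exist; the bucket is a placeholder:
>
>     from pyspark.sql import SparkSession
>
>     # pick the S3A "directory" staging committer proposed in HADOOP-13786
>     spark = (SparkSession.builder
>              .config("spark.hadoop.fs.s3a.committer.name", "directory")
>              .getOrCreate())
>
>     spark.range(1000).write.parquet('s3a://bucket/output')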
>
>
>
>
>
> -Steve
>
>
> From: Alexander Czech [mailto:alexander.cz...@googlemail.com]
> Sent: Friday, September 29, 2017 8:15 AM
> To: user@spark.apache.org
> Subject: HDFS or NFS as a cache?
>
>
>
> I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write
> parquet files to S3. But S3 performance is, for various reasons, bad when I
> access S3 through the parquet write method:
>
> df.write.parquet('s3a://bucket/parquet')
>
> Now I want to set up a small cache for the parquet output. One output is
> about 12-15 GB in size. Would it be enough to set up an NFS directory on the
> master, write the output to it and then move it to S3? Or should I set up
> HDFS on the master? Or should I even opt for an additional cluster running
> an HDFS solution on more than one node?
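>
> A rough sketch of the write-locally-then-move idea (paths and bucket name are
> placeholders):
>
>     df.write.parquet('hdfs:///tmp/parquet_staging')  # fast write to HDFS
>
>     # then push the finished output to S3 in one pass, e.g. from a shell:
>     #   hadoop distcp hdfs:///tmp/parquet_staging s3a://bucket/parquet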
>
> thanks!
>
>



-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
