On 29 Sep 2017, at 20:03, JG Perrin <jper...@lumeris.com> wrote:
> You will collect in the driver (often the master) and it will save the data, so for saving, you will not have to set up HDFS.

No, it doesn't work quite like that.

1. Workers generate their data and save it somewhere.
2. On "task commit" they move their data to some location where it will be visible for "job commit" (rename, upload, whatever).
3. Job commit, which is done in the driver, takes all the committed task data and makes it visible in the destination directory.
4. Then the job committer creates a _SUCCESS file to say "done!"

Spark coordinates this between the workers and the driver to guarantee that only one task working on a specific part of the data commits its work, and that the job is committed only once all tasks have finished.

The v1 MapReduce committer implements (2) by moving files under a job attempt dir, and (3) by moving them from the job attempt dir to the destination: one rename per task commit, another rename of every file on job commit. In HDFS, Azure wasb and other stores with an O(1) atomic rename, this isn't *too* expensive, though that final job-commit rename still takes time to list and move lots of files.

The v2 committer implements (2) by renaming to the destination directory and (3) as a no-op. Renames still happen in the tasks, then, but not that second, serialized one at the end. (There's a config sketch at the foot of this mail showing how to select between the two.)

There's no copy of data from workers to driver; instead you need a shared output filesystem so that the job committer can do its work alongside the tasks.

There are alternative committer algorithms:

1. Look at Ryan Blue's talk: https://www.youtube.com/watch?v=BgHrff5yAQo
2. IBM Stocator paper (https://arxiv.org/abs/1709.01812) and code (https://github.com/SparkTC/stocator/)
3. Ongoing work in Hadoop itself for better committers. Goal: year end & Hadoop 3.1. https://issues.apache.org/jira/browse/HADOOP-13786 . The code is all there, Parquet is a trouble spot, and more testing is welcome from anyone who wants to help.
4. Databricks have "something"; specifics aren't covered, but I assume it's DynamoDB based.

-Steve

From: Alexander Czech [mailto:alexander.cz...@googlemail.com]
Sent: Friday, September 29, 2017 8:15 AM
To: user@spark.apache.org
Subject: HDFS or NFS as a cache?

I have a small EC2 cluster with 5 c3.2xlarge nodes and I want to write parquet files to S3. But the S3 performance for various reasons is bad when I access S3 through the parquet write method:

df.write.parquet('s3a://bucket/parquet')

Now I want to set up a small cache for the parquet output. One output is about 12-15 GB in size. Would it be enough to set up an NFS directory on the master, write the output to it and then move it to S3? Or should I set up HDFS on the master? Or should I even opt for an additional cluster running an HDFS solution on more than one node?

thanks!
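To make the v1/v2 distinction above concrete, here is a minimal PySpark sketch (an editorial addition, not part of the original exchange) showing how the committer algorithm version is selected through Spark's Hadoop configuration. The app name, bucket, and DataFrame contents are hypothetical.

    # Minimal sketch: selecting the FileOutputCommitter algorithm version.
    # Bucket and app names are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("committer-demo")
        # v1 (the default): task commit renames into a job attempt dir;
        # job commit then renames everything into the destination. Safe,
        # but adds a second, serialized rename at the end.
        # Set "2" for v2: task commit renames straight into the destination
        # and job commit is a no-op. Faster, but partial output is visible
        # if the job fails part-way through.
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
        .getOrCreate()
    )

    df = spark.range(1000)                    # stand-in for real data
    df.write.parquet("s3a://bucket/parquet")  # hypothetical bucket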
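And for the staging question itself, a sketch (same caveats, all paths hypothetical) of one way to do what the mail asks: commit the job against HDFS, where renames are O(1) and atomic, then push the finished directory to S3 in a single pass with a tool such as distcp.

    # Minimal sketch: write parquet to cluster HDFS first, then copy the
    # committed output to S3 outside the Spark job. Paths are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-staging-demo").getOrCreate()

    df = spark.range(1000)                       # stand-in for the real output
    df.write.parquet("hdfs:///staging/parquet")  # fast, atomic commit on HDFS

    # Then upload the completed directory to S3, e.g.:
    #   hadoop distcp hdfs:///staging/parquet s3a://bucket/parquet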