It is an open issue with the Hadoop file committer, not Spark. A simple workaround is to write to HDFS and then copy the output to S3. Netflix gave a talk about their custom output committer at the last Spark Summit, which is a clever, efficient way of doing exactly that; I'd check it out on YouTube. They have open-sourced their implementation, but it does not work (out of the box) with Parquet.
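The two-phase pattern can be sketched as below. This is a minimal local simulation of the idea, not a real Spark job: the directory paths and the `write_then_copy` helper are hypothetical stand-ins. In an actual job you would write Parquet to an `hdfs://` path with Spark's normal writer and then copy the finished output to S3 with a tool such as `hadoop distcp`.

```python
import shutil
import tempfile
from pathlib import Path

def write_then_copy(records, staging_root, final_root):
    """Simulate the workaround: write part files to a fast,
    rename-friendly filesystem first (HDFS in the real case),
    then copy the completed output to the slow object store (S3).
    Only finished files ever land at the final destination."""
    staging = Path(staging_root)
    staging.mkdir(parents=True, exist_ok=True)
    # Phase 1: write all part files to the staging area (stand-in for HDFS).
    for i, rec in enumerate(records):
        (staging / f"part-{i:05d}.parquet").write_text(rec)
    # Phase 2: copy the completed directory to the destination
    # (stand-in for hadoop distcp to an s3:// path).
    final = Path(final_root)
    shutil.copytree(staging, final, dirs_exist_ok=True)
    return sorted(p.name for p in final.iterdir())

tmp = tempfile.mkdtemp()
names = write_then_copy(["a", "b"], f"{tmp}/staging", f"{tmp}/final")
```

The point of the split is that the slow, rename-based commit happens on HDFS where renames are cheap and atomic, and S3 only ever sees a single bulk copy of finished files.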
-TJ

On Mon, Nov 20, 2017 at 11:48 Jim Carroll <jimfcarr...@gmail.com> wrote:
> I have this exact issue. I was going to intercept the call in the
> filesystem if I had to (since we're using the S3 filesystem from Presto
> anyway), but if there's simply a way to do this correctly I'd much prefer
> it. This basically doubles the time to write Parquet files to S3.