It is an open issue with the Hadoop file output committer, not Spark. The
simple workaround is to write to HDFS and then copy the result to S3.
Netflix gave a talk about their custom S3 output committer at the last
Spark Summit, which is a clever, efficient way of doing that - I'd check it
out on YouTube. They have open-sourced their implementation, but it does
not work out of the box with Parquet.
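
For illustration, here is a minimal sketch of the write-to-HDFS-then-copy
approach (the paths, the DataFrame "df", and the SparkSession "spark" are
placeholders I'm assuming, not anything from the Netflix committer):

import org.apache.hadoop.fs.{FileUtil, Path}

// Stage the Parquet output on HDFS, where the default file output
// committer's final rename is cheap, then copy the committed files to S3.
// hdfsStaging, s3Target, df and spark are placeholders for this sketch.
val hdfsStaging = "hdfs:///tmp/staging/my_table"
val s3Target    = "s3a://my-bucket/warehouse/my_table"

df.write.mode("overwrite").parquet(hdfsStaging)

val hadoopConf = spark.sparkContext.hadoopConfiguration
val srcFs = new Path(hdfsStaging).getFileSystem(hadoopConf)
val dstFs = new Path(s3Target).getFileSystem(hadoopConf)

// deleteSource = true removes the HDFS staging copy once the upload
// finishes; for very large outputs, running `hadoop distcp` instead
// parallelizes the copy across the cluster.
FileUtil.copy(srcFs, new Path(hdfsStaging), dstFs, new Path(s3Target),
  true /* deleteSource */, hadoopConf)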

-TJ

On Mon, Nov 20, 2017 at 11:48 Jim Carroll <jimfcarr...@gmail.com> wrote:

> I have this exact issue. I was going to intercept the call in the
> filesystem if I had to (since we're using the S3 filesystem from Presto
> anyway), but if there's simply a way to do this correctly I'd much prefer
> it. This basically doubles the time to write parquet files to S3.
