You can expect to see fixes for this sort of issue in the medium term
(multiple months, probably not years).

As Tayler notes, it's a Hadoop problem, not a Spark problem.  So whichever
version of Hadoop includes the fix will then have to wait for a Spark
release built against it.  Last I checked, the fix was targeted at Hadoop 3.0.

Others have listed some middleware-style fixes which we haven't tried.
We've just been writing to the local FS and then using boto to copy the
files up to S3.  Our use case has a lot of slack in its timeliness, though,
so although we know it's an issue, it isn't a serious enough problem for us
to try to fix on our own at this point.
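
Roughly, that pattern looks like the sketch below (a minimal sketch only,
assuming PySpark plus boto3; "df", the staging directory, bucket and key
prefix are all placeholders, and writing to a local path like this only
makes sense where the job can actually see that path, e.g. a single node
or a shared mount):

    import os
    import boto3

    # 1. Write the Parquet output to the local filesystem first, so the
    #    slow rename-based S3 commit never happens.
    #    "df" is assumed to be an existing Spark DataFrame.
    local_dir = "/tmp/parquet_staging/my_table"     # placeholder path
    df.write.mode("overwrite").parquet("file://" + local_dir)

    # 2. Copy the resulting files up to S3 with boto3.
    s3 = boto3.client("s3")
    bucket = "my-bucket"                            # placeholder bucket
    prefix = "warehouse/my_table"                   # placeholder prefix

    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            key = prefix + "/" + os.path.relpath(local_path, local_dir)
            s3.upload_file(local_path, bucket, key)

The same idea works if you stage to HDFS instead and copy with something
like hadoop distcp, which is essentially the workaround Tayler describes
below.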

Gary

On 20 November 2017 at 12:56, Tayler Lawrence Jones <t.jonesd...@gmail.com>
wrote:

> It is an open issue with the Hadoop output committer, not Spark. The simple
> workaround is to write to HDFS and then copy to S3. Netflix gave a talk
> about their custom output committer at the last Spark Summit, which is a
> clever, efficient way of doing that - I’d check it out on YouTube. They have
> open sourced their implementation, but it does not work (out of the box)
> with Parquet.
>
> -TJ
>
> On Mon, Nov 20, 2017 at 11:48 Jim Carroll <jimfcarr...@gmail.com> wrote:
>
>> I have this exact issue. I was going to intercept the call in the
>> filesystem if I had to (since we're using the S3 filesystem from Presto
>> anyway), but if there's simply a way to do this correctly I'd much prefer
>> it. This basically doubles the time to write Parquet files to S3.
