With Hadoop 2.7 or later, set
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
This switches to a no-rename version of the file output committer, which is faster
all round. You are still at risk of things going wrong on
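
For anyone who wants to pass the two settings above on the command line rather than in a defaults file, here is a minimal sketch that renders them as `--conf` arguments for spark-submit. The helper name `to_submit_args` and the script name `my_job.py` are made up for illustration; only the two property names and values come from the message above.

```python
# Hypothetical helper: render the two committer settings quoted above
# as spark-submit --conf arguments. Nothing here talks to Spark itself.
COMMITTER_SETTINGS = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    "spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped": "true",
}


def to_submit_args(settings):
    """Return a flat list like ['--conf', 'key=value', ...]."""
    args = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # my_job.py is a placeholder for your own application script.
    command = ["spark-submit"] + to_submit_args(COMMITTER_SETTINGS) + ["my_job.py"]
    print(" ".join(command))
```

The same key/value pairs can presumably go into spark-defaults.conf in the whitespace-separated form shown above.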
afaik no.
// maropu
On Thu, Aug 25, 2016 at 9:16 PM, Tal Grynbaum
wrote:
> Is/was there an option similar to DirectParquetOutputCommitter to write
> json files to S3 ?
>
> On Thu, Aug 25, 2016 at 2:56 PM, Takeshi Yamamuro
> wrote:
>
>> Hi,
>>
>> Seems this just prevents writers from leaving pa
Is/was there an option similar to DirectParquetOutputCommitter to write
JSON files to S3?
On Thu, Aug 25, 2016 at 2:56 PM, Takeshi Yamamuro
wrote:
> Hi,
>
> Seems this just prevents writers from leaving partial data in a
> destination dir when jobs fail.
> In the previous versions of Spark, the
Hi,
Seems this just prevents writers from leaving partial data in a destination
dir when jobs fail.
In previous versions of Spark there was a way to write data directly to
the destination, but Spark v2.0+ has no way to do that because of a
critical issue on S3 (see SPARK-10063).
// maropu
I read somewhere that it's because S3 has to know the size of the file
upfront.
I don't really understand this: why is it OK not to know it for the
temp files but not OK for the final files?
The delete permission is a minor disadvantage from my side; the worst
thing is that I have a cluster
Hi
When Spark saves anything to S3 it creates temporary files. Why? I'm asking
because this requires the access credentials to be given
delete permissions along with write permissions.