It's not safe to use the direct committer with append mode; you may lose your data.
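For context, a minimal sketch of how the direct committer is typically opted into (assuming Spark 1.5/1.6, where DirectParquetOutputCommitter still exists; the package name moved between versions, so treat the class name as illustrative rather than a drop-in value):

    // Sketch only, assuming a SparkContext named sc on Spark 1.5/1.6.
    // The direct committer is selected via the Hadoop configuration, but for
    // appends (and when speculation is enabled) Spark falls back to the
    // default ParquetOutputCommitter, so this setting is effectively ignored.
    sc.hadoopConfiguration.set(
      "spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.parquet.DirectParquetOutputCommitter") // package differs by Spark version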
On 4 March 2016 at 22:59, Jelez Raditchkov <je...@hotmail.com> wrote:
> Working on a streaming job with DirectParquetOutputCommitter to S3
> I need to use PartitionBy and hence SaveMode.Append
>
> Apparently when using SaveMode.Append Spark automatically defaults to the
> default parquet output committer and ignores DirectParquetOutputCommitter.
>
> My problems are:
> 1. the copying to _temporary takes a lot of time
> 2. I get job failures with: java.io.FileNotFoundException: File
> s3n://jelez/parquet-data/_temporary/0/task_201603040904_0544_m_000007 does
> not exist.
>
> I have set:
>     sparkConfig.set("spark.speculation", "false")
>     sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
>     sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")
>
> Any ideas? Opinions? Best practices?
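To make the quoted setup concrete, a minimal sketch of the write path being described (bucket, input source and partition column are placeholders, not from the original message). With SaveMode.Append the default committer writes task output under _temporary and then commits it by rename, and on S3 a rename is a copy plus delete, which is where the slow copy phase comes from:

    import org.apache.spark.sql.SaveMode

    // Hypothetical input; the relevant part is the partitioned append to S3.
    val df = sqlContext.read.json("s3n://example-bucket/incoming/")
    df.write
      .mode(SaveMode.Append)          // append forces the default ParquetOutputCommitter
      .partitionBy("event_date")      // placeholder partition column
      .parquet("s3n://example-bucket/parquet-data/")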