It's not safe to use the direct committer with append mode; you may lose your data.
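For context, a minimal sketch of how the direct committer is typically opted into (assuming Spark 1.5/1.6, where DirectParquetOutputCommitter still exists; the package name moved between versions, so treat the class name as illustrative rather than a drop-in value):

    // Sketch only, assuming a SparkContext named sc on Spark 1.5/1.6.
    // The direct committer is selected via the Hadoop configuration, but for
    // appends (and when speculation is enabled) Spark falls back to the
    // default ParquetOutputCommitter, so this setting is effectively ignored.
    sc.hadoopConfiguration.set(
      "spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.parquet.DirectParquetOutputCommitter") // package differs by Spark version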
On 4 March 2016 at 22:59, Jelez Raditchkov <je...@hotmail.com> wrote:
> Working on a streaming job with DirectParquetOutputCommitter to S3
> I need to use PartitionBy and hence SaveMode.Append
>
> Apparently when using SaveMode.Append Spark automatically defaults to the
> default parquet output committer and ignores DirectParquetOutputCommitter.
>
> My problems are:
> 1. the copying to _temporary takes a lot of time
> 2. I get job failures with: java.io.FileNotFoundException: File
> s3n://jelez/parquet-data/_temporary/0/task_201603040904_0544_m_000007 does
> not exist.
>
> I have set:
>     sparkConfig.set("spark.speculation", "false")
>     sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
>     sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")
>
> Any ideas? Opinions? Best practices?
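To make the quoted setup concrete, a minimal sketch of the write path being described (bucket, input source and partition column are placeholders, not from the original message). With SaveMode.Append the default committer writes task output under _temporary and then commits it by rename, and on S3 a rename is a copy plus delete, which is where the slow copy phase comes from:

    import org.apache.spark.sql.SaveMode

    // Hypothetical input; the relevant part is the partitioned append to S3.
    val df = sqlContext.read.json("s3n://example-bucket/incoming/")
    df.write
      .mode(SaveMode.Append)          // append forces the default ParquetOutputCommitter
      .partitionBy("event_date")      // placeholder partition column
      .parquet("s3n://example-bucket/parquet-data/")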