Gourav: Regarding the 3rd paragraph, did you mean the job seemed to be idle for about 5 minutes?
Cheers

> On Mar 6, 2016, at 3:35 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> This is a solved problem; try using s3a instead and everything will be fine.
>
> Besides that, you might want to use coalesce, partitionBy, or repartition
> in order to control how many executors are being used to write (that
> speeds things up quite a bit).
>
> We had a write issue taking close to 50 min which is now running in under
> 5 minutes.
>
> Regards,
> Gourav Sengupta
>
>> On Fri, Mar 4, 2016 at 8:59 PM, Jelez Raditchkov <je...@hotmail.com> wrote:
>> I am working on a streaming job with DirectParquetOutputCommitter to S3.
>> I need to use partitionBy and hence SaveMode.Append.
>>
>> Apparently, when using SaveMode.Append Spark automatically defaults to
>> the default Parquet output committer and ignores
>> DirectParquetOutputCommitter.
>>
>> My problems are:
>> 1. The copying to _temporary takes a lot of time.
>> 2. I get job failures with: java.io.FileNotFoundException: File
>> s3n://jelez/parquet-data/_temporary/0/task_201603040904_0544_m_000007
>> does not exist.
>>
>> I have set:
>> sparkConfig.set("spark.speculation", "false")
>> sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
>> sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")
>>
>> Any ideas? Opinions? Best practices?
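For anyone landing on this thread later, a minimal sketch of the s3a switch Gourav suggests, assuming Spark 1.x with the hadoop-aws jar on the classpath; the app name, bucket, and credential values are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("s3a-example"))

    // s3a (from hadoop-aws) replaces the older s3n filesystem and avoids
    // several of its limitations around renames and large objects.
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  // placeholder
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  // placeholder

    // Writes then target s3a:// paths instead of s3n://, e.g.
    // df.write.parquet("s3a://jelez/parquet-data")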
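And a sketch of controlling write parallelism with repartition/coalesce before the Parquet write, per Gourav's second tip; "df" stands in for the DataFrame produced by each streaming batch, and the partition column "date" and the task counts are illustrative assumptions:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    def writeBatch(df: DataFrame): Unit = {
      df.repartition(32)               // more tasks -> more parallel S3 writers
        // .coalesce(8)                // or fewer tasks -> fewer output files
        .write
        .mode(SaveMode.Append)         // Append is required with partitionBy here
        .partitionBy("date")           // hypothetical partition column
        .parquet("s3a://jelez/parquet-data")
    }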
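Finally, a hedged reconstruction of the configuration Jelez describes, as it would look on Spark 1.6; the committer config key must go on the Hadoop configuration, and the committer class path moved between Spark versions, so treat both as assumptions to verify against your version:

    import org.apache.spark.{SparkConf, SparkContext}

    // Disable Spark-level speculative execution, as in the original post,
    // so duplicate task attempts never race on the same S3 output path.
    val sparkConfig = new SparkConf()
      .setAppName("streaming-parquet-to-s3")   // assumed app name
      .set("spark.speculation", "false")

    val sc = new SparkContext(sparkConfig)

    // Disable MapReduce-level speculation too.
    sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
    sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")

    // Request the direct committer (Spark 1.6-era class path shown).
    // As described above, Spark silently falls back to the default Parquet
    // committer whenever SaveMode.Append is used.
    sc.hadoopConfiguration.set(
      "spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")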