Gourav: Regarding the 3rd paragraph, did you mean the job seemed to be idle for about 5 minutes?
Cheers

> On Mar 6, 2016, at 3:35 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> This is a solved problem; try using s3a instead and everything will be fine.
>
> Besides that, you might want to use coalesce, partitionBy, or repartition
> in order to control how many executors are being used to write (that
> speeds things up quite a bit).
>
> We had a write issue taking close to 50 min which is now running in under
> 5 minutes.
>
> Regards,
> Gourav Sengupta
>
>> On Fri, Mar 4, 2016 at 8:59 PM, Jelez Raditchkov <je...@hotmail.com> wrote:
>> I am working on a streaming job with DirectParquetOutputCommitter to S3.
>> I need to use partitionBy and hence SaveMode.Append.
>>
>> Apparently, when using SaveMode.Append Spark automatically defaults to
>> the default Parquet output committer and ignores
>> DirectParquetOutputCommitter.
>>
>> My problems are:
>> 1. The copying to _temporary takes a lot of time.
>> 2. I get job failures with: java.io.FileNotFoundException: File
>> s3n://jelez/parquet-data/_temporary/0/task_201603040904_0544_m_000007
>> does not exist.
>>
>> I have set:
>> sparkConfig.set("spark.speculation", "false")
>> sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
>> sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")
>>
>> Any ideas? Opinions? Best practices?
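For anyone landing on this thread later, a minimal sketch of the s3a switch Gourav suggests, assuming Spark 1.x with the hadoop-aws jar on the classpath; the app name, bucket, and credential values are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("s3a-example"))

    // s3a (from hadoop-aws) replaces the older s3n filesystem and avoids
    // several of its limitations around renames and large objects.
    sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  // placeholder
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  // placeholder

    // Writes then target s3a:// paths instead of s3n://, e.g.
    // df.write.parquet("s3a://jelez/parquet-data")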
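And a sketch of controlling write parallelism with repartition/coalesce before the Parquet write, per Gourav's second tip; "df" stands in for the DataFrame produced by each streaming batch, and the partition column "date" and the task counts are illustrative assumptions:

    import org.apache.spark.sql.{DataFrame, SaveMode}

    def writeBatch(df: DataFrame): Unit = {
      df.repartition(32)               // more tasks -> more parallel S3 writers
        // .coalesce(8)                // or fewer tasks -> fewer output files
        .write
        .mode(SaveMode.Append)         // Append is required with partitionBy here
        .partitionBy("date")           // hypothetical partition column
        .parquet("s3a://jelez/parquet-data")
    }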
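Finally, a hedged reconstruction of the configuration Jelez describes, as it would look on Spark 1.6; the committer config key must go on the Hadoop configuration, and the committer class path moved between Spark versions, so treat both as assumptions to verify against your version:

    import org.apache.spark.{SparkConf, SparkContext}

    // Disable Spark-level speculative execution, as in the original post,
    // so duplicate task attempts never race on the same S3 output path.
    val sparkConfig = new SparkConf()
      .setAppName("streaming-parquet-to-s3")   // assumed app name
      .set("spark.speculation", "false")

    val sc = new SparkContext(sparkConfig)

    // Disable MapReduce-level speculation too.
    sc.hadoopConfiguration.set("mapreduce.map.speculative", "false")
    sc.hadoopConfiguration.set("mapreduce.reduce.speculative", "false")

    // Request the direct committer (Spark 1.6-era class path shown).
    // As described above, Spark silently falls back to the default Parquet
    // committer whenever SaveMode.Append is used.
    sc.hadoopConfiguration.set(
      "spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")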