Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-10-01 Thread Takeshi Yamamuro
I got this info from a Hadoop JIRA ticket: https://issues.apache.org/jira/browse/MAPREDUCE-5485 // maropu

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-10-01 Thread Igor Berman
Takeshi, why are you saying this? How have you checked that it's only used from 2.7.3? We use Spark 2.0, which ships with a Hadoop 2.7.2 dependency, and we use this setting. We've sort of "verified" it's used by configuring logging for the file output committer.
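A quick way to sanity-check the setting from the driver (a minimal sketch in Scala, assuming `sc` is the live SparkContext) is to read the effective value back from the Hadoop configuration:

    // read back the effective commit algorithm; Hadoop defaults to "1" when unset
    val algo = sc.hadoopConfiguration.get(
      "mapreduce.fileoutputcommitter.algorithm.version", "1")
    println(s"fileoutputcommitter algorithm = $algo")

This only shows what is configured; confirming the algorithm is actually exercised still requires the committer logging described above.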

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-09-29 Thread Takeshi Yamamuro
Hi, FYI: It seems `sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")` is only available at hadoop-2.7.3+. // maropu
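Whether the setting takes effect depends on the Hadoop client libraries Spark was built against, which is easy to check at runtime (a short sketch; `org.apache.hadoop.util.VersionInfo` is a standard Hadoop utility class):

    // print the Hadoop version bundled with the running Spark build
    import org.apache.hadoop.util.VersionInfo
    println(VersionInfo.getVersion)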

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-09-29 Thread joffe.tal
You can partition explicitly by adding "/<partition_column>=<value>" to the end of the path you are writing to, and then use overwrite. BTW, in Spark 2.0 you just need to use sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2") and use s3a://, and you can work with regular output
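A sketch of that pattern in Scala, with a hypothetical DataFrame `df`, partition column `dt`, and bucket name; the idea is to overwrite a single partition path directly rather than appending through partitionBy:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    // Spark 2.0: use the v2 commit algorithm together with an s3a:// path
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

    // write one day's data straight into its partition directory
    df.filter(col("dt") === "2016-09-29")
      .write
      .mode(SaveMode.Overwrite)
      .parquet("s3a://my-bucket/events/dt=2016-09-29")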

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-06 Thread Ted Yu
Thanks for the clarification, Gourav.

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-06 Thread Gourav Sengupta
Hi Ted, There was no idle time after I changed the path to start with s3a and then ensured that the number of executors writing was large. The writes start and complete in about 5 mins or less. Initially the write used to complete in around 30 mins, and we could see that there were failure

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-06 Thread Ted Yu
Gourav: For the 3rd paragraph, did you mean the job seemed to be idle for about 5 minutes? Cheers

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-06 Thread Gourav Sengupta
Hi, This is a solved problem: try using s3a instead and everything will be fine. Besides that, you might want to use coalesce, partitionBy, or repartition in order to control how many executors are being used to write (that speeds things up quite a bit). We had a write issue taking close to 50 min
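A minimal sketch of that tuning; the partition counts are purely illustrative and depend on cluster size and data volume:

    // more partitions means more tasks writing to S3 in parallel
    df.repartition(200)
      .write
      .parquet("s3a://my-bucket/output")

    // conversely, coalesce(n) reduces the number of (small) output files
    df.coalesce(10)
      .write
      .parquet("s3a://my-bucket/output-small")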

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-05 Thread Igor Berman
It's not safe to use the direct committer with append mode; you may lose your data.

S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-04 Thread Jelez Raditchkov
Working on a streaming job with DirectParquetOutputCommitter to S3. I need to use PartitionBy and hence SaveMode.Append. Apparently when using SaveMode.Append, Spark automatically falls back to the default parquet output committer and ignores DirectParquetOutputCommitter. My problems are: 1. the
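A sketch of the setup being described, under Spark 1.6-era names (the committer class and its package moved between 1.x releases and the class was removed in Spark 2.0; `df` and the paths are placeholders):

    import org.apache.spark.sql.SaveMode

    // point Parquet writes at the direct committer
    sc.hadoopConfiguration.set(
      "spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")

    // with partitionBy + Append, Spark falls back to the default
    // ParquetOutputCommitter and the setting above is silently ignored
    df.write
      .partitionBy("date")
      .mode(SaveMode.Append)
      .parquet("s3n://my-bucket/stream-output")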