Re: Parquet 'bucketBy' creates a ton of files

2019-07-10 Thread Silvio Fiorito
…-from-the-field-episode-ii-applying-best-practices-to-your-apache-spark-applications-with-silvio-fiorito (truncated link)

Re: Parquet 'bucketBy' creates a ton of files

2019-07-10 Thread Gourav Sengupta
> …files. If the bucket keys are distributed randomly across the RDD partitions, then you will get multiple files per bucket.
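One way to act on that observation (a minimal sketch, not verbatim from the thread; the string literals stand in for the poster's bucketColumn1/bucketColumn2 variables) is to repartition by the bucket columns before writing, so each bucket's rows land in a single task:

```java
import static org.apache.spark.sql.functions.col;

// Hedged sketch: align the shuffle partitions with the buckets so that each
// bucket is written by exactly one task, instead of one file per (task, bucket).
dataframe
    .repartition(500, col("bucketColumn1"), col("bucketColumn2"))
    .write()
    .format("parquet")
    .bucketBy(500, "bucketColumn1", "bucketColumn2")
    .mode(SaveMode.Overwrite)
    .option("path", "s3://my-bucket")
    .saveAsTable("my_table");
```

With 500 partitions aligned to 500 buckets, each task holds complete buckets and the output drops to roughly one file per bucket.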

Re: Parquet 'bucketBy' creates a ton of files

2019-07-04 Thread Silvio Fiorito
I am trying to use Spark's **bucketBy** feature on a pretty large dataset.

```java
dataframe.write()
    .format("parquet")
    .bucketBy(500, bucketColumn1, bucketColumn2)
    .mode(SaveMode.Overwrite)
    // …
```

Re: Parquet 'bucketBy' creates a ton of files

2019-07-04 Thread Phillip Henry
Hi, Arwin. If I understand you correctly, this is totally expected behaviour. I don't know much about saving to S3, but maybe you could write to HDFS first, then copy everything to S3? I think the write to HDFS will probably be much faster, as Spark/HDFS will write locally or to a machine on the …
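A minimal sketch of that suggestion, assuming the cluster has HDFS available; the staging path and the distcp step afterwards are illustrative, not from the thread:

```java
// Write the bucketed table to cluster-local HDFS first (fast, local writes),
// then copy the finished files to S3 in one pass. Paths are hypothetical.
dataframe.write()
    .format("parquet")
    .bucketBy(500, bucketColumn1, bucketColumn2)
    .mode(SaveMode.Overwrite)
    .option("path", "hdfs:///tmp/staging/my_table")
    .saveAsTable("my_table");

// Then copy to S3 outside Spark, e.g.:
//   hadoop distcp hdfs:///tmp/staging/my_table s3a://my-bucket/my_table
```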

Parquet 'bucketBy' creates a ton of files

2019-07-04 Thread Arwin Tio
I am trying to use Spark's **bucketBy** feature on a pretty large dataset.

```java
dataframe.write()
    .format("parquet")
    .bucketBy(500, bucketColumn1, bucketColumn2)
    .mode(SaveMode.Overwrite)
    .option("path", "s3://my-bucket")
    .saveAsTable("my_table");
```

The problem is that …
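For a sense of scale (illustrative numbers, not from the post): each write task emits one file per bucket whose keys it holds, so the file count can approach tasks × buckets.

```java
// Worst-case output file count for a bucketed write: every task sees keys for
// every bucket, so each task writes one file per bucket. Numbers are hypothetical.
int inputPartitions = 2000;                           // hypothetical task count
int numBuckets = 500;                                 // from .bucketBy(500, ...)
long worstCase = (long) inputPartitions * numBuckets; // 1,000,000 files
```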