From: Gourav Sengupta
Date: Wednesday, July 10, 2019 at 3:14 AM
To: Silvio Fiorito
Cc: Arwin Tio, "user@spark.apache.org"
Subject: Re: Parquet 'bucketBy' creates a ton of files

Hi, Arwin.

If I understand you correctly, this is totally expected behaviour. I don't know much about saving to S3, but maybe you could write to HDFS first and then copy everything over to S3? I think the write to HDFS will probably be much faster, as Spark/HDFS will write locally or to a machine on the same network.
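
A minimal sketch of that HDFS-first idea, staging the bucketed table on the cluster and copying it to S3 afterwards. The staging path, table name, and the DistCp copy step are assumptions for illustration, not part of the original mails:

```java
import org.apache.spark.sql.SaveMode;

// Sketch: write the bucketed table to HDFS on the cluster first.
// hdfs:///tmp/my_table_staging is a hypothetical staging path.
dataframe.write()
    .format("parquet")
    .bucketBy(500, bucketColumn1, bucketColumn2)
    .mode(SaveMode.Overwrite)
    .option("path", "hdfs:///tmp/my_table_staging")
    .saveAsTable("my_table");

// Then copy the staged files to S3 outside of Spark, for example:
//   hadoop distcp hdfs:///tmp/my_table_staging s3a://my-bucket/my_table
```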
> You need to first repartition (at a minimum by bucketColumn1), since each
> task will write out the buckets/files. If the bucket keys are distributed
> randomly across the RDD partitions, then you will get multiple files per
> bucket.
>
> *From: *Arwin Tio
> *Date: *Thursday, July 4, 2019 at 3:22 AM
> *To: *"user@spark.apache.org"
> *Subject: *Parquet 'bucketBy' creates a ton of files
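
A minimal sketch of the repartition-first approach quoted above, assuming bucketColumn1 and bucketColumn2 are column-name strings as in the snippet below, and choosing 500 shuffle partitions to match the 500 buckets:

```java
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.col;

// Shuffle by the bucket columns first, so all rows that hash to the
// same bucket land in the same task; each task then writes out only
// the buckets it actually holds, instead of one file per bucket per task.
dataframe
    .repartition(500, col(bucketColumn1), col(bucketColumn2))
    .write()
    .format("parquet")
    .bucketBy(500, bucketColumn1, bucketColumn2)
    .mode(SaveMode.Overwrite)
    .option("path", "s3://my-bucket")
    .saveAsTable("my_table");
```

With the partition count matched to the bucket count, the output should drop from roughly (tasks × buckets) files to about one file per bucket.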
I am trying to use Spark's **bucketBy** feature on a pretty large dataset.
```java
dataframe.write()
    .format("parquet")
    .bucketBy(500, bucketColumn1, bucketColumn2)
    .mode(SaveMode.Overwrite)
    .option("path", "s3://my-bucket")
    .saveAsTable("my_table");
```
The problem is that this creates a ton of files on S3, far more than the 500 buckets I specified.