Hi, Arwin. If I understand you correctly, this is totally expected behaviour.
I don't know much about saving to S3, but maybe you could write to HDFS first and then copy everything to S3? The write to HDFS will probably be much faster, as Spark/HDFS will write locally or to a machine on the same LAN. After writing to HDFS, you can then iterate over the resulting sub-directories (representing each bucket) and coalesce the files in them. A rough sketch of the idea is below, after your quoted message.

Regards,

Phillip

On Thu, Jul 4, 2019 at 8:22 AM Arwin Tio <arwin....@hotmail.com> wrote:

> I am trying to use Spark's **bucketBy** feature on a pretty large dataset.
>
> ```java
> dataframe.write()
>     .format("parquet")
>     .bucketBy(500, bucketColumn1, bucketColumn2)
>     .mode(SaveMode.Overwrite)
>     .option("path", "s3://my-bucket")
>     .saveAsTable("my_table");
> ```
>
> The problem is that my Spark cluster has about 500 partitions/tasks/executors (not sure of the terminology), so I end up with files that look like:
>
> ```
> part-00001-{UUID}_00001.c000.snappy.parquet
> part-00001-{UUID}_00002.c000.snappy.parquet
> ...
> part-00001-{UUID}_00500.c000.snappy.parquet
>
> part-00002-{UUID}_00001.c000.snappy.parquet
> part-00002-{UUID}_00002.c000.snappy.parquet
> ...
> part-00002-{UUID}_00500.c000.snappy.parquet
>
> part-00500-{UUID}_00001.c000.snappy.parquet
> part-00500-{UUID}_00002.c000.snappy.parquet
> ...
> part-00500-{UUID}_00500.c000.snappy.parquet
> ```
>
> That's 500 x 500 = 250,000 bucketed parquet files! It takes forever for the `FileOutputCommitter` to commit that to S3.
>
> Is there a way to generate **one file per bucket**, like in Hive? Or is there a better way to deal with this problem? As of now it seems like I have to choose between lowering the parallelism of my cluster (reducing the number of writers) or reducing the parallelism of my parquet files (reducing the number of buckets), which will lower the parallelism of my downstream jobs.
>
> Thanks
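
To make the copy-via-HDFS idea concrete, here is a rough, untested sketch. It reuses `dataframe` and the bucket columns from your quoted snippet; the staging path `hdfs:///staging/my_table` and the distcp target are placeholders you would adjust for your cluster.

```java
// Step 1: same bucketed write as in the quoted mail, but staged on HDFS,
// where the task-output commit is a cheap rename rather than an S3 copy.
dataframe.write()
    .format("parquet")
    .bucketBy(500, bucketColumn1, bucketColumn2)
    .mode(SaveMode.Overwrite)
    .option("path", "hdfs:///staging/my_table")   // placeholder staging path
    .saveAsTable("my_table");

// Step 2: once the job has finished, copy the whole directory to S3 in bulk,
// outside of Spark, e.g. with distcp (or s3-dist-cp on EMR), which copies the
// finished files in parallel:
//
//   hadoop distcp hdfs:///staging/my_table s3a://my-bucket/my_table
```

One caveat on the coalescing pass: parquet files can't simply be concatenated at the byte level, so merging a bucket's many small files would mean reading them back and rewriting them. An alternative might be to cut the file count at write time, e.g. by repartitioning the dataframe on the bucket columns before the bucketed write, so that each bucket's rows land on a single task and each bucket produces far fewer files.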