Hi, Arwin.

If I understand you correctly, this is totally expected behaviour: each write
task produces one file per bucket it holds data for, so roughly 500 tasks x
500 buckets gives the ~250,000 files you are seeing.

I don't know much about saving to S3, but maybe you could write to HDFS
first and then copy everything to S3? The write to HDFS should be much
faster, since Spark/HDFS will write locally or to a machine on the same
LAN. After writing to HDFS, you can then group the resulting part files by
bucket id and coalesce each group into a single file before copying to S3.
A rough sketch of what I mean is below.
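This is untested and only a sketch of the idea: the input and staging paths,
the table/column names, the assumption that every bucket has at least one
file, the use of a Hive metastore for saveAsTable, and the bucket-id glob
(based on the file names in your listing) are all placeholders you would need
to adapt. The final copy could be done with hadoop distcp or EMR's s3-dist-cp.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class BucketedWriteSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("bucketed-write-sketch")
        .enableHiveSupport()   // assumes a Hive metastore is available for saveAsTable
        .getOrCreate();

    Dataset<Row> dataframe = spark.read().parquet("hdfs:///input/data"); // placeholder input

    // Step 1: write the bucketed table to HDFS instead of S3, so the commit
    // of the ~250,000 small files happens on local/LAN storage.
    dataframe.write()
        .format("parquet")
        .bucketBy(500, "bucketColumn1", "bucketColumn2")
        .mode(SaveMode.Overwrite)
        .option("path", "hdfs:///staging/my_table")
        .saveAsTable("my_table");

    // Step 2: for each bucket id, read that bucket's part files (the bucket id
    // is the suffix in the file name, e.g. part-00001-{UUID}_00042.c000.snappy.parquet)
    // and rewrite them as a single file. Assumes every bucket has at least one file.
    for (int bucket = 0; bucket < 500; bucket++) {
      String glob = String.format(
          "hdfs:///staging/my_table/part-*_%05d.c000.snappy.parquet", bucket);
      spark.read().parquet(glob)
          .coalesce(1)
          .write()
          .mode(SaveMode.Overwrite)
          .parquet(String.format("hdfs:///staging/my_table_coalesced/bucket_%05d", bucket));
    }

    // Step 3: copy the coalesced output to S3 in one pass, e.g.
    //   hadoop distcp hdfs:///staging/my_table_coalesced s3a://my-bucket/my_table
    // or, on EMR, s3-dist-cp (which can also merge small files with --groupBy).

    spark.stop();
  }
}
```

One caveat: the coalesced output is plain parquet, so Spark's bucketing
metadata won't survive the rewrite; downstream jobs would have to treat it as
an ordinary table or re-bucket it.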

Regards,

Phillip




On Thu, Jul 4, 2019 at 8:22 AM Arwin Tio <arwin....@hotmail.com> wrote:

> I am trying to use Spark's **bucketBy** feature on a pretty large dataset.
>
> ```java
> dataframe.write()
>     .format("parquet")
>     .bucketBy(500, bucketColumn1, bucketColumn2)
>     .mode(SaveMode.Overwrite)
>     .option("path", "s3://my-bucket")
>     .saveAsTable("my_table");
> ```
>
> The problem is that my Spark cluster has about 500
> partitions/tasks/executors (not sure the terminology), so I end up with
> files that look like:
>
> ```
> part-00001-{UUID}_00001.c000.snappy.parquet
> part-00001-{UUID}_00002.c000.snappy.parquet
> ...
> part-00001-{UUID}_00500.c000.snappy.parquet
>
> part-00002-{UUID}_00001.c000.snappy.parquet
> part-00002-{UUID}_00002.c000.snappy.parquet
> ...
> part-00002-{UUID}_00500.c000.snappy.parquet
>
> ...
>
> part-00500-{UUID}_00001.c000.snappy.parquet
> part-00500-{UUID}_00002.c000.snappy.parquet
> ...
> part-00500-{UUID}_00500.c000.snappy.parquet
> ```
>
> That's 500x500=250000 bucketed parquet files! It takes forever for the
> `FileOutputCommitter` to commit that to S3.
>
> Is there a way to generate **one file per bucket**, like in Hive? Or is
> there a better way to deal with this problem? As of now it seems like I
> have to choose between lowering the parallelism of my cluster (reduce
> number of writers) or reducing the parallelism of my parquet files (reduce
> number of buckets), which will lower the parallelism of my downstream jobs.
>
> Thanks
>
