Why does it need to be only one file? Spark does a good job of writing to many files.
On Fri, Jan 15, 2016 at 7:48 AM, Patrick McGloin <mcgloin.patr...@gmail.com> wrote:
> Hi,
>
> I would like to repartition / coalesce my data so that it is saved into one
> Parquet file per partition. I would also like to use the Spark SQL
> partitionBy API. So I could do that like this:
>
> df.coalesce(1).write.partitionBy("entity", "year", "month", "day",
>   "status").mode(SaveMode.Append).parquet(s"$location")
>
> I've tested this and it doesn't seem to perform well. This is because there
> is only one partition to work on in the dataset, and all the partitioning,
> compression and saving of files has to be done by one CPU core.
>
> I could rewrite this to do the partitioning manually (for example, by
> filtering on the distinct partition values) before calling coalesce
> (see the sketch below).
>
> But is there a better way to do this using the standard Spark SQL API?
>
> Best regards,
>
> Patrick
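For reference, a minimal sketch of the manual approach Patrick describes: collect the distinct partition-column values, then filter, coalesce and write each slice separately, so only one partition's data (not the whole dataset) is funnelled through a single core per write. This assumes Spark 1.x, a DataFrame named df with the partition columns from the message, and location as the target base path; all names are taken from (or illustrative of) the original snippet.

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    val partitionCols = Seq("entity", "year", "month", "day", "status")

    // Distinct combinations of partition values present in the data.
    val partitionValues =
      df.select(partitionCols.map(col): _*).distinct().collect()

    partitionValues.foreach { row =>
      // Predicate matching exactly one partition's rows,
      // e.g. entity = ... AND year = ... AND month = ...
      val predicate = partitionCols.zipWithIndex
        .map { case (c, i) => col(c) === row.get(i) }
        .reduce(_ && _)

      // coalesce(1) now applies only to this partition's slice, so the
      // other partitions can still be processed by other writes.
      df.filter(predicate)
        .coalesce(1)
        .write
        .partitionBy(partitionCols: _*)
        .mode(SaveMode.Append)
        .parquet(s"$location")
    }

Note the trade-off: each partition gets exactly one Parquet file, but the writes run as a sequential driver-side loop over the distinct partition values, which is why Patrick asks whether the standard Spark SQL API offers a better way.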