Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Arkadiusz Bicz
Why do you need it to be only one file? Spark does a good job of
writing to many files.

On Fri, Jan 15, 2016 at 7:48 AM, Patrick McGloin wrote:
> Hi,
>
> I would like to repartition / coalesce my data so that it is saved into one
> Parquet file per partition. I would also like to use the Spark SQL
> partitionBy API. So I could do that like this:
>
> df.coalesce(1).write.partitionBy("entity", "year", "month", "day",
> "status").mode(SaveMode.Append).parquet(s"$location")
>
> I've tested this and it doesn't seem to perform well. This is because there
> is only one partition to work on in the dataset and all the partitioning,
> compression and saving of files has to be done by one CPU core.
>
> I could rewrite this to do the partitioning manually (using filter with the
> distinct partition values for example) before calling coalesce.
>
> But is there a better way to do this using the standard Spark SQL API?
>
> Best regards,
>
> Patrick
>
>




Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Cheng Lian
You may try DataFrame.repartition(partitionExprs: Column*) to shuffle 
all data belonging to a single (data) partition into a single (RDD) 
partition:


df.coalesce(1).repartition("entity", "year", "month", "day",
"status").write.partitionBy("entity", "year", "month", "day",
"status").mode(SaveMode.Append).parquet(s"$location")


(Unfortunately the naming here can be quite confusing.)
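
For reference, a minimal self-contained sketch of this approach (Spark
1.6-style Scala API; the method name and input/output paths are
illustrative, not from this thread):

// Sketch only: repartition by the partition columns, then write one
// Parquet file per (entity, year, month, day, status) output directory.
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.sql.functions.col

def writeOneFilePerPartition(sqlContext: SQLContext, location: String): Unit = {
  val df = sqlContext.read.parquet(s"$location/input")  // hypothetical input path

  // repartition(partitionExprs) shuffles all rows that share the same
  // partition-column values into the same RDD partition, so the
  // subsequent partitionBy writes a single file per output directory.
  df.repartition(col("entity"), col("year"), col("month"), col("day"), col("status"))
    .write
    .partitionBy("entity", "year", "month", "day", "status")
    .mode(SaveMode.Append)
    .parquet(s"$location/output")                        // hypothetical output path
}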

Cheng

On 1/14/16 11:48 PM, Patrick McGloin wrote:

Hi,

I would like to repartition / coalesce my data so that it is saved into 
one Parquet file per partition. I would also like to use the Spark SQL 
partitionBy API. So I could do that like this:


df.coalesce(1).write.partitionBy("entity", "year", "month", "day",
"status").mode(SaveMode.Append).parquet(s"$location")


I've tested this and it doesn't seem to perform well. This is because 
there is only one partition to work on in the dataset and all the 
partitioning, compression and saving of files has to be done by one 
CPU core.


I could rewrite this to do the partitioning manually (using filter 
with the distinct partition values for example) before calling coalesce.


But is there a better way to do this using the standard Spark SQL API?

Best regards,

Patrick






Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Patrick McGloin
I will try this on Monday. Thanks for the tip.

On Fri, 15 Jan 2016, 18:58 Cheng Lian wrote:

> You may try DataFrame.repartition(partitionExprs: Column*) to shuffle all
> data belonging to a single (data) partition into a single (RDD) partition:
>
> df.coalesce(1).repartition("entity", "year", "month", "day", 
> "status").write.partitionBy("entity", "year", "month", "day", 
> "status").mode(SaveMode.Append).parquet(s"$location")
>
> (Unfortunately the naming here can be quite confusing.)
>
>
> Cheng
>
>
> On 1/14/16 11:48 PM, Patrick McGloin wrote:
>
> Hi,
>
> I would like to repartition / coalesce my data so that it is saved into one
> Parquet file per partition. I would also like to use the Spark SQL
> partitionBy API. So I could do that like this:
>
> df.coalesce(1).write.partitionBy("entity", "year", "month", "day", 
> "status").mode(SaveMode.Append).parquet(s"$location")
>
> I've tested this and it doesn't seem to perform well. This is because
> there is only one partition to work on in the dataset and all the
> partitioning, compression and saving of files has to be done by one CPU
> core.
>
> I could rewrite this to do the partitioning manually (using filter with
> the distinct partition values for example) before calling coalesce.
>
> But is there a better way to do this using the standard Spark SQL API?
>
> Best regards,
>
> Patrick
>
>
>
>


DataFrame partitionBy to a single Parquet file (per partition)

2016-01-14 Thread Patrick McGloin
Hi,

I would like to repartition / coalesce my data so that it is saved into one
Parquet file per partition. I would also like to use the Spark SQL
partitionBy API. So I could do that like this:

df.coalesce(1).write.partitionBy("entity", "year", "month", "day",
"status").mode(SaveMode.Append).parquet(s"$location")

I've tested this and it doesn't seem to perform well. This is because there
is only one partition to work on in the dataset and all the partitioning,
compression and saving of files has to be done by one CPU core.

I could rewrite this to do the partitioning manually (using filter with the
distinct partition values for example) before calling coalesce.
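
A rough sketch of that manual approach (the column choice and output
paths are illustrative; a real version would loop over all distinct
(entity, year, month, day, status) combinations):

// Sketch: write each partition's data separately with coalesce(1),
// so each output directory gets exactly one Parquet file.
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

def writeManually(df: DataFrame, location: String): Unit = {
  // Collect the distinct partition values to the driver
  // (assumed to be a small set).
  val entities = df.select("entity").distinct().collect().map(_.getString(0))

  entities.foreach { e =>
    df.filter(col("entity") === e)
      .coalesce(1)                      // one task per subset => one file
      .write
      .mode(SaveMode.Append)
      .parquet(s"$location/entity=$e")  // hypothetical Hive-style path
  }
}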

But is there a better way to do this using the standard Spark SQL API?

Best regards,

Patrick