Re: DataFrame partitionBy to a single Parquet file (per partition)
Why does it need to be only one file? Spark does a good job of writing to many files.

On Fri, Jan 15, 2016 at 7:48 AM, Patrick McGloin wrote:
> I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. […]
Re: DataFrame partitionBy to a single Parquet file (per partition)
You may try DataFrame.repartition(partitionExprs: Column*) to shuffle all data belonging to a single (data) partition into a single (RDD) partition:

df.coalesce(1).repartition("entity", "year", "month", "day", "status").write.partitionBy("entity", "year", "month", "day", "status").mode(SaveMode.Append).parquet(s"$location")

(Unfortunately the naming here can be quite confusing; a minimal Scala sketch of this follows below.)

Cheng

On 1/14/16 11:48 PM, Patrick McGloin wrote:
> I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. […]
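A minimal Scala sketch of the suggestion above, for reference. Since DataFrame.repartition takes Column expressions, the column names are wrapped with col(...); the DataFrame df, the location path and the helper name are illustrative assumptions, not code from the thread.

    import org.apache.spark.sql.{DataFrame, SaveMode}
    import org.apache.spark.sql.functions.col

    // Hypothetical helper: repartition by the same expressions used in
    // partitionBy, so all rows of one output partition land in one RDD
    // partition (and hence one Parquet file per output directory).
    def writeOneFilePerPartition(df: DataFrame, location: String): Unit = {
      val partitionCols = Seq("entity", "year", "month", "day", "status")
      df.repartition(partitionCols.map(col): _*)   // one shuffle, grouping rows by partition key
        .write
        .partitionBy(partitionCols: _*)            // one output directory per key combination
        .mode(SaveMode.Append)
        .parquet(location)
    }

With one RDD partition per distinct key combination, each task writes exactly one file; a heavily skewed key can still leave a single task doing most of the work, but the writes for different keys proceed in parallel.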
Re: DataFrame partitionBy to a single Parquet file (per partition)
I will try this on Monday. Thanks for the tip.

On Fri, 15 Jan 2016, 18:58 Cheng Lian wrote:
> You may try DataFrame.repartition(partitionExprs: Column*) to shuffle all data belonging to a single (data) partition into a single (RDD) partition. […]
DataFrame partitionBy to a single Parquet file (per partition)
Hi,

I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could do that like this:

df.coalesce(1).write.partitionBy("entity", "year", "month", "day", "status").mode(SaveMode.Append).parquet(s"$location")

I've tested this and it doesn't seem to perform well. This is because there is only one partition to work on in the dataset, and all the partitioning, compression and saving of files has to be done by one CPU core.

I could rewrite this to do the partitioning manually (using filter with the distinct partition values, for example) before calling coalesce (a sketch of that approach follows below).

But is there a better way to do this using the standard Spark SQL API?

Best regards,

Patrick
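For comparison, the manual workaround mentioned above (filtering on the distinct partition values before coalescing) might look roughly like the following Scala sketch; the helper name, df and location are illustrative assumptions, not code from the thread.

    import org.apache.spark.sql.{DataFrame, Row, SaveMode}

    // Illustrative sketch of the manual workaround: write each partition-key
    // combination separately so every write only has to coalesce its own slice.
    def writeEachPartitionManually(df: DataFrame, location: String): Unit = {
      val keys: Array[Row] =
        df.select("entity", "year", "month", "day", "status").distinct().collect()

      keys.foreach { case Row(entity, year, month, day, status) =>
        df.filter(
            df("entity") === entity && df("year") === year && df("month") === month &&
            df("day") === day && df("status") === status)
          .coalesce(1)                                      // single file for this key
          .write
          .partitionBy("entity", "year", "month", "day", "status")
          .mode(SaveMode.Append)
          .parquet(location)
      }
    }

This launches one Spark job per key combination, so it trades the single-core bottleneck for job-scheduling overhead; the repartition-based suggestion earlier in the thread usually reaches the same file layout in a single job.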