Why does it need to be only one file? Spark does a good job of writing to many files.
On Fri, Jan 15, 2016 at 7:48 AM, Patrick McGloin <mcgloin.patr...@gmail.com> wrote:
> Hi,
>
> I would like to repartition / coalesce my data so that it is saved into one
> Parquet file per partition. I would also like to use the Spark SQL
> partitionBy API. So I could do that like this:
>
> df.coalesce(1).write.partitionBy("entity", "year", "month", "day",
>   "status").mode(SaveMode.Append).parquet(s"$location")
>
> I've tested this and it doesn't seem to perform well. This is because there
> is only one partition to work on in the dataset, and all the partitioning,
> compression and saving of files has to be done by one CPU core.
>
> I could rewrite this to do the partitioning manually (for example, by
> filtering on the distinct partition values) before calling coalesce
> (see the sketch below).
>
> But is there a better way to do this using the standard Spark SQL API?
>
> Best regards,
>
> Patrick
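For reference, a minimal sketch of the manual approach Patrick describes: collect the distinct partition-column values, then filter, coalesce and write each slice separately, so only one partition's data (not the whole dataset) is funnelled through a single core per write. This assumes Spark 1.x, a DataFrame named df with the partition columns from the message, and location as the target base path; all names are taken from (or illustrative of) the original snippet.

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    val partitionCols = Seq("entity", "year", "month", "day", "status")

    // Distinct combinations of partition values present in the data.
    val partitionValues =
      df.select(partitionCols.map(col): _*).distinct().collect()

    partitionValues.foreach { row =>
      // Predicate matching exactly one partition's rows,
      // e.g. entity = ... AND year = ... AND month = ...
      val predicate = partitionCols.zipWithIndex
        .map { case (c, i) => col(c) === row.get(i) }
        .reduce(_ && _)

      // coalesce(1) now applies only to this partition's slice, so the
      // other partitions can still be processed by other writes.
      df.filter(predicate)
        .coalesce(1)
        .write
        .partitionBy(partitionCols: _*)
        .mode(SaveMode.Append)
        .parquet(s"$location")
    }

Note the trade-off: each partition gets exactly one Parquet file, but the writes run as a sequential driver-side loop over the distinct partition values, which is why Patrick asks whether the standard Spark SQL API offers a better way.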