Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Arkadiusz Bicz
Why does it need to be only one file? Spark does a good job of writing to many files. On Fri, Jan 15, 2016 at 7:48 AM, Patrick McGloin wrote: > Hi, > > I would like to repartition / coalesce my data so that it is saved into one > Parquet file per partition. I would also like …

Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Cheng Lian
You may try DataFrame.repartition(partitionExprs: Column*) to shuffle all data belonging to a single (data) partition into a single (RDD) partition: df.coalesce(1).repartition("entity", "year", "month", "day", "status").write.partitionBy("entity", "year", "month", "day", …
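
For reference, a runnable version of this suggestion might look like the sketch below. It is an illustration only, not the exact code from the thread: the SparkSession setup (a newer entry point; the thread predates it, when SQLContext was used), the paths, and the app name are assumptions, while the partition column names come from the thread itself. Note that repartition takes Column arguments, and since it performs its own shuffle, the coalesce(1) from the quoted snippet is omitted here.

    // Minimal sketch of the repartition + partitionBy approach (assumed
    // setup; column names taken from the thread, paths are placeholders).
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("OneFilePerPartition").getOrCreate()
    val df = spark.read.parquet("/path/to/input")  // hypothetical input path

    // repartition(partitionExprs: Column*) hash-shuffles all rows that share
    // the same values of these columns into the same RDD partition, so the
    // partitionBy write below emits exactly one Parquet file per output
    // directory (entity=.../year=.../month=.../day=.../status=...).
    df.repartition(col("entity"), col("year"), col("month"), col("day"), col("status"))
      .write
      .partitionBy("entity", "year", "month", "day", "status")
      .parquet("/path/to/output")                  // hypothetical output path

Unlike coalesce(1), this keeps the write parallel: each shuffle task writes the files for the keys it holds, and because every key lives entirely in one task, each partition directory still ends up with a single file.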

Re: DataFrame partitionBy to a single Parquet file (per partition)

2016-01-15 Thread Patrick McGloin
I will try this on Monday. Thanks for the tip. On Fri, 15 Jan 2016, 18:58 Cheng Lian wrote: > You may try DataFrame.repartition(partitionExprs: Column*) to shuffle all > data belonging to a single (data) partition into a single (RDD) partition: > > …

DataFrame partitionBy to a single Parquet file (per partition)

2016-01-14 Thread Patrick McGloin
Hi, I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could do that like this: df.coalesce(1).write.partitionBy("entity", "year", "month", "day", …
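
For context, the approach described in this post can be sketched as below. This is a reading of the truncated snippet, not the original code: the fifth partition column ("status") is assumed from Cheng Lian's reply above, the save mode and path are placeholders, and df stands for any loaded DataFrame.

    // Sketch of the coalesce(1) approach from the question. coalesce(1)
    // collapses the DataFrame to a single RDD partition, so the entire
    // write runs as one task; partitionBy then splits that task's output
    // into one file per directory, at the cost of all write parallelism.
    import org.apache.spark.sql.SaveMode

    df.coalesce(1)
      .write
      .partitionBy("entity", "year", "month", "day", "status") // "status" assumed from the reply above
      .mode(SaveMode.Append)        // assumed; the original message is cut off here
      .parquet("/path/to/output")   // placeholder path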