Had the same issue myself. I was surprised at first as well, but I found it useful: the amount of data saved for each partition decreases, because the partition column values are stored in the directory names (e.g. bucket=B01/actionType=A1) rather than repeated in every record. When I load the data from each partition, I add the partition columns back with the <code>lit</code> function before I merge the frames from the different partitions.
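A rough sketch of what I mean, not a definitive implementation. It assumes the output was written to "output.json" with partitionBy("bucket", "actionType") as in the original question, so the files sit under directories like output.json/bucket=B01/actionType=A1. The object name and the hard-coded list of partition values are made up for illustration, and unionByName needs Spark 2.3 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object ReadPartitionsWithLit {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-partitions-with-lit")
      .master("local[*]") // assumption: local run, adjust for your cluster
      .getOrCreate()

    // Hypothetical list of (bucket, actionType) pairs, taken from the sample data.
    val partitions = Seq(("B01", "A1"), ("B02", "A2"), ("B03", "A3"))

    // Read each partition directory on its own and re-attach the
    // partition values as constant columns with lit.
    val frames = partitions.map { case (bucket, actionType) =>
      spark.read
        .json(s"output.json/bucket=$bucket/actionType=$actionType")
        .withColumn("bucket", lit(bucket))
        .withColumn("actionType", lit(actionType))
    }

    // Merge the per-partition frames, matching columns by name.
    val merged = frames.reduce(_ unionByName _)
    merged.show()

    spark.stop()
  }
}
```

As far as I know, reading the root directory directly with spark.read.json("output.json") should also bring the partition columns back through partition discovery, so the per-partition approach is mainly useful when you only want some of the partitions or need to control the values yourself.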
On Tue, Jun 5, 2018 at 5:44 AM, Jay <jayadeep.jayara...@gmail.com> wrote:

> The partitionBy clause is used to create hive folders so that you can
> point a hive partitioned table on the data .
>
> What are you using the partitionBy for ? What is the use case ?
>
>
> On Mon 4 Jun, 2018, 4:59 PM purna pradeep, <purna2prad...@gmail.com>
> wrote:
>
>> im reading below json in spark
>>
>> {"bucket": "B01", "actionType": "A1", "preaction": "NULL",
>> "postaction": "NULL"}
>> {"bucket": "B02", "actionType": "A2", "preaction": "NULL",
>> "postaction": "NULL"}
>> {"bucket": "B03", "actionType": "A3", "preaction": "NULL",
>> "postaction": "NULL"}
>>
>> val df=spark.read.json("actions.json").toDF()
>>
>> Now im writing the same to a json output as below
>>
>> df.write. format("json"). mode("append").
>> partitionBy("bucket","actionType").
>> save("output.json")
>>
>>
>> and the output.json is as below
>>
>> {"preaction":"NULL","postaction":"NULL"}
>>
>> bucket,actionType columns are missing in the json output, i need
>> partitionby columns as well in the output