Had the same issue myself. I was surprised at first as well, but I found
it useful - the amount of data saved in each partition directory has
decreased, since the partition column values are encoded in the directory
names instead of in every row. When I load the data from a single
partition, I add the partition columns back with the <code>lit</code>
function before I merge the frames from the different partitions.
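
Roughly like this (a sketch; the path and the literal values are just
illustrative, matching the example below):

```scala
import org.apache.spark.sql.functions.lit

// partitionBy encodes the column values in the directory layout, e.g.
//   output.json/bucket=B01/actionType=A1/part-*.json
// so when reading one partition directory, re-attach the columns:
val b01a1 = spark.read.json("output.json/bucket=B01/actionType=A1")
  .withColumn("bucket", lit("B01"))
  .withColumn("actionType", lit("A1"))
```

Note that if you read the base path instead
(<code>spark.read.json("output.json")</code>), Spark's partition
discovery reconstructs the bucket and actionType columns for you
automatically.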

On Tue, Jun 5, 2018 at 5:44 AM, Jay <jayadeep.jayara...@gmail.com> wrote:

> The partitionBy clause is used to create hive folders so that you can
> point a hive partitioned table on the data .
>
> What are you using the partitionBy for ? What is the use case ?
>
>
> On Mon 4 Jun, 2018, 4:59 PM purna pradeep, <purna2prad...@gmail.com>
> wrote:
>
>> I'm reading the JSON below in Spark:
>>
>>     {"bucket": "B01", "actionType": "A1", "preaction": "NULL",
>> "postaction": "NULL"}
>>     {"bucket": "B02", "actionType": "A2", "preaction": "NULL",
>> "postaction": "NULL"}
>>     {"bucket": "B03", "actionType": "A3", "preaction": "NULL",
>> "postaction": "NULL"}
>>
>>     val df=spark.read.json("actions.json").toDF()
>>
>> Now I'm writing the same data back out as JSON:
>>
>>     df.write.format("json")
>>       .mode("append")
>>       .partitionBy("bucket","actionType")
>>       .save("output.json")
>>
>>
>> and the output.json is as below
>>
>>     {"preaction":"NULL","postaction":"NULL"}
>>
>> The bucket and actionType columns are missing in the JSON output; I
>> need the partitionBy columns in the output as well.
>>
>>
