[ https://issues.apache.org/jira/browse/SPARK-14117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Franklyn Dsouza closed SPARK-14117.
-----------------------------------
    Resolution: Fixed

> write.partitionBy retains partitioning column when outputting Parquet
> ---------------------------------------------------------------------
>
>                 Key: SPARK-14117
>                 URL: https://issues.apache.org/jira/browse/SPARK-14117
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Franklyn Dsouza
>            Priority: Minor
>
> When writing a DataFrame as Parquet and using partitionBy on the writer to 
> generate multiple output folders, the resulting Parquet files still contain 
> the partitioning column.
> Here's a simple example:
> {code}
> from pyspark.sql import Row
>
> # `sql` is a SQLContext, e.g. sql = SQLContext(sc)
> df = sql.createDataFrame([
>   Row(a="folder 1 message 1", folder="folder1"),
>   Row(a="folder 1 message 2", folder="folder1"),
>   Row(a="folder 1 message 3", folder="folder1"),
>   Row(a="folder 2 message 1", folder="folder2"),
>   Row(a="folder 2 message 2", folder="folder2"),
>   Row(a="folder 2 message 3", folder="folder2"),
> ])
> df.write.partitionBy('folder').parquet('output')
> {code}
> produces the following output:
> {code}
> +------------------+-------+
> |                 a| folder|
> +------------------+-------+
> |folder 2 message 1|folder2|
> +------------------+-------+
> {code}
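> One way to confirm that the column really is stored in the Parquet files 
> themselves (a minimal sketch, reusing the sql context from above): reading a 
> single partition directory directly means there are no folder=... directories 
> below the path, so only columns physically written into the files should appear.
> {code}
> # Read one partition directory directly; columns shown here come from the
> # Parquet files themselves, not from the directory name.
> sql.read.parquet('output/folder=folder2').show()
> {code}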
> whereas 
> {code}
> df.write.partitionBy('folder').json('output')
> {code}
> produces:
> {code}
> {"a":"folder 2 message 1"}
> {code}
> without the partitioning column.
> I'm assuming this is a bug because of the inconsistent behaviour between the 
> two output formats.
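> For context, a minimal sketch of the read side (assuming partition discovery 
> applies to both readers here): reading the base output path back rebuilds the 
> folder column from the folder=... directory names, so omitting it from the 
> data files, as the JSON writer does, loses no information.
> {code}
> # Partition discovery infers `folder` from the directory names on read,
> # whether or not it is also stored inside the data files.
> sql.read.json('output').show()     # columns: a, folder
> sql.read.parquet('output').show()  # columns: a, folder
> {code}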


