[ https://issues.apache.org/jira/browse/SPARK-14117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Franklyn Dsouza closed SPARK-14117.
-----------------------------------
    Resolution: Fixed

> write.partitionBy retains partitioning column when outputting Parquet
> ---------------------------------------------------------------------
>
>                 Key: SPARK-14117
>                 URL: https://issues.apache.org/jira/browse/SPARK-14117
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Franklyn Dsouza
>            Priority: Minor
>
> When writing a DataFrame as Parquet using partitionBy on the writer to generate multiple output folders, the resulting Parquet files still contain the partitioning column as a data column.
> Here's a simple example:
> {code}
> from pyspark.sql import Row
>
> # 'sql' is an existing SQLContext
> df = sql.createDataFrame([
>     Row(a="folder 1 message 1", folder="folder1"),
>     Row(a="folder 1 message 2", folder="folder1"),
>     Row(a="folder 1 message 3", folder="folder1"),
>     Row(a="folder 2 message 1", folder="folder2"),
>     Row(a="folder 2 message 2", folder="folder2"),
>     Row(a="folder 2 message 3", folder="folder2"),
> ])
> df.write.partitionBy('folder').parquet('output')
> {code}
> Reading back one of the output files produces:
> {code}
> +------------------+-------+
> |                 a| folder|
> +------------------+-------+
> |folder 2 message 1|folder2|
> +------------------+-------+
> {code}
> whereas
> {code}
> df.write.partitionBy('folder').json('output')
> {code}
> produces:
> {code}
> {"a":"folder 2 message 1"}
> {code}
> without the partitioning column.
> I'm assuming this is a bug because of the different behaviour between the two formats.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
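Editor's note: the reason the JSON output can omit the column is that partitionBy encodes the partition value in a Hive-style directory name (e.g. output/folder=folder2/part-...), so a reader can reconstruct the column from the path. Below is a minimal sketch in plain Python (not PySpark) of that path-parsing step; the file path is a hypothetical example of the layout, and the percent-decoding mirrors how Spark escapes special characters in partition values.

```python
# Sketch: recover partitioning columns from a Hive-style output path.
# Assumes paths of the form .../key=value/... as produced by partitionBy.
from urllib.parse import unquote

def partition_values(path):
    """Extract key=value partition segments from a Hive-style path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = unquote(value)  # partition values are percent-encoded
    return values

print(partition_values("output/folder=folder2/part-r-00000.json"))
# {'folder': 'folder2'}
```

This is why dropping the column from the data files (as the JSON writer does) loses no information: the value round-trips through the directory name when the partitioned output is read back.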