Franklyn Dsouza created SPARK-14117: ---------------------------------------
             Summary: write.partitionBy retains partitioning column when outputting Parquet
                 Key: SPARK-14117
                 URL: https://issues.apache.org/jira/browse/SPARK-14117
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.1
            Reporter: Franklyn Dsouza
            Priority: Minor


When writing a DataFrame as Parquet, using partitionBy on the writer to generate multiple output folders, the resulting Parquet files have columns containing the partitioning column. Here's a simple example:

from pyspark.sql import Row

df = sqlContext.createDataFrame([
    Row(a="folder 1 message 1", folder="folder1"),
    Row(a="folder 1 message 2", folder="folder1"),
    Row(a="folder 1 message 3", folder="folder1"),
    Row(a="folder 2 message 1", folder="folder2"),
    Row(a="folder 2 message 2", folder="folder2"),
    Row(a="folder 2 message 3", folder="folder2"),
])
df.write.partitionBy('folder').parquet('output')

produces the following output:

+------------------+-------+
|                 a| folder|
+------------------+-------+
|folder 2 message 1|folder2|
+------------------+-------+

whereas

df.write.partitionBy('folder').json('output')

produces:

{"a":"folder 2 message 1"}

without the partitioning column. I'm assuming this is a bug because of the different behaviour between the two.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
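For reference, the partitioned-write behaviour the JSON writer exhibits (and that this issue expects of Parquet) can be sketched in plain Python: group rows by the partition column, write one `col=value` folder per value, and drop the partition column from each record, since it is already encoded in the directory name. This is only an illustration of the expected semantics, not Spark's implementation; `partitioned_write` is a hypothetical helper named here for the sketch.

```python
import json
import os
import tempfile

def partitioned_write(rows, partition_col, out_dir):
    """Write rows as JSON lines under out_dir/<col>=<value>/part-0.json,
    omitting the partition column from the file contents (it is encoded
    in the folder name instead)."""
    groups = {}
    for row in rows:
        # Drop the partition column from the record itself.
        record = {k: v for k, v in row.items() if k != partition_col}
        groups.setdefault(row[partition_col], []).append(record)
    for value, records in groups.items():
        folder = os.path.join(out_dir, "%s=%s" % (partition_col, value))
        os.makedirs(folder)
        with open(os.path.join(folder, "part-0.json"), "w") as f:
            for rec in records:
                f.write(json.dumps(rec, separators=(",", ":")) + "\n")

rows = [
    {"a": "folder 1 message 1", "folder": "folder1"},
    {"a": "folder 2 message 1", "folder": "folder2"},
]
out = tempfile.mkdtemp()
partitioned_write(rows, "folder", out)
with open(os.path.join(out, "folder=folder2", "part-0.json")) as f:
    print(f.read().strip())  # prints {"a":"folder 2 message 1"}
```

The files under each folder carry only the non-partition columns, matching the JSON output shown above; the report is that the Parquet writer keeps the column in the files as well.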