Franklyn Dsouza created SPARK-14117:
---------------------------------------

             Summary: write.partitionBy retains partitioning column when 
outputting Parquet
                 Key: SPARK-14117
                 URL: https://issues.apache.org/jira/browse/SPARK-14117
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.6.1
            Reporter: Franklyn Dsouza
            Priority: Minor


When writing a DataFrame as Parquet using partitionBy on the writer to 
generate multiple output folders, the resulting Parquet files still contain 
the partitioning column as a regular data column.

Here's a simple example:

from pyspark.sql import Row

df = sqlContext.createDataFrame([
  Row(a="folder 1 message 1", folder="folder1"),
  Row(a="folder 1 message 2", folder="folder1"),
  Row(a="folder 1 message 3", folder="folder1"),
  Row(a="folder 2 message 1", folder="folder2"),
  Row(a="folder 2 message 2", folder="folder2"),
  Row(a="folder 2 message 3", folder="folder2"),
])

df.write.partitionBy('folder').parquet('output')

produces the following output when a partition is read back:

+------------------+-------+
|                 a| folder|
+------------------+-------+
|folder 2 message 1|folder2|
+------------------+-------+

whereas df.write.partitionBy('folder').json('output')

produces:


{"a":"folder 2 message 1"}

without the partitioning column.
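For context, with Hive-style partitioned output the partitioning column is expected to be recoverable from the directory names (e.g. output/folder=folder1/) rather than stored inside the data files, which is why the JSON writer dropping it looks like the intended behaviour. A minimal pure-Python sketch of that layout (the file names and JSON format here are illustrative, not what Spark actually emits):

```python
import json
import os
import tempfile

rows = [
    {"a": "folder 1 message 1", "folder": "folder1"},
    {"a": "folder 2 message 1", "folder": "folder2"},
]

out = tempfile.mkdtemp()
for row in rows:
    # Hive-style layout: the partition value is encoded in the directory name.
    part_dir = os.path.join(out, "folder=%s" % row["folder"])
    os.makedirs(part_dir, exist_ok=True)
    # The partition column is dropped from the record written to the file ...
    record = {k: v for k, v in row.items() if k != "folder"}
    with open(os.path.join(part_dir, "part-0.json"), "a") as f:
        f.write(json.dumps(record) + "\n")

# ... and a reader reconstructs 'folder' by parsing "folder=<value>" from
# each file's path, so no information is lost.
```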

I'm assuming this is a bug given the inconsistent behaviour between the two 
output formats.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
