[ https://issues.apache.org/jira/browse/SPARK-31968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135316#comment-17135316 ]
Dongjoon Hyun commented on SPARK-31968:
---------------------------------------

Hi, [~hyukjin.kwon]. Is [~qxzzxq] TJX2014?

> write.partitionBy() creates duplicate subdirectories when user provides duplicate columns
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31968
>                 URL: https://issues.apache.org/jira/browse/SPARK-31968
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.6
>            Reporter: Xuzhou Qin
>            Assignee: Xuzhou Qin
>            Priority: Major
>             Fix For: 3.0.1, 3.1.0, 2.4.7
>
>
> I recently noticed that if there are duplicate elements in the argument of write.partitionBy(), the same partition subdirectory will be created multiple times.
> For example:
> {code:java}
> import org.apache.spark.sql.{DataFrame, SaveMode}
> import spark.implicits._
>
> val df: DataFrame = Seq(
>   (1, "p1", "c1", 1L),
>   (2, "p2", "c2", 2L),
>   (2, "p1", "c2", 2L),
>   (3, "p3", "c3", 3L),
>   (3, "p2", "c3", 3L),
>   (3, "p3", "c3", 3L)
> ).toDF("col1", "col2", "col3", "col4")
>
> df.write
>   .partitionBy("col1", "col1") // we have "col1" twice
>   .mode(SaveMode.Overwrite)
>   .csv("output_dir"){code}
> The above code produces an output directory with this structure:
> {code:java}
> output_dir
>  |
>  |--col1=1
>  |   |--col1=1
>  |
>  |--col1=2
>  |   |--col1=2
>  |
>  |--col1=3
>      |--col1=3{code}
> And we won't be able to read the output back:
> {code:java}
> spark.read.csv("output_dir").show()
> // Exception in thread "main" org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the partition schema: `col1`;{code}
> I am not sure whether partitioning a dataframe twice by the same column makes sense in real-world applications, but it causes schema inference problems in tools like the AWS Glue crawler.
> Should Spark deduplicate the partition columns? Or should it throw an exception when duplicate columns are detected?
> If this behaviour is unexpected, I will work on a fix.
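For anyone still running an affected release, a caller-side workaround is to deduplicate the partition column list before handing it to partitionBy(). The sketch below is not the fix shipped for this ticket; it is only a minimal illustration, assuming a local SparkSession, with the DataFrame and output path mirroring the reproduction above and the variable names (requestedCols, partitionCols) chosen here for illustration:

{code:java}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(
  (1, "p1", "c1", 1L),
  (2, "p2", "c2", 2L)
).toDF("col1", "col2", "col3", "col4")

// Partition columns as supplied by the caller; may contain duplicates.
val requestedCols = Seq("col1", "col1")

// Seq.distinct keeps the first occurrence of each name, so the ordering
// of the partition hierarchy is preserved while duplicates are dropped.
val partitionCols = requestedCols.distinct

df.write
  .partitionBy(partitionCols: _*) // single col1=... level instead of col1=.../col1=...
  .mode(SaveMode.Overwrite)
  .csv("output_dir")
{code}

This only avoids the duplicated subdirectory layout on the writing side; whether Spark itself should silently deduplicate or fail fast on duplicate partition columns is the question the ticket raises.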