Kazuyuki Tanimura created SPARK-56588:
-----------------------------------------

             Summary: Dynamic partition overwrite 
(partitionOverwriteMode=DYNAMIC) behaves as append for Spark 4.1+ and HDFS 
                 Key: SPARK-56588
                 URL: https://issues.apache.org/jira/browse/SPARK-56588
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 4.1.1, 4.2.0, 4.1.2
            Reporter: Kazuyuki Tanimura


spark.sql.sources.partitionOverwriteMode=DYNAMIC does not work as expected for 
Spark  4.1+ and HDFS. Partition overwrites behave as appends, old data is 
preserved alongside new data in the same partition directory, resulting in 
duplicate rows. This affects both the session config path (spark.conf.set()) 
and the inline write option path (df.write.option())

This looks related to https://issues.apache.org/jira/browse/SPARK-54248

Setting 

 —conf 
spark.sql.parquet.output.committer.class=org.apache.parquet.hadoop.ParquetOutputCommitter
 —conf 
spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
 

Solves the problem.

Spark automatically applies 
org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter / 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol if the confs are 
empty and spark-hadoop-cloud module is available. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to