[ 
https://issues.apache.org/jira/browse/SPARK-56588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-56588:
--------------------------------------
    Description: 
spark.sql.sources.partitionOverwriteMode=DYNAMIC does not work as expected with 
Spark 4.1+ on HDFS. Partition overwrites behave as appends: old data is 
preserved alongside new data in the same partition directory, resulting in 
duplicate rows. This affects both the session config path (spark.conf.set()) 
and the inline write option path (df.write.option()).
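
To make the expected versus reported behavior concrete, here is a minimal sketch in Python. The two pure functions model what a dynamic partition overwrite should do and what an append does; they are an illustration, not Spark code. The `repro` function is a hypothetical reproduction: the sample data, column names, and the `path` argument are assumptions, and it needs a live Spark session plus a writable HDFS location to run.

```python
def dynamic_overwrite(existing, incoming):
    """Expected DYNAMIC semantics: partitions present in `incoming` are
    replaced wholesale; partitions absent from `incoming` are untouched."""
    result = {part: list(rows) for part, rows in existing.items()}
    for part, rows in incoming.items():
        result[part] = list(rows)
    return result


def append(existing, incoming):
    """Append semantics: new rows land next to the old rows."""
    result = {part: list(rows) for part, rows in existing.items()}
    for part, rows in incoming.items():
        result.setdefault(part, []).extend(rows)
    return result


def repro(spark, path):
    """Hypothetical repro (assumed data and path); exercises both the
    session config path and the inline write option path."""
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    first = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "part"])
    first.write.mode("overwrite").partitionBy("part").parquet(path)

    # Overwrite only partition part=a.
    second = spark.createDataFrame([(3, "a")], ["id", "part"])
    (second.write.mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")
        .partitionBy("part")
        .parquet(path))

    # Under DYNAMIC semantics the ids should be [2, 3]; if id 1 is still
    # present, the overwrite behaved as an append, as described above.
    return sorted(row.id for row in spark.read.parquet(path).collect())
```

In the model, `dynamic_overwrite({"a": [1], "b": [2]}, {"a": [3]})` drops the old row in partition `a`, while `append` keeps it, which is the duplicate-row symptom the report describes.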

This looks related to https://issues.apache.org/jira/browse/SPARK-54248

Setting 

 --conf spark.sql.parquet.output.committer.class=org.apache.parquet.hadoop.ParquetOutputCommitter
 --conf spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
 

Solves the problem.
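
As a convenience, the two workaround confs can be kept in one place and turned into spark-submit arguments or applied to a session builder. This is only a sketch: the conf keys and class names are the ones quoted in this report, while the helper names (`WORKAROUND_CONFS`, `spark_submit_conf_args`, `apply_to_builder`) are hypothetical.

```python
# Conf keys and values as given in this report.
WORKAROUND_CONFS = {
    "spark.sql.parquet.output.committer.class":
        "org.apache.parquet.hadoop.ParquetOutputCommitter",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources."
        "SQLHadoopMapReduceCommitProtocol",
}


def spark_submit_conf_args(confs):
    """Build a flat ['--conf', 'key=value', ...] argument list."""
    args = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


def apply_to_builder(builder, confs=WORKAROUND_CONFS):
    """Apply the confs to a SparkSession builder
    (for example, SparkSession.builder) and return it."""
    for key, value in confs.items():
        builder = builder.config(key, value)
    return builder
```

For example, `" ".join(spark_submit_conf_args(WORKAROUND_CONFS))` yields the two `--conf` flags in a form that can be pasted into a spark-submit command line.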

Spark automatically applies 
org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter / 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol if these confs are 
unset and the spark-hadoop-cloud module is available. 

The magic committer itself may not be a problem, but silently breaking 
spark.sql.sources.partitionOverwriteMode=DYNAMIC behavior is not ideal. 


> Dynamic partition overwrite (partitionOverwriteMode=DYNAMIC) behaves as 
> append for Spark 4.1+ and HDFS 
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-56588
>                 URL: https://issues.apache.org/jira/browse/SPARK-56588
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 4.2.0, 4.1.1, 4.1.2
>            Reporter: Kazuyuki Tanimura
>            Priority: Major
>
> spark.sql.sources.partitionOverwriteMode=DYNAMIC does not work as expected 
> with Spark 4.1+ on HDFS. Partition overwrites behave as appends: old data is 
> preserved alongside new data in the same partition directory, resulting in 
> duplicate rows. This affects both the session config path (spark.conf.set()) 
> and the inline write option path (df.write.option()).
> This looks related to https://issues.apache.org/jira/browse/SPARK-54248
> Setting 
>  --conf spark.sql.parquet.output.committer.class=org.apache.parquet.hadoop.ParquetOutputCommitter
>  --conf spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
>  
> Solves the problem.
> Spark automatically applies 
> org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter / 
> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol if these confs are 
> unset and the spark-hadoop-cloud module is available. 
> The magic committer itself may not be a problem, but silently breaking 
> spark.sql.sources.partitionOverwriteMode=DYNAMIC behavior is not ideal. 


