Kazuyuki Tanimura created SPARK-56588:
-----------------------------------------
Summary: Dynamic partition overwrite
(partitionOverwriteMode=DYNAMIC) behaves as append for Spark 4.1+ and HDFS
Key: SPARK-56588
URL: https://issues.apache.org/jira/browse/SPARK-56588
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 4.1.1, 4.2.0, 4.1.2
Reporter: Kazuyuki Tanimura
Setting spark.sql.sources.partitionOverwriteMode=DYNAMIC does not work as
expected on Spark 4.1+ with HDFS. Partition overwrites behave as appends: old
data is preserved alongside new data in the same partition directory,
resulting in duplicate rows. This affects both the session config path
(spark.conf.set()) and the inline write option path (df.write.option()).
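A minimal reproduction sketch in PySpark may help here. It is not taken from the report: the function name, the column names (p, value), and the two-write scenario are assumptions made for illustration. It writes the same partition twice with dynamic overwrite through the inline write option and returns the row count of that partition.

```python
def overwrite_partition_twice(spark, path):
    """Write partition p=1 twice with dynamic overwrite; return its row count.

    Assumes `path` does not exist yet. If the second write replaces the
    partition, the count is 1. If it behaves as an append, as the report
    describes, the old row survives and the count is 2.
    """
    first = spark.createDataFrame([(1, "old")], ["p", "value"])
    second = spark.createDataFrame([(1, "new")], ["p", "value"])
    for df in (first, second):
        (
            df.write.mode("overwrite")
            # Inline write option path; the session config path would instead
            # call spark.conf.set("spark.sql.sources.partitionOverwriteMode",
            # "dynamic") before writing.
            .option("partitionOverwriteMode", "dynamic")
            .partitionBy("p")
            .parquet(path)
        )
    return spark.read.parquet(path).where("p = 1").count()
```

To try it, pass an active SparkSession and an HDFS location of your choosing, for example overwrite_partition_twice(spark, "hdfs:///tmp/spark_56588_repro"), where the path is a hypothetical placeholder.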
This looks related to https://issues.apache.org/jira/browse/SPARK-54248
Setting
--conf
spark.sql.parquet.output.committer.class=org.apache.parquet.hadoop.ParquetOutputCommitter
--conf
spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
solves the problem.
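For jobs launched from scripts, the two workaround confs above could be assembled into a spark-submit command line roughly as follows. This is only a sketch: the helper name spark_submit_args and the application file job.py are hypothetical, while the conf keys and values are copied from the report.

```python
import shlex

# Workaround confs exactly as given in the report.
WORKAROUND_CONFS = {
    "spark.sql.parquet.output.committer.class":
        "org.apache.parquet.hadoop.ParquetOutputCommitter",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
}


def spark_submit_args(confs):
    """Turn a {key: value} mapping into a flat list of --conf arguments."""
    args = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


# job.py is a placeholder for the application being submitted.
command = ["spark-submit", *spark_submit_args(WORKAROUND_CONFS), "job.py"]
print(shlex.join(command))
```

The same key/value pairs can presumably also be supplied through spark-defaults.conf or SparkSession.builder.config(); the report itself only mentions the --conf form.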
Spark automatically applies
org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter /
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol if these two confs
are not set and the spark-hadoop-cloud module is available.
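To see what a given session reports for these two confs, something like the following PySpark sketch could be used. The helper name and the "<unset>" placeholder are assumptions for illustration, and the report does not say whether the automatically applied classes are visible through spark.conf.get(), so treat the output as a hint rather than proof of which committer is in effect.

```python
COMMIT_CONF_KEYS = (
    "spark.sql.parquet.output.committer.class",
    "spark.sql.sources.commitProtocolClass",
)


def effective_commit_confs(spark, keys=COMMIT_CONF_KEYS):
    """Return the value the session reports for each commit-related conf key."""
    # "<unset>" is a placeholder default chosen for this sketch; it is returned
    # only when the session has no value (and no built-in default) for the key.
    return {key: spark.conf.get(key, "<unset>") for key in keys}
```

Calling effective_commit_confs(spark) before and after applying the workaround gives a quick way to compare the two configurations.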