[
https://issues.apache.org/jira/browse/SPARK-56588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kazuyuki Tanimura updated SPARK-56588:
--------------------------------------
Description:
spark.sql.sources.partitionOverwriteMode=DYNAMIC does not work as expected on
Spark 4.1+ with HDFS. Partition overwrites behave as appends: old data is
preserved alongside new data in the same partition directory, resulting in
duplicate rows. This affects both the session config path (spark.conf.set())
and the inline write option path (df.write.option()).
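
To make the difference between the expected and the reported behavior concrete, here is a small plain-Python model. It is not Spark code and does not touch HDFS; the function names and the sample partition values are hypothetical, chosen only to illustrate why an append in place of a dynamic overwrite produces duplicate rows.

```python
# Plain-Python model of the two behaviors; rows are (partition, value) pairs.
# All names and sample values here are illustrative assumptions.

def dynamic_overwrite(existing, incoming):
    """Expected semantics: replace only the partitions present in `incoming`."""
    touched = {part for part, _ in incoming}
    kept = [(part, value) for part, value in existing if part not in touched]
    return kept + list(incoming)

def append(existing, incoming):
    """Reported behavior: old files stay next to the new ones."""
    return list(existing) + list(incoming)

existing = [("2024-01-01", "a"), ("2024-01-02", "b")]
incoming = [("2024-01-02", "b")]  # re-write of a single partition

expected = dynamic_overwrite(existing, incoming)
observed = append(existing, incoming)

print("expected:", expected)
print("observed:", observed)
```

In the model, re-writing partition 2024-01-02 leaves one row per partition under the expected semantics, while the append path keeps the old row as well and so yields a duplicate.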
This looks related to https://issues.apache.org/jira/browse/SPARK-54248
Setting
--conf
spark.sql.parquet.output.committer.class=org.apache.parquet.hadoop.ParquetOutputCommitter
--conf
spark.sql.sources.commitProtocolClass=org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
solves the problem.
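
For anyone who wants to apply the workaround from code rather than on the command line, the following is a minimal sketch. The two keys and values are the ones quoted in this report; the helper names (`submit_args`, `apply_to_builder`) are hypothetical. `apply_to_builder` only assumes an object with a `config(key, value)` method that returns the builder, which is how `SparkSession.builder` behaves.

```python
# Workaround confs as quoted in this report.
WORKAROUND_CONFS = {
    "spark.sql.parquet.output.committer.class":
        "org.apache.parquet.hadoop.ParquetOutputCommitter",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
}

def submit_args(confs=WORKAROUND_CONFS):
    """Return the confs as a flat ['--conf', 'key=value', ...] argument list."""
    args = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args

def apply_to_builder(builder, confs=WORKAROUND_CONFS):
    """Set each conf on a builder exposing config(key, value), e.g. SparkSession.builder."""
    for key, value in confs.items():
        builder = builder.config(key, value)
    return builder

if __name__ == "__main__":
    # Prints the flags in a form that can be pasted after `spark-submit`.
    print(" ".join(submit_args()))
```

With PySpark installed, the intended usage would be `apply_to_builder(SparkSession.builder).getOrCreate()`. Because these are committer settings, setting them when the session is created is probably safer than changing them afterwards.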
Spark automatically applies
org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter /
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol if these confs are
left empty and the spark-hadoop-cloud module is available.
The magic committer itself may not be the problem, but silently breaking
spark.sql.sources.partitionOverwriteMode=DYNAMIC behavior is not ideal.
> Dynamic partition overwrite (partitionOverwriteMode=DYNAMIC) behaves as
> append for Spark 4.1+ and HDFS
> -------------------------------------------------------------------------------------------------------
>
> Key: SPARK-56588
> URL: https://issues.apache.org/jira/browse/SPARK-56588
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 4.2.0, 4.1.1, 4.1.2
> Reporter: Kazuyuki Tanimura
> Priority: Major