Hi,

My question is specifically about PR #29000
<https://github.com/apache/spark/pull/29000/files#r649580767> for
SPARK-29302 <https://issues.apache.org/jira/browse/SPARK-29302>.

To my understanding, the PR introduces a separate staging directory at
job commit to avoid commit collisions. In
SQLHadoopMapReduceCommitProtocol, the new staging directory is set only
when SQLConf.OUTPUT_COMMITTER_CLASS is not null: code
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58>,
and in the current Spark repo, OUTPUT_COMMITTER_CLASS is set only for the
Parquet format: code
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96>.
I did not find similar code in the Orc path that sets that config.

If I understand it correctly, without SQLConf.OUTPUT_COMMITTER_CLASS
being set properly (as is the case for the Orc format),
SQLHadoopMapReduceCommitProtocol will still use the original staging
directory, which may void the fix in the PR: the commit collision can
still happen. In other words, the fix currently seems effective only for
Parquet, not for non-Parquet files.
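
To make the gating I am describing concrete, here is a minimal,
self-contained Scala sketch of the branching as I read it. The function
name, parameters, and directory naming below are made up for
illustration only; they are not Spark's actual API:

```scala
// Hypothetical model of the gating described above -- not Spark's real
// code. `outputCommitterClass` stands in for SQLConf.OUTPUT_COMMITTER_CLASS.
def stagingDir(outputCommitterClass: Option[String],
               defaultDir: String,
               jobId: String): String =
  outputCommitterClass match {
    // Parquet sets the committer-class config, so it takes the branch
    // that yields a job-unique staging directory (the PR's fix).
    case Some(_) => s"$defaultDir-$jobId"
    // Orc and other formats leave the config unset, so the original
    // shared staging directory is kept -- collisions remain possible.
    case None => defaultDir
  }
```

If the branching really works this way, only formats that set the config
benefit from the job-unique directory.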

Could someone confirm whether this is a real problem, or am I missing
something here? Thanks!

Best Regards,
Tony Zhang
