[ https://issues.apache.org/jira/browse/SPARK-36187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tony Zhang updated SPARK-36187:
-------------------------------
    Description:
Hi, my question here is specifically about [PR #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for SPARK-29302.

To my understanding, the PR introduces a different staging directory at job commit to avoid commit collisions. In SQLHadoopMapReduceCommitProtocol, the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not null: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58], and in the current Spark repo, OUTPUT_COMMITTER_CLASS seems to be set only for the Parquet format: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96].

However, I didn't find similar code for ORC that sets that config. If I understand correctly, when SQLConf.OUTPUT_COMMITTER_CLASS is not set (as with the ORC format), SQLHadoopMapReduceCommitProtocol will still use the original staging directory, which may void the fix in the PR. In that case the commit collision may still happen, so the fix is currently effective only for Parquet, not for non-Parquet formats.

Could someone confirm whether this is a potential problem? Thanks!

[~duripeng] [~dagrawal3409]

  was:
Hi, my question here is specifically about [PR #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for SPARK-29302.

To my understanding, the PR introduces a different staging directory at job commit to avoid commit collisions.
In SQLHadoopMapReduceCommitProtocol, the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not null: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58]; however, in the current Spark repo, OUTPUT_COMMITTER_CLASS seems to be set only for the Parquet format: [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96]. I didn't find similar behavior in the ORC-related code.

Does this mean that the new staging directory will not take effect for non-Parquet formats? Could that be a potential problem, or am I missing something here? Thanks!

[~duripeng] [~dagrawal3409]


> Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-36187
>                 URL: https://issues.apache.org/jira/browse/SPARK-36187
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Tony Zhang
>            Priority: Minor
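
The concern raised in the description can be illustrated with a small, self-contained model. This is only a sketch of the reporter's reading of the commit protocol, not Spark code and not a confirmation that Spark behaves this way: `WriteJob`, `committer_output_path`, the committer class strings, and the `.spark-staging-<job id>` directory layout are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WriteJob:
    """Hypothetical stand-in for one write job; not a Spark class."""
    output_path: str
    job_id: str
    dynamic_partition_overwrite: bool
    # Models the value of SQLConf.OUTPUT_COMMITTER_CLASS; None means "not set".
    output_committer_class: Optional[str] = None


def committer_output_path(job: WriteJob) -> str:
    """Directory the committer would write under, per the reporter's reading.

    Assumption: the per-job staging directory is used only when a committer
    class is configured AND dynamic partition overwrite is enabled.
    """
    staging_dir = f"{job.output_path}/.spark-staging-{job.job_id}"
    if job.output_committer_class is not None and job.dynamic_partition_overwrite:
        return staging_dir
    return job.output_path


if __name__ == "__main__":
    # Two concurrent jobs writing to the same table location.
    parquet_a = WriteJob("/warehouse/t", "job-a", True, "ParquetOutputCommitter")
    parquet_b = WriteJob("/warehouse/t", "job-b", True, "ParquetOutputCommitter")
    orc_a = WriteJob("/warehouse/t", "job-a", True)  # committer class not set
    orc_b = WriteJob("/warehouse/t", "job-b", True)

    # With a committer class set, each job gets its own directory.
    print(committer_output_path(parquet_a), committer_output_path(parquet_b))
    # Without one, both jobs resolve to the same directory and could collide.
    print(committer_output_path(orc_a), committer_output_path(orc_b))
```

Under these assumptions, the two Parquet-like jobs resolve to distinct per-job directories, while the two ORC-like jobs resolve to the same path, which is the collision scenario the question asks about.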