[jira] [Updated] (SPARK-36187) Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats

Tony Zhang (Jira) Fri, 16 Jul 2021 15:59:10 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-36187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tony Zhang updated SPARK-36187:
-------------------------------
    Description: 
Hi, my question here is specifically about [PR 
#29000|https://github.com/apache/spark/pull/29000/files#r649580767] for 
SPARK-29302.

As I understand it, the PR introduces a different staging directory at job 
commit to avoid commit collisions. In SQLHadoopMapReduceCommitProtocol, the 
new staging directory is set only when SQLConf.OUTPUT_COMMITTER_CLASS is not 
null: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58],
 and in the current Spark repo, OUTPUT_COMMITTER_CLASS seems to be set only for 
the Parquet format: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96].

However, I did not find any ORC-related code that sets this config. If I 
understand correctly, when SQLConf.OUTPUT_COMMITTER_CLASS is not set (as for 
the ORC format), SQLHadoopMapReduceCommitProtocol still uses the original 
staging directory, which bypasses the fix from the PR. In that case the commit 
collision can still happen, so the fix is currently effective only for Parquet 
and not for other formats.
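
To make the concern concrete, here is a toy Python model of the branch as I 
read it. It is only a sketch and not Spark's code: WriteJob, 
committer_output_path, may_collide, and the ".spark-staging-<job_id>" naming 
are illustrative assumptions, and committer_class merely stands in for 
SQLConf.OUTPUT_COMMITTER_CLASS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WriteJob:
    """Hypothetical stand-in for one dynamicPartitionOverwrite write job."""
    path: str                        # final output directory of the table
    job_id: str
    dynamic_partition_overwrite: bool
    committer_class: Optional[str]   # models SQLConf.OUTPUT_COMMITTER_CLASS


def committer_output_path(job: WriteJob) -> str:
    """Toy model: which directory the output committer is pointed at."""
    # Assumed naming; the only point is that it differs per job id.
    staging_dir = f"{job.path}/.spark-staging-{job.job_id}"
    # The per-job directory is chosen only if a committer class is configured.
    if job.committer_class is not None and job.dynamic_partition_overwrite:
        return staging_dir
    # Otherwise the committer keeps working under the shared original path.
    return job.path


def may_collide(a: WriteJob, b: WriteJob) -> bool:
    """Two concurrent jobs can collide if their committers share a directory."""
    return committer_output_path(a) == committer_output_path(b)


# Example value only; any non-None class name triggers the branch in this model.
parquet_committer = "org.apache.parquet.hadoop.ParquetOutputCommitter"

parquet_a = WriteJob("/warehouse/t", "job-a", True, parquet_committer)
parquet_b = WriteJob("/warehouse/t", "job-b", True, parquet_committer)
orc_a = WriteJob("/warehouse/t", "job-a", True, None)
orc_b = WriteJob("/warehouse/t", "job-b", True, None)

print(may_collide(parquet_a, parquet_b))
print(may_collide(orc_a, orc_b))
```

In this model the two Parquet jobs get distinct directories, while the two ORC 
jobs resolve to the same one, which is the situation I am asking about.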

Could someone confirm whether this is a real problem? Thanks!

[~duripeng] [~dagrawal3409]

  was:
Hi, my question here is specifically about [PR 
#29000|https://github.com/apache/spark/pull/29000/files#r649580767] for 
SPARK-29302.

To my understanding, the PR is to introduce a different staging directory at 
job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, the 
new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not 
null: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58],
 however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for 
parquet formats: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96].

However I didn't find similar behavior in Orc related code. Does it mean that 
this new staging directory will not take effect for non-Parquet formats? Could 
that be a potential problem? or am I missing something here?

Thanks!

[~duripeng] [~dagrawal3409]


> Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet 
> formats
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-36187
>                 URL: https://issues.apache.org/jira/browse/SPARK-36187
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Tony Zhang
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
