Chenyu Zheng created SPARK-54003:
------------------------------------
Summary: Use the staging directory as the output directory before
job commit
Key: SPARK-54003
URL: https://issues.apache.org/jira/browse/SPARK-54003
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.1.0
Reporter: Chenyu Zheng
Spark SQL currently uses the partition location or table location as the commit
path (except in *_dynamic partition overwrite_* mode and *_custom partition path_*
mode). This causes at least the following issues:
* As described in SPARK-37210, conflicts can occur when multiple jobs writing to
different partitions of the same table run concurrently. Using a staging
directory can avoid this issue.
* As described in SPARK-53937, using a staging directory allows for near-atomic
operations.
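
To make the two points above concrete, here is a minimal sketch on a local file system. It is only an analogy, not Spark's implementation: the `.staging-<jobId>` layout and the helper names `write_partition` and `commit_job` are hypothetical, and it assumes a file system where a directory rename is atomic and that the target partition does not exist yet.

```python
import os
import tempfile
import uuid


def write_partition(table_dir, partition, rows):
    """Write rows into a per-job staging directory and return that directory."""
    job_id = uuid.uuid4().hex
    # Hypothetical layout: <table>/.staging-<jobId>/<partition>/part-00000
    staging_dir = os.path.join(table_dir, f".staging-{job_id}")
    out_dir = os.path.join(staging_dir, partition)
    os.makedirs(out_dir)
    with open(os.path.join(out_dir, "part-00000"), "w") as f:
        f.write("\n".join(rows))
    return staging_dir


def commit_job(table_dir, staging_dir, partition):
    """Publish the staged partition with a single rename, then remove staging."""
    final_dir = os.path.join(table_dir, partition)
    os.rename(os.path.join(staging_dir, partition), final_dir)
    os.rmdir(staging_dir)
    return final_dir


def visible(table_dir):
    """List partitions a reader would see (hidden staging dirs are skipped)."""
    return sorted(d for d in os.listdir(table_dir) if not d.startswith("."))


table = tempfile.mkdtemp()
# Two jobs write different partitions of the same table at the same time.
staging_a = write_partition(table, "dt=2024-01-01", ["a", "b"])
staging_b = write_partition(table, "dt=2024-01-02", ["c"])
before_commit = visible(table)

commit_job(table, staging_a, "dt=2024-01-01")
commit_job(table, staging_b, "dt=2024-01-02")
after_commit = visible(table)
```

Because each job owns its own staging directory, neither job can remove or overwrite the other's temporary files, and readers see a partition only after the rename at job commit.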
*_Dynamic partition overwrite_* mode and *_custom partition path_* mode already
use the staging directory, but the two are implemented differently and could be
simplified into a single unified process. In addition,
https://github.com/apache/spark/pull/29000 sets the staging directory as the
output directory of FileOutputCommitter, which is safer. The commit process
should be changed to follow this approach.