[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923299#comment-16923299 ]

Steve Loughran commented on SPARK-28945:
----------------------------------------

It's a core part of the Hadoop MR commit protocols. I think the best (only!)
docs of these, other than the most confusing piece of co-recursive code I've
ever had to step through while taking notes, are here:
https://github.com/steveloughran/zero-rename-committer/releases/tag/tag_draft_005


Every MR app attempt has its own attempt ID; when the Hadoop MR engine restarts
attempt N, it looks for the temp dir of attempt N-1 and can use it to recover
from failure. Spark's solution to the app restart problem is "be faster and fix
failures by restarting entirely", so the app attempt is always 0.
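
For reference, this is roughly how FileOutputCommitter lays out its work under
the destination (the task attempt names below are illustrative; note where the
app attempt ID appears):

    dest/_temporary/0/                         <- job attempt dir ("0" = app attempt ID)
    dest/_temporary/0/_temporary/attempt_.../  <- in-progress task attempt output
    dest/_temporary/0/task_.../                <- committed task output (v1 algorithm)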

If you have two jobs writing to the same destination path, their output is
inevitably going to conflict: both use the same attempt dir, so when the first
job commits and deletes it, the second job will fail.

# You need to (somehow) get a different attempt ID for each job to avoid that
clash.
# Jobs need to set "mapreduce.fileoutputcommitter.cleanup.skipped" to true to
avoid a full cleanup of _temporary on job commit. That's got a risk of leaking
temp files after job failures. (See the sketch after this list.)

> Allow concurrent writes to different partitions with dynamic partition 
> overwrite
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-28945
>                 URL: https://issues.apache.org/jira/browse/SPARK-28945
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.3
>            Reporter: koert kuipers
>            Priority: Minor
>
> It is desirable to run concurrent jobs that write to different partitions 
> within the same baseDir using partitionBy and dynamic partitionOverwriteMode.
> See for example here:
> https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning
> Or the discussion here:
> https://github.com/delta-io/delta/issues/9
> This doesn't seem that difficult. I suspect the only changes needed are in 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has 
> a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling 
> all committer activity (committer.setupJob, committer.commitJob, etc.) when 
> dynamicPartitionOverwrite is true. 
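
A minimal Scala sketch of that quick test, assuming Spark 2.4's
HadoopMapReduceCommitProtocol (whose setupCommitter hook is protected) and its
internal "spark.sql.sources.commitProtocolClass" setting; the class name and
the swap-in-a-no-op-committer approach are illustrative, not a committed fix:

    import org.apache.hadoop.mapreduce.{JobContext, OutputCommitter, TaskAttemptContext}
    import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

    // When dynamicPartitionOverwrite is on, hand the protocol a no-op Hadoop
    // committer so no job ever touches the shared _temporary dir; the data
    // files still flow through the protocol's own per-job staging dir and
    // are moved into their partitions at job commit.
    class ConcurrentOverwriteCommitProtocol(
        jobId: String,
        path: String,
        dynamicPartitionOverwrite: Boolean)
      extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite) {

      override protected def setupCommitter(context: TaskAttemptContext): OutputCommitter =
        if (dynamicPartitionOverwrite) {
          new OutputCommitter {
            // job- and task-level no-ops: nothing is written via this committer
            override def setupJob(c: JobContext): Unit = ()
            override def setupTask(c: TaskAttemptContext): Unit = ()
            override def needsTaskCommit(c: TaskAttemptContext): Boolean = false
            override def commitTask(c: TaskAttemptContext): Unit = ()
            override def abortTask(c: TaskAttemptContext): Unit = ()
          }
        } else {
          super.setupCommitter(context)
        }
    }

    // Picked up (in this sketch) via the internal config:
    //   spark.conf.set("spark.sql.sources.commitProtocolClass",
    //     classOf[ConcurrentOverwriteCommitProtocol].getName)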


