[jira] [Commented] (MAPREDUCE-7331) Make temporary directory used by FileOutputCommitter configurable

Steve Loughran (Jira) Mon, 03 May 2021 04:07:19 -0700


    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338320#comment-17338320
 ]


Steve Loughran commented on MAPREDUCE-7331:
-------------------------------------------

Yeah, you are right: it looks like it gets rid of _temporary.

However, 

# The v1 commit algorithm is not designed to commit work concurrently. It makes 
a big assumption about rename vs. merging subdirectories: "if a directory does 
not exist in the destination, then we can rename a task attempt's directory 
into place immediately". If more than one job is committing to the same 
destination, the two may clash and the output would be "undefined".
# The v2 commit algorithm does handle concurrent promotion of task attempt data 
into the destination, but it isn't resilient to failures during task commit. 
So it is differently flawed.
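
To make the v1 assumption concrete, here is a minimal sketch of a "rename or merge" step. It is not the FileOutputCommitter code: it uses java.nio.file on a local filesystem instead of the Hadoop FileSystem API, and the class and method names (MergeSketch, mergePaths) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class MergeSketch {
    /**
     * Hypothetical simplification of a v1-style merge: rename a source
     * directory into place if the destination is missing, otherwise
     * descend and merge child by child.
     */
    static void mergePaths(Path from, Path to) throws IOException {
        if (Files.isRegularFile(from)) {
            // Files simply replace whatever file is already at the destination.
            Files.move(from, to, StandardCopyOption.REPLACE_EXISTING);
        } else if (Files.isDirectory(from)) {
            if (!Files.exists(to)) {
                // Check-then-act: nothing stops another committer from
                // creating 'to' between the exists() check and this rename.
                Files.move(from, to);
            } else {
                // Snapshot the children first, then merge each one.
                List<Path> children;
                try (Stream<Path> listing = Files.list(from)) {
                    children = listing.collect(Collectors.toList());
                }
                for (Path child : children) {
                    mergePaths(child, to.resolve(child.getFileName()));
                }
            }
        }
    }
}
```

Run sequentially, the second call merges into the tree created by the first. The exists-then-rename pair in the directory branch is not atomic, so two job commits racing through it can both see a missing destination, and what happens next depends on the filesystem's rename semantics.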

I'm not going to put any changes into the FileOutputCommitter because it's such 
a critical piece of code that to go near it is risky.

I am adding a new committer in MAPREDUCE-7341.
I am adding an option for it to delete only its own job attempt, on the basis 
that it's not going to support multiple job attempts.

Oh, and it should be safe to commit to different partitions, excluding the 
special case "job 1 is writing a file to what job 2 expects to be a directory".
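
As a rough illustration of that special case, the sketch below reproduces it on a local filesystem with java.nio.file rather than on HDFS or an object store. The class and method names (PartitionClashSketch, tryCreatePartitionDir) are hypothetical and are not part of any committer.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

class PartitionClashSketch {
    /**
     * Returns true if 'dir' exists as a directory afterwards, false if
     * another writer has already put a regular file at that path.
     */
    static boolean tryCreatePartitionDir(Path dir) throws IOException {
        try {
            Files.createDirectories(dir);
            return true;
        } catch (FileAlreadyExistsException e) {
            // The path exists but is not a directory: the two jobs
            // disagree about what lives at this location.
            return false;
        }
    }
}
```

If job 1 has already written a file at the path, job 2's attempt to use it as a partition directory fails, while an untouched sibling partition path is created without trouble.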





> Make temporary directory used by FileOutputCommitter configurable
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-7331
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 3.0.0
>         Environment: CDH 6.2.1 Hadoop 3.0.0
>            Reporter: Bimalendu Choudhary
>            Priority: Major
>
> Spark SQL applications use FileOutputCommitter to commit and merge their files 
> under a table directory. The hardcoded PENDING_DIR_NAME = _temporary 
> directory results in multiple applications using the same temporary directory. 
> This causes unwanted results: one application interferes with another 
> application's temporary files, or even ends up deleting them. There is 
> currently no way for applications to use their own unique path for temporary 
> files and so avoid interference from other, totally independent applications. 
> I think the temporary directory used by FileOutputCommitter should be made 
> configurable, so that the caller can supply its own unique value as required 
> and avoid having it deleted or overwritten by other applications. 
> Something like:
> {quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
>  public static final String PENDING_DIR_NAME =
>  "mapreduce.fileoutputcommitter.tempdir";
> {quote}
>  
> This can be used very efficiently by Spark applications to handle even stage 
> failures, where temporary directories from previous attempts cause problems, 
> and can help in many other situations. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
