[jira] [Commented] (MAPREDUCE-7331) Make temporary directory used by FileOutputCommitter configurable

Steve Loughran (Jira) Tue, 23 Mar 2021 10:21:06 -0700


    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307252#comment-17307252
 ]


Steve Loughran commented on MAPREDUCE-7331:
-------------------------------------------

Does the spark version you have contain the fix  [SPARK-33402][CORE] Jobs 
launched in same second have duplicate MapReduce JobIDs ?

As that may the underlying problem: you have >1 stage reusing the same jobID, 
so are using the same job directory under _temporary.

Apply that fix first before worrying about going anywhere near 
FileOutputCommitter. We are scared of changes there as it is a critical part of 
so many applications. 

> Make temporary directory used by FileOutputCommitter configurable
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-7331
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 3.0.0
>         Environment: CDH 6.2.1 Hadoop 3.0.0
>            Reporter: Bimalendu Choudhary
>            Priority: Major
>
> Spark SQL applications uses FileOutputCommitter to commit and merge its files 
> under a table directory. The hardcoded PENDING_DIR_NAME = _temporary 
> directory results in multiple application using the same temporary directory. 
> This casues unwanted results of one application interfering with other 
> applications temporary files. Also one application ending up deleting 
> temporary files of other. There is no way right now for applications to have 
> there unique path to store the temporary files to avoid any interference from 
> other totally independent applications.  I think the temporary directory 
> being used by FileOutputCommitter should be made configurable to let the 
> caller call with with its own unique value as per the requirement and avoid 
> it getting deleted or overwritten by other applications 
> Something like:
> {quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
>  public static final String PENDING_DIR_NAME_DEFAULT =
>  "mapreduce.fileoutputcommitter.tempdir";
> {quote}
>  
> This can be used very efficiently by Spark applications to handle even stage 
> failures where temporary directories from previous attempts cause problem and 
> can help in so many situations. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org

[jira] [Commented] (MAPREDUCE-7331) Make temporary directory used by FileOutputCommitter configurable

Reply via email to