[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307365#comment-17307365
 ] 

Bimalendu Choudhary commented on MAPREDUCE-7331:
------------------------------------------------

The temporary files get deleted at the end of commitJob, when we get the 
pendingJobAttemptsPath and simply delete that path, so anything inside it gets 
deleted. I don't think the underlying task attempt paths get deleted 
individually. So in my case it does not matter whether the other application 
had the same MapReduce job ID or not. Even if they share the same 
job ID/task attempt path, they will be writing to different partition 
directories inside it. 

To me it looks like one application finishes first and ends up deleting the 
whole _temporary directory. For now, the workaround we are trying out is 
configuring the job not to delete the _temporary directory at the end when we 
know that multiple Spark applications are using the same directory.

In my case we are running multiple Spark applications, each processing an 
individual partition of the same table, to make the processing faster. Since 
all of them write to separate partitions, there is no chance of data 
interference. But we end up getting a FileNotFoundException.
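
The failure mode described above can be reproduced in miniature on a local 
file system. The following is only a sketch, not Hadoop code: it uses 
java.nio.file instead of the Hadoop FileSystem API, and the class name, 
partition names, and the "0" attempt directory are assumptions chosen for 
illustration. Two "applications" write pending files for different partitions 
under the same _temporary directory, and the first one to finish removes the 
whole directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical demo class; it only imitates the directory layout, not the committer itself.
public class SharedTemporaryDirDemo {

    // Recursively delete a tree, imitating the single recursive delete of the pending path.
    static void deleteRecursively(Path root) {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order so that children are deleted before their parents.
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Each application writes a pending file under <table>/_temporary/0/<partition>/.
    static Path writePendingFile(Path tableDir, String partition) {
        try {
            Path dir = tableDir.resolve("_temporary").resolve("0").resolve(partition);
            Files.createDirectories(dir);
            return Files.write(dir.resolve("part-00000"), "rows".getBytes());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if application B's pending file is still there after A cleans up.
    static boolean appBFileSurvives() {
        Path tableDir;
        try {
            tableDir = Files.createTempDirectory("table");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            writePendingFile(tableDir, "part_a");              // application A
            Path fileB = writePendingFile(tableDir, "part_b"); // application B
            // Application A finishes first and removes the shared _temporary directory.
            deleteRecursively(tableDir.resolve("_temporary"));
            return Files.exists(fileB);
        } finally {
            deleteRecursively(tableDir);
        }
    }

    public static void main(String[] args) {
        System.out.println("B's pending file survives: " + appBFileSurvives());
    }
}
```

Running main prints "B's pending file survives: false": although the two 
applications never touch each other's partition directory, B's pending output 
is gone once A deletes the shared parent, which is consistent with B later 
failing with a FileNotFoundException.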

> Make temporary directory used by FileOutputCommitter configurable
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-7331
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 3.0.0
>         Environment: CDH 6.2.1 Hadoop 3.0.0
>            Reporter: Bimalendu Choudhary
>            Priority: Major
>
> Spark SQL applications use FileOutputCommitter to commit and merge their 
> files under a table directory. The hardcoded PENDING_DIR_NAME = _temporary 
> directory results in multiple applications using the same temporary 
> directory. This causes unwanted results, with one application interfering 
> with another application's temporary files, and one application ending up 
> deleting the temporary files of another. There is no way right now for 
> applications to have their own unique path for storing temporary files to 
> avoid interference from other, totally independent applications. I think the 
> temporary directory used by FileOutputCommitter should be made configurable, 
> so that the caller can supply its own unique value as required and avoid it 
> getting deleted or overwritten by other applications. 
> Something like:
> {quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
>  public static final String PENDING_DIR_NAME_KEY =
>  "mapreduce.fileoutputcommitter.tempdir";
> {quote}
>  
> This could be used very effectively by Spark applications, for example to 
> handle stage failures where temporary directories from previous attempts 
> cause problems, and could help in many other situations. 
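
A minimal sketch of how the proposed setting might be resolved is shown below. 
It is not the FileOutputCommitter implementation: a plain java.util.Map stands 
in for the Hadoop Configuration object, and the class and method names are 
hypothetical. Only the "_temporary" default and the 
"mapreduce.fileoutputcommitter.tempdir" key come from the proposal above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; a Map stands in for the Hadoop Configuration object.
public class PendingDirSketch {

    // Default directory name and proposed configuration key, as given in the issue description.
    static final String DEFAULT_NAME = "_temporary";
    static final String CONFIG_KEY = "mapreduce.fileoutputcommitter.tempdir";

    // Use the configured name if present, otherwise fall back to the current hardcoded value.
    static String pendingDirName(Map<String, String> conf) {
        return conf.getOrDefault(CONFIG_KEY, DEFAULT_NAME);
    }

    // Build the pending directory path under the job's output directory.
    static String pendingJobAttemptsPath(String outputDir, Map<String, String> conf) {
        return outputDir + "/" + pendingDirName(conf);
    }

    public static void main(String[] args) {
        Map<String, String> appA = new HashMap<>();
        Map<String, String> appB = new HashMap<>();
        // Each application supplies its own unique name (the values here are made up).
        appA.put(CONFIG_KEY, "_temporary_appA");
        appB.put(CONFIG_KEY, "_temporary_appB");

        System.out.println(pendingJobAttemptsPath("/warehouse/table", appA));
        System.out.println(pendingJobAttemptsPath("/warehouse/table", appB));
        // With nothing configured, the behavior is unchanged.
        System.out.println(pendingJobAttemptsPath("/warehouse/table", new HashMap<>()));
    }
}
```

With distinct values per application, a recursive delete of one application's 
pending directory would no longer touch the other's files, while jobs that set 
nothing would keep the existing _temporary behavior.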



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
