[ https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307365#comment-17307365 ]
Bimalendu Choudhary commented on MAPREDUCE-7331:
------------------------------------------------

The temporary files get deleted at the end of commitJob, when we get the pendingjobAttemptPath and simply delete that path, so anything inside it gets deleted. I don't think the underlying task attempt paths get deleted individually. So in my case it does not matter whether the other application had the same MapReduce job ID or not. Even if they share the same job ID/task attempt path, they will be writing to different partition directories inside it. To me it looks like one application finishes first and ends up deleting the whole _temporary directory.

For now, the workaround we are trying out is configuring the job not to delete the _temporary directory at the end when we know that multiple Spark applications are using the same directory.

In my case we are running multiple Spark applications, each processing an individual partition of the same table, to make the processing fast. Since these are all separate partitions, there is no chance of data interference, but we still end up getting a FileNotFound exception.

> Make temporary directory used by FileOutputCommitter configurable
> -----------------------------------------------------------------
>
> Key: MAPREDUCE-7331
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 3.0.0
> Environment: CDH 6.2.1 Hadoop 3.0.0
> Reporter: Bimalendu Choudhary
> Priority: Major
>
> Spark SQL applications use FileOutputCommitter to commit and merge their
> files under a table directory. The hardcoded PENDING_DIR_NAME = _temporary
> directory results in multiple applications using the same temporary
> directory. This causes unwanted results: one application interferes with
> another application's temporary files, and one application can end up
> deleting the temporary files of another.
> There is no way right now for applications to have their own unique path
> to store temporary files and so avoid interference from other, totally
> independent applications. I think the temporary directory used by
> FileOutputCommitter should be made configurable, so that the caller can
> supply its own unique value as required and avoid having it deleted or
> overwritten by other applications.
> Something like:
> {quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
> public static final String PENDING_DIR_NAME =
> "mapreduce.fileoutputcommitter.tempdir";
> {quote}
>
> This can be used very efficiently by Spark applications to handle even
> stage failures, where temporary directories from previous attempts cause
> problems, and it can help in many other situations.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
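
To make the failure mode and the proposal concrete, here is a minimal, self-contained sketch that uses only java.nio.file. It is not Hadoop code and does not use the FileOutputCommitter API: the class PendingDirSketch, its helper methods, the plain Map standing in for a job configuration, and the example directory names are all hypothetical. The key "mapreduce.fileoutputcommitter.tempdir" is only the name proposed in this issue, not an existing Hadoop property. The sketch assumes cleanup recursively deletes the whole pending directory, as described in the comment above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical model of the behavior discussed in this issue; not Hadoop code.
class PendingDirSketch {
    static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
    // Key proposed in the issue description; assumed, not an existing property.
    static final String PENDING_DIR_NAME = "mapreduce.fileoutputcommitter.tempdir";

    // Resolve the pending directory from a stand-in "configuration" map.
    static Path pendingDir(Path outputDir, Map<String, String> conf) {
        return outputDir.resolve(
            conf.getOrDefault(PENDING_DIR_NAME, PENDING_DIR_NAME_DEFAULT));
    }

    // Simulate a task writing an in-progress file for one partition.
    static Path writeTaskFile(Path outputDir, Map<String, String> conf,
                              String partition) throws IOException {
        Path dir = pendingDir(outputDir, conf).resolve(partition);
        Files.createDirectories(dir);
        return Files.write(dir.resolve("part-00000"),
                           partition.getBytes(StandardCharsets.UTF_8));
    }

    // Simulate end-of-job cleanup: delete the whole pending directory.
    static void cleanupJob(Path outputDir, Map<String, String> conf)
            throws IOException {
        Path pending = pendingDir(outputDir, conf);
        if (!Files.exists(pending)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(pending)) {
            // Deepest paths first, so directories are empty when deleted.
            paths = walk.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    // Does application B's in-progress file survive application A's cleanup?
    static boolean survivesOtherCleanup(Map<String, String> confA,
                                        Map<String, String> confB)
            throws IOException {
        Path outputDir = Files.createTempDirectory("table");
        Path fileB = writeTaskFile(outputDir, confB, "part=b");
        writeTaskFile(outputDir, confA, "part=a");
        cleanupJob(outputDir, confA); // application A finishes first
        return Files.exists(fileB);
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> shared = Collections.emptyMap();
        // Both applications use _temporary: B's file is deleted by A.
        System.out.println(survivesOtherCleanup(shared, shared)); // false

        Map<String, String> a =
            Collections.singletonMap(PENDING_DIR_NAME, "_temporary-appA");
        Map<String, String> b =
            Collections.singletonMap(PENDING_DIR_NAME, "_temporary-appB");
        // Distinct pending directories: B's file is untouched.
        System.out.println(survivesOtherCleanup(a, b)); // true
    }
}
```

With the shared default name, application B's file disappears as soon as application A cleans up, even though the two wrote to different partition directories. With per-application names, each cleanup only touches its own directory.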