[ https://issues.apache.org/jira/browse/MAPREDUCE-7331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307252#comment-17307252 ]
Steve Loughran commented on MAPREDUCE-7331:
-------------------------------------------

Does the Spark version you have contain the fix "[SPARK-33402][CORE] Jobs launched in same second have duplicate MapReduce JobIDs"? That may be the underlying problem: you have more than one stage reusing the same job ID, so they are all using the same job directory under _temporary.

Apply that fix first before worrying about going anywhere near FileOutputCommitter. We are scared of changes there as it is a critical part of so many applications.

> Make temporary directory used by FileOutputCommitter configurable
> -----------------------------------------------------------------
>
>                 Key: MAPREDUCE-7331
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7331
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>   Affects Versions: 3.0.0
>         Environment: CDH 6.2.1 Hadoop 3.0.0
>            Reporter: Bimalendu Choudhary
>            Priority: Major
>
> Spark SQL applications use FileOutputCommitter to commit and merge their files
> under a table directory. The hardcoded PENDING_DIR_NAME = _temporary
> directory results in multiple applications using the same temporary directory.
> This causes one application to interfere with another application's
> temporary files, and one application can end up deleting the temporary
> files of another. There is currently no way for applications to have
> their own unique path for temporary files to avoid interference from
> other, totally independent applications.
> I think the temporary directory
> used by FileOutputCommitter should be made configurable, so that the
> caller can supply its own unique value as required and keep
> it from being deleted or overwritten by other applications.
> Something like:
> {quote}public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
> public static final String PENDING_DIR_NAME_KEY =
> "mapreduce.fileoutputcommitter.tempdir";
> {quote}
>
> Spark applications could use this to handle even stage
> failures, where temporary directories left by previous attempts cause
> problems, and it could help in many other situations.
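
For illustration only, the proposal in the description could be sketched roughly as below. This is not FileOutputCommitter code and not a patch: the configuration is modelled as a plain Map so the example runs without Hadoop, PENDING_DIR_NAME_KEY is a hypothetical name for the constant holding the configuration key, and the key "mapreduce.fileoutputcommitter.tempdir" is taken from the issue text and should be assumed not to exist in any Hadoop release. The class and method names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a configurable pending directory name; all names here are hypothetical. */
class PendingDirSketch {
    // Value that FileOutputCommitter hardcodes today, kept as the default.
    public static final String PENDING_DIR_NAME_DEFAULT = "_temporary";
    // Key proposed in the issue; assumed, not an existing Hadoop setting.
    public static final String PENDING_DIR_NAME_KEY =
        "mapreduce.fileoutputcommitter.tempdir";

    /** Returns the configured directory name, or the default if unset or blank. */
    public static String pendingDirName(Map<String, String> conf) {
        String name = conf.get(PENDING_DIR_NAME_KEY);
        if (name == null || name.trim().isEmpty()) {
            return PENDING_DIR_NAME_DEFAULT;
        }
        return name.trim();
    }

    /** Builds outputDir/pendingDirName for the given configuration. */
    public static String pendingPath(String outputDir, Map<String, String> conf) {
        return outputDir + "/" + pendingDirName(conf);
    }

    public static void main(String[] args) {
        // Application A sets nothing and falls back to the shared default.
        Map<String, String> appA = new HashMap<>();
        // Application B supplies its own unique directory name.
        Map<String, String> appB = new HashMap<>();
        appB.put(PENDING_DIR_NAME_KEY, "_temporary_appB");

        System.out.println(pendingPath("/warehouse/t", appA));
        System.out.println(pendingPath("/warehouse/t", appB));
    }
}
```

With a setting like this, two applications writing to the same table directory would only collide if they chose the same name. It would not help where the collision comes from duplicate job IDs inside a single application, which is the case the comment above points at.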