steveloughran commented on code in PR #6716:
URL: https://github.com/apache/hadoop/pull/6716#discussion_r1569228824

##########
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/AbstractJobOrTaskStage.java:
##########
@@ -582,19 +605,46 @@ protected final Path directoryMustExist(
    * Save a task manifest or summary. This will be done by
    * writing to a temp path and then renaming.
    * If the destination path exists: Delete it.
+   * This will retry so that a rename failure from abfs load or IO errors
+   * will not fail the task.
    * @param manifestData the manifest/success file
    * @param tempPath temp path for the initial save
    * @param finalPath final path for rename.
-   * @throws IOException failure to load/parse
+   * @throws IOException failure to rename after retries.
    */
   @SuppressWarnings("unchecked")
   protected final <T extends AbstractManifestData> void save(T manifestData,
       final Path tempPath,
       final Path finalPath) throws IOException {
-    LOG.trace("{}: save('{}, {}, {}')", getName(), manifestData, tempPath, finalPath);
-    trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
-        operations.save(manifestData, tempPath, true));
-    renameFile(tempPath, finalPath);
+    boolean success = false;
+    int failures = 0;
+    while (!success) {
+      try {
+        LOG.trace("{}: attempt {} save('{}, {}, {}')",
+            getName(), failures, manifestData, tempPath, finalPath);
+
+        trackDurationOfInvocation(getIOStatistics(), OP_SAVE_TASK_MANIFEST, () ->
+            operations.save(manifestData, tempPath, true));
+        renameFile(tempPath, finalPath);

Review Comment:
   any error raised during rename triggers fallback of
   * catch IOE
   * save temp file again
   * delete dest path
   * rename temp path to final path

   this is attempted a configurable number of times, with a sleep in between.
   no attempt to be clever about which IOEs are unrecoverable (permissions etc), just catch, log, retry
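
   For illustration only, the catch/log/retry sequence described above could be sketched roughly as follows. This is not the code in the PR: it uses `java.nio.file` rather than the Hadoop `FileSystem` API, and `RetryingSave`, `Renamer`, `maxAttempts` and `sleepMillis` are hypothetical names chosen for the sketch. It assumes every `IOException` is treated as retryable, as the comment says.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper illustrating save-then-rename with blind retries. */
final class RetryingSave {

  /** Rename step, abstracted so that failures can be injected. */
  @FunctionalInterface
  interface Renamer {
    void rename(Path source, Path dest) throws IOException;
  }

  private RetryingSave() {
  }

  /**
   * Write data to tempPath, delete finalPath if present, then rename.
   * Any IOException restarts the whole sequence after a sleep.
   * @return the number of failed attempts before the save succeeded
   * @throws IOException the last failure, once maxAttempts is reached
   */
  static int save(byte[] data, Path tempPath, Path finalPath,
      Renamer renamer, int maxAttempts, long sleepMillis)
      throws IOException, InterruptedException {
    int failures = 0;
    while (true) {
      try {
        // (re)write the temp file; Files.write overwrites an existing file
        Files.write(tempPath, data);
        // delete the destination so the rename does not collide with it
        Files.deleteIfExists(finalPath);
        renamer.rename(tempPath, finalPath);
        return failures;
      } catch (IOException e) {
        failures++;
        // no attempt to classify the exception: log it and retry
        System.err.printf("save attempt %d of %d failed: %s%n",
            failures, maxAttempts, e);
        if (failures >= maxAttempts) {
          throw e;
        }
        Thread.sleep(sleepMillis);
      }
    }
  }
}
```

   A caller would pass something like `(s, d) -> Files.move(s, d)` as the renamer. Because unrecoverable failures such as permission errors are retried too, they only surface after `maxAttempts` tries and the sleeps between them.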