[ https://issues.apache.org/jira/browse/YARN-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhihai xu updated YARN-3727:
----------------------------
    Description:
For better error recovery, check if the directory exists before using it for localization.
We saw the following localization failure, caused by existing cache directories.
{code}
2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://XXXX/XXXXX/libjars/1234.jar, 1431395961545, FILE, null }, Rename cannot overwrite non empty destination directory /XXXX/8/yarn/nm/usercache/XXXX/filecache/21637
2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://XXXX/XXXXX/libjars/1234.jar(->/XXXX/8/yarn/nm/usercache/XXXX/filecache/21637/1234.jar) transitioned from DOWNLOADING to FAILED
{code}
The real cause of this failure may be a disk failure, a LevelDB operation failure for {{startResourceLocalization}}/{{finishResourceLocalization}}, or something else.
I wonder whether we can add error recovery code that avoids the localization failure by not using existing cache directories for localization.
The exception happened at {{files.rename(dst_work, destDirPath, Rename.OVERWRITE)}} in FSDownload#call. Based on the following code, after the exception, the existing cache directory used by {{LocalizedResource}} will be deleted.
{code}
try {
.........
  files.rename(dst_work, destDirPath, Rename.OVERWRITE);
} catch (Exception e) {
  try {
    files.delete(destDirPath, true);
  } catch (IOException ignore) {
  }
  throw e;
} finally {
{code}
Since the conflicting local directory will be deleted after the localization failure, I think it would be better to check whether the directory exists before using it for localization, so that the failure is avoided.

  was:
For better error recovery, check if the directory exists before using it for localization.
We saw the following localization failure, caused by existing cache directories.
{code}
2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://XXXX/XXXXX/libjars/1234.jar, 1431395961545, FILE, null }, Rename cannot overwrite non empty destination directory /XXXX/8/yarn/nm/usercache/XXXX/filecache/21637
2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://XXXX/XXXXX/libjars/1234.jar(->/XXXX/8/yarn/nm/usercache/XXXX/filecache/21637/1234.jar) transitioned from DOWNLOADING to FAILED
{code}
The real cause of this failure may be a disk failure, a LevelDB operation failure for {{startResourceLocalization}}/{{finishResourceLocalization}}, or something else.
I wonder whether we can add error recovery code that avoids the localization failure by not using existing cache directories for localization.
The exception happened at {{files.rename(dst_work, destDirPath, Rename.OVERWRITE)}} in FSDownload#call. Based on the following code, after the exception, the existing cache directory used by {{LocalizedResource}} will be deleted.
{{code}}
try {
.........
  files.rename(dst_work, destDirPath, Rename.OVERWRITE);
} catch (Exception e) {
  try {
    files.delete(destDirPath, true);
  } catch (IOException ignore) {
  }
  throw e;
} finally {
{{code}}
Since the conflicting local directory will be deleted after the localization failure, I think it would be better to check whether the directory exists before using it for localization, so that the failure is avoided.


> For better error recovery, check if the directory exists before using it for
> localization.
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-3727
>                 URL: https://issues.apache.org/jira/browse/YARN-3727
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>
> For better error recovery, check if the directory exists before using it for
> localization.
> We saw the following localization failure, caused by existing cache
> directories.
> {code}
> 2015-05-11 18:59:59,756 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
> DEBUG: FAILED { hdfs://XXXX/XXXXX/libjars/1234.jar, 1431395961545, FILE,
> null }, Rename cannot overwrite non empty destination directory
> /XXXX/8/yarn/nm/usercache/XXXX/filecache/21637
> 2015-05-11 18:59:59,756 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
> Resource
> hdfs://XXXX/XXXXX/libjars/1234.jar(->/XXXX/8/yarn/nm/usercache/XXXX/filecache/21637/1234.jar)
> transitioned from DOWNLOADING to FAILED
> {code}
> The real cause of this failure may be a disk failure, a LevelDB operation
> failure for {{startResourceLocalization}}/{{finishResourceLocalization}}, or
> something else.
> I wonder whether we can add error recovery code that avoids the localization
> failure by not using existing cache directories for localization.
> The exception happened at {{files.rename(dst_work, destDirPath,
> Rename.OVERWRITE)}} in FSDownload#call. Based on the following code, after
> the exception, the existing cache directory used by {{LocalizedResource}}
> will be deleted.
> {code}
> try {
> .........
>   files.rename(dst_work, destDirPath, Rename.OVERWRITE);
> } catch (Exception e) {
>   try {
>     files.delete(destDirPath, true);
>   } catch (IOException ignore) {
>   }
>   throw e;
> } finally {
> {code}
> Since the conflicting local directory will be deleted after the localization
> failure, I think it would be better to check whether the directory exists
> before using it for localization, so that the failure is avoided.
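
The proposed check can be illustrated with a minimal sketch. It is not the NodeManager code and not an actual patch for this issue: the class name {{UniqueCacheDir}}, the method {{nextFreeDir}} and the counter are hypothetical, and plain {{java.nio.file}} stands in for the Hadoop {{FileContext}} API. The idea is only to skip any numbered cache directory that already exists, so that the later rename never targets a non-empty destination.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, for illustration only: hands out numbered cache
// directories under a root and skips numbers whose directory already exists.
final class UniqueCacheDir {
    private final Path cacheRoot;
    // Assumed stand-in for whatever unique-number generator the caller uses.
    private final AtomicLong uniqueNumber;

    UniqueCacheDir(Path cacheRoot, long lastUsedNumber) {
        this.cacheRoot = cacheRoot;
        this.uniqueNumber = new AtomicLong(lastUsedNumber);
    }

    // Returns the next numbered path under cacheRoot that does not exist yet.
    // The directory itself is not created here.
    Path nextFreeDir() {
        while (true) {
            Path candidate =
                cacheRoot.resolve(Long.toString(uniqueNumber.incrementAndGet()));
            if (!Files.exists(candidate)) {
                return candidate;
            }
            // A leftover directory occupies this number; try the next one.
        }
    }
}
```

With a leftover directory such as {{filecache/21637}} on disk and a counter whose last used value is 21636, {{nextFreeDir()}} would return {{filecache/21638}} instead of the conflicting path. The check and the later use are not atomic, so this only reduces the chance of a conflict; it does not replace the existing error handling around the rename.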