[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139927#comment-14139927 ]
Hadoop QA commented on YARN-2566:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12669893/YARN-2566.000.patch
against trunk revision 6434572.

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5037//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5037//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5037//console

This message is automatically generated.

> IOException happen in startLocalizer of DefaultContainerExecutor due to not
> enough disk space for the first localDir.
> ---------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-2566
> URL: https://issues.apache.org/jira/browse/YARN-2566
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.5.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2566.000.patch
>
>
> startLocalizer in DefaultContainerExecutor uses only the first localDir to
> copy the token file. If that copy fails because the first localDir does not
> have enough disk space, localization fails even when there is plenty of disk
> space in the other localDirs. We see the following errors in this case:
> {code}
> 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
> java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
>     at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
>     at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
>     at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
>     at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
>     at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
>     at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>     at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed
> java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist
>     at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>     at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
>     at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
>     at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:344)
>     at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
>     at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
>     at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
>     at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
>     at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>     at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
>     at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
>     at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_000001 transitioned from LOCALIZING to LOCALIZATION_FAILED
> 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_000001 transitioned from LOCALIZATION_FAILED to DONE
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1410663092546_0004_01_000001 from application application_1410663092546_0004
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1410663092546_0004_01_000001 for log-aggregation
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1410663092546_0004
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
> 2014-09-13 23:33:25,188 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,188 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
> 2014-09-13 23:33:25,291 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1410663092546_0004_01_000001
> 2014-09-13 23:33:26,159 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed container container_1410663092546_0004_01_000001
> {code}
> The correct behavior: if an IOException occurs during the copy, try the next
> localDir. Throw an exception only if the copy fails for all of the localDirs.
> I will create a patch to fix this issue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
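
For readers of the archive, the fallback proposed in the quoted description (try the next localDir when the copy fails, and throw only when every localDir has failed) can be sketched roughly as follows. This is not the attached YARN-2566.000.patch and does not use Hadoop's FileContext API; it is a minimal illustration built on plain java.nio, and the class name LocalDirFallback, the method copyToFirstUsableDir, and its parameters are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper, not part of Hadoop: illustrates "try each localDir in turn".
final class LocalDirFallback {

    // Copies source into the first directory in localDirs that accepts it.
    // Returns the path of the copy; throws only if every directory fails.
    static Path copyToFirstUsableDir(Path source, List<Path> localDirs, String fileName)
            throws IOException {
        IOException failure =
                new IOException("Could not copy " + source + " to any of " + localDirs);
        for (Path dir : localDirs) {
            try {
                Files.createDirectories(dir);            // may fail, e.g. on a full disk
                Path target = dir.resolve(fileName);
                Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                return target;                           // first success wins
            } catch (IOException e) {
                failure.addSuppressed(e);                // remember why this dir failed
            }
        }
        throw failure;                                   // every localDir failed
    }
}
```

Keeping each per-directory failure as a suppressed exception means the final error still shows why every localDir was rejected, which is the information missing from the log above.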