[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhihai xu updated YARN-2566:
----------------------------
    Attachment: YARN-2566.005.patch

> IOException happen in startLocalizer of DefaultContainerExecutor due to not
> enough disk space for the first localDir.
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2566
>                 URL: https://issues.apache.org/jira/browse/YARN-2566
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-2566.000.patch, YARN-2566.001.patch,
> YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch,
> YARN-2566.005.patch
>
>
> startLocalizer in DefaultContainerExecutor uses only the first localDir to
> copy the token file. If that copy fails because the first localDir does not
> have enough disk space, localization fails even when the other localDirs
> have plenty of free space.
> We see the following error for this case:
> {code}
> 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
> java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
>     at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
>     at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
>     at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
>     at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
>     at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
>     at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>     at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed
> java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist
>     at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>     at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
>     at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
>     at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:344)
>     at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
>     at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
>     at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
>     at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
>     at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>     at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
>     at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
>     at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_000001 transitioned from LOCALIZING to LOCALIZATION_FAILED
> 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_000001 transitioned from LOCALIZATION_FAILED to DONE
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1410663092546_0004_01_000001 from application application_1410663092546_0004
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1410663092546_0004_01_000001 for log-aggregation
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1410663092546_0004
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
> 2014-09-13 23:33:25,188 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,188 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
> 2014-09-13 23:33:25,291 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1410663092546_0004_01_000001
> 2014-09-13 23:33:26,159 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed container container_1410663092546_0004_01_000001
> {code}
> The correct behavior is: if an IOException happens during the copy, try the
> next localDir. If the copy fails for every localDir, throw an exception.
> I will create a patch to fix this issue.
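
The fallback described above (try each localDir in turn, and fail only when every one has failed) can be sketched roughly as follows. This is a minimal illustration using plain java.nio.file rather than Hadoop's FileContext, and it is not the code in the attached patches; the class name LocalDirFallback, the method copyToFirstUsableDir, and the relativeDir parameter are hypothetical names introduced only for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical illustration only; not taken from YARN-2566.005.patch.
public class LocalDirFallback {

    /**
     * Copies src into relativeDir under the first entry of localDirs that
     * accepts it. An IOException on one localDir (for example, a full disk)
     * moves on to the next; an exception is thrown only if all of them fail.
     */
    static Path copyToFirstUsableDir(Path src, List<Path> localDirs, String relativeDir)
            throws IOException {
        IOException lastFailure = null;
        for (Path localDir : localDirs) {
            try {
                Path targetDir = localDir.resolve(relativeDir);
                Files.createDirectories(targetDir);
                Path target = targetDir.resolve(src.getFileName());
                Files.copy(src, target, StandardCopyOption.REPLACE_EXISTING);
                return target; // first localDir that worked
            } catch (IOException e) {
                lastFailure = e; // remember the failure and try the next localDir
            }
        }
        // Every localDir failed (or the list was empty): surface the last cause.
        throw new IOException("Unable to copy " + src + " to any of " + localDirs, lastFailure);
    }
}
```

Keeping the last IOException as the cause means the final error still shows why the last attempt failed, while the earlier failures could be logged as warnings inside the catch block if needed.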