[ https://issues.apache.org/jira/browse/YARN-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503832#comment-13503832 ]
Jason Lowe commented on YARN-244:
---------------------------------

Could you provide a bit more detail from the AM logs when this occurs? I'm not able to reproduce this with a sleep job and manually killing the AM to simulate failure. Normally the AM tries to determine if it is the last attempt and only deletes the files if it is convinced there will be no more attempts. If you could provide steps to reproduce, or details from the AM logs showing why it decided to remove the staging directory, that would help clarify what's going on in this case.

> Application Master Retries fail due to FileNotFoundException
> ------------------------------------------------------------
>
>                 Key: YARN-244
>                 URL: https://issues.apache.org/jira/browse/YARN-244
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications
>    Affects Versions: 2.0.2-alpha, 2.0.1-alpha
>            Reporter: Devaraj K
>            Assignee: Devaraj K
>            Priority: Blocker
>
> Application attempt1 is deleting the job related files and these are not
> present in the HDFS for following retries.
> {code:xml}
> Application application_1353724754961_0001 failed 4 times due to AM Container for appattempt_1353724754961_0001_000004 exited with exitCode: -1000 due to: RemoteTrace:
> java.io.FileNotFoundException: File does not exist: hdfs://hacluster:8020/tmp/hadoop-yarn/staging/mapred/.staging/job_1353724754961_0001/appTokens
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:752)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:88)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
>   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
>   at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> at LocalTrace:
>   org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: File does not exist: hdfs://hacluster:8020/tmp/hadoop-yarn/staging/mapred/.staging/job_1353724754961_0001/appTokens
>   at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
>   at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:822)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:492)
>   at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:221)
>   at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
>   at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:924)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:396)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
> .Failing this attempt.. Failing the application.
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira