[ 
https://issues.apache.org/jira/browse/YARN-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503832#comment-13503832
 ] 

Jason Lowe commented on YARN-244:
---------------------------------

Could you provide a bit more detail from the AM logs when this occurs?  I'm not 
able to reproduce this with a sleep job and manually killing the AM to simulate 
failure.  Normally the AM tries to determine if it is the last attempt and only 
deletes the files if it is convinced there will be no more attempts.  If you 
could provide steps to reproduce, or details from the AM logs showing why it 
decided to remove the staging directory, that would help clarify what's going 
on in this case.
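
For readers following the thread, the last-attempt check described above can be 
sketched roughly as follows. This is only an illustration of the idea, not the 
actual MapReduce AM code: the class name, method names, and the simple 
attempt-counter comparison are all assumptions made for the example.

```java
/**
 * Hypothetical sketch of the "only clean up on the last attempt" rule.
 * None of these names come from the Hadoop code base.
 */
public final class StagingCleanupSketch {
    private StagingCleanupSketch() {}

    /** Assumption: attempts are numbered from 1 and capped at maxAttempts. */
    public static boolean isLastAttempt(int attemptNumber, int maxAttempts) {
        return attemptNumber >= maxAttempts;
    }

    /**
     * The staging directory should survive any attempt that may be retried,
     * because later attempts need to localize files such as appTokens from it.
     */
    public static boolean shouldDeleteStagingDir(int attemptNumber, int maxAttempts) {
        return isLastAttempt(attemptNumber, maxAttempts);
    }

    public static void main(String[] args) {
        int maxAttempts = 4; // the report shows the application failing 4 times
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            System.out.println("attempt " + attempt + ": delete staging dir = "
                    + shouldDeleteStagingDir(attempt, maxAttempts));
        }
    }
}
```

Under this rule, a first attempt that removes the staging directory while 
retries remain would produce exactly the FileNotFoundException quoted below, 
which is why the AM logs from that attempt matter.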
                
> Application Master Retries fail due to FileNotFoundException
> ------------------------------------------------------------
>
>                 Key: YARN-244
>                 URL: https://issues.apache.org/jira/browse/YARN-244
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications
>    Affects Versions: 2.0.2-alpha, 2.0.1-alpha
>            Reporter: Devaraj K
>            Assignee: Devaraj K
>            Priority: Blocker
>
> Application attempt 1 deletes the job-related files, so they are not 
> present in HDFS for the following retries.
> {code:xml}
> Application application_1353724754961_0001 failed 4 times due to AM Container for appattempt_1353724754961_0001_000004 exited with exitCode: -1000 due to:
> RemoteTrace: java.io.FileNotFoundException: File does not exist: hdfs://hacluster:8020/tmp/hadoop-yarn/staging/mapred/.staging/job_1353724754961_0001/appTokens
>  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:752)
>  at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:88)
>  at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
>  at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157)
>  at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:396)
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>  at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153)
>  at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
>  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>  at java.lang.Thread.run(Thread.java:662)
>  at LocalTrace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: File does not exist: hdfs://hacluster:8020/tmp/hadoop-yarn/staging/mapred/.staging/job_1353724754961_0001/appTokens
>  at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
>  at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
>  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:822)
>  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:492)
>  at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:221)
>  at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
>  at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:924)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
>  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:396)
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686)
> .Failing this attempt.. Failing the application.
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
