[ 
https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723385#comment-13723385
 ] 

Jian He commented on YARN-993:
------------------------------

[~prophy999] If you are running 2.0.5-alpha, then to test RM restart you need 
to manually Ctrl-C the client command line once you see the message saying the 
job has been submitted, because MR will clean up the staging dir if the RM is 
not available.
This problem has been fixed in YARN-513.
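
The manual Ctrl-C step described above could also be scripted. The following is 
only a minimal sketch: the helper name interrupt_after_marker and the marker 
text are hypothetical, since the exact "job submitted" message printed by the 
2.0.5-alpha client is not quoted in this thread. It runs a client command, 
watches its output, and sends SIGINT (the signal Ctrl-C sends) as soon as a 
line containing the marker appears.

```python
import signal
import subprocess
import sys


def interrupt_after_marker(cmd, marker):
    """Run cmd and send SIGINT once an output line contains marker.

    Hypothetical helper, not part of Hadoop. Returns True if the marker
    was seen and the signal was sent, False if the process ended first.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr in case the client logs there
        text=True,
    )
    seen = False
    for line in proc.stdout:
        sys.stdout.write(line)  # echo the client output as it arrives
        if marker in line:
            seen = True
            proc.send_signal(signal.SIGINT)  # same signal as Ctrl-C
            break
    proc.stdout.close()
    proc.wait()
    return seen
```

Here cmd would be the job submission command as an argument list, and marker 
whatever text your client actually prints on submission. Both have to be 
supplied by the caller; nothing here is specific to a Hadoop version.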
                
> job can not recovery after restart resourcemanager
> --------------------------------------------------
>
>                 Key: YARN-993
>                 URL: https://issues.apache.org/jira/browse/YARN-993
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.5-alpha
>         Environment: CentOS5.3 JDK1.7.0_11
>            Reporter: prophy Yan
>            Priority: Critical
>
> Recently I tested the job recovery function in the YARN framework, but it 
> failed.
> First, I ran the wordcount example program, and then I ran kill -9 on the 
> resourcemanager process on the server when the wordcount job was at map 100%.
> The job exited with an error within minutes.
> Second, I restarted the resourcemanager on the server using the 
> 'start-yarn.sh' command, but the failed job (wordcount) could not continue.
> The YARN log says "file not exist!"
> Here is the YARN log:
> 2013-07-23 16:05:21,472 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done 
> launching container Container: [ContainerId: 
> container_1374564764970_0001_02_000001, NodeId: mv8.mzhen.cn:52117, 
> NodeHttpAddress: mv8.mzhen.cn:8042, Resource: <memory:2048, vCores:1>, 
> Priority: 0, State: NEW, Token: null, Status: container_id {, app_attempt_id 
> {, application_id {, id: 1, cluster_timestamp: 1374564764970, }, attemptId: 
> 2, }, id: 1, }, state: C_NEW, ] for AM appattempt_1374564764970_0001_000002
> 2013-07-23 16:05:21,473 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1374564764970_0001_000002 State change from ALLOCATED to LAUNCHED
> 2013-07-23 16:05:21,925 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1374564764970_0001_000002 State change from LAUNCHED to FAILED
> 2013-07-23 16:05:21,925 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
> application_1374564764970_0001 failed 1 times due to AM Container for 
> appattempt_1374564764970_0001_000002 exited with  exitCode: -1000 due to: 
> RemoteTrace:
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:815)
>         at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176)
>         at 
> org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:51)
>         at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:284)
>         at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:282)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:280)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:51)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:722)
>  at LocalTrace:
>         org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: 
> File does not exist: 
> hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
> at 
> org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:819)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:491)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:218)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
>         at 
> org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
> .Failing this attempt.. Failing the application.
> 2013-07-23 16:05:21,935 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1374564764970_0001 State change from ACCEPTED to FAILED
> 2013-07-23 16:05:21,937 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=supertool   
>      OPERATION=Application Finished - Failed TARGET=RMAppManager     
> RESULT=FAILURE  DESCRIPTION=App failed with state: FAILED       
> PERMISSIONS=Application application_1374564764970_0001 failed 1 times due to 
> AM Container for appattempt_1374564764970_0001_000002 exited with  exitCode: 
> -1000 due to: RemoteTrace:
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
> This is the log from the YARN log file after I restarted the resourcemanager.

