[jira] [Commented] (YARN-993) job can not recovery after restart resourcemanager
[ https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739219#comment-13739219 ]

prophy Yan commented on YARN-993:
---------------------------------

Jian He, I have tried the patch file listed on YARN-513, but some errors occur when I apply it. My test version is Hadoop 2.0.5-alpha; can this patch work with that version? Thank you.

> job can not recovery after restart resourcemanager
> ---------------------------------------------------
>
>                 Key: YARN-993
>                 URL: https://issues.apache.org/jira/browse/YARN-993
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.5-alpha
>         Environment: CentOS 5.3, JDK 1.7.0_11
>            Reporter: prophy Yan
>            Priority: Critical
>
> Recently I tested the job-recovery function of the YARN framework, but it failed.
> First, I ran the wordcount example program and then killed the ResourceManager process on the server (kill -9) once wordcount reached map 100%. The job exited with an error within a few minutes.
> Second, I restarted the ResourceManager on the server with the 'start-yarn.sh' command, but the failed job (wordcount) could not continue. The YARN log says the file does not exist!
> Here is the YARN log:
> 2013-07-23 16:05:21,472 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1374564764970_0001_02_01, NodeId: mv8.mzhen.cn:52117, NodeHttpAddress: mv8.mzhen.cn:8042, Resource: memory:2048, vCores:1, Priority: 0, State: NEW, Token: null, Status: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1374564764970, }, attemptId: 2, }, id: 1, }, state: C_NEW, ] for AM appattempt_1374564764970_0001_02
> 2013-07-23 16:05:21,473 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1374564764970_0001_02 State change from ALLOCATED to LAUNCHED
> 2013-07-23 16:05:21,925 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1374564764970_0001_02 State change from LAUNCHED to FAILED
> 2013-07-23 16:05:21,925 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1374564764970_0001 failed 1 times due to AM Container for appattempt_1374564764970_0001_02 exited with exitCode: -1000 due to: RemoteTrace:
> java.io.FileNotFoundException: File does not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:815)
>         at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176)
>         at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:51)
>         at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:284)
>         at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:282)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:280)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:51)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:722)
> at LocalTrace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: File does not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
>         at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
>         at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:819)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:491)
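The failing step in the trace above is the NodeManager's localizer calling getFileStatus() on the appTokens file in the job's staging directory. Below is a minimal probe of that same lookup through the public Hadoop FileSystem API; the path is copied from the log, and the class name is made up for illustration. Run after the staging dir has been cleaned up, it should fail with the same FileNotFoundException:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StagingFileProbe {
        public static void main(String[] args) throws Exception {
            // Path copied from the log above; substitute your own job id.
            Path appTokens = new Path("hdfs://ns1:8020/tmp/hadoop-yarn/staging/"
                    + "supertool/.staging/job_1374564764970_0001/appTokens");

            FileSystem fs = appTokens.getFileSystem(new Configuration());

            // FSDownload.copy() performs essentially this lookup before
            // localizing the resource; if the staging dir has already been
            // deleted, getFileStatus() throws the FileNotFoundException
            // seen in the RemoteTrace.
            System.out.println(fs.getFileStatus(appTokens));
        }
    }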
[jira] [Commented] (YARN-993) job can not recovery after restart resourcemanager
[ https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723970#comment-13723970 ]

Jason Lowe commented on YARN-993:
---------------------------------

This looks more like a MAPREDUCE issue to me. The MR AM is removing the staging directory when it shouldn't. As [~jianhe] noted, this is probably fixed by YARN-513 / MAPREDUCE-5398, or it could be a duplicate of YARN-917.
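For context on "removing the staging directory when it shouldn't": the fix direction tracked by YARN-513 / MAPREDUCE-5398 is to keep the staging dir alive unless the job is genuinely done with it. Here is a rough sketch of such a guard, with hypothetical names rather than the real MRAppMaster internals:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative sketch only; names are hypothetical, not the actual
    // MRAppMaster code. The idea: never delete the staging dir merely
    // because this AM attempt is exiting; delete it only when the job has
    // truly finished or no further AM retry is possible.
    class StagingCleanupSketch {
        static void cleanupIfSafe(FileSystem fs, Path stagingDir,
                                  boolean jobFinished, boolean isLastAMRetry)
                throws IOException {
            if (jobFinished || isLastAMRetry) {
                fs.delete(stagingDir, true); // no future attempt needs it
            }
            // Otherwise leave the staging dir in place so a restarted RM
            // can launch a new AM attempt that still finds appTokens,
            // job.xml, and the other localized resources.
        }
    }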
[jira] [Commented] (YARN-993) job can not recovery after restart resourcemanager
[ https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723339#comment-13723339 ]

Aaron T. Myers commented on YARN-993:
-------------------------------------

If you're not using RM HA, I wouldn't necessarily expect this to work. But regardless, this sounds like a YARN issue, so moving the JIRA to that project.
[jira] [Commented] (YARN-993) job can not recovery after restart resourcemanager
[ https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723385#comment-13723385 ]

Jian He commented on YARN-993:
------------------------------

[~prophy999] If you are running 2.0.5-alpha, then to test RM restart you need to manually Ctrl-C the submitting command line after you see the message saying the job has been submitted, because the MR client will clean up the staging dir if the RM is not available. This problem has been fixed in YARN-513.
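A quick way to confirm that this cleanup is what you are hitting: after killing the RM and before restarting it, check whether the job's staging directory is still present. A small sketch against the public FileSystem API; the path is the one from the log, and the class name is illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StagingDirSurvivalCheck {
        public static void main(String[] args) throws Exception {
            // Example staging dir from the log; substitute your user and job id.
            Path stagingDir = new Path("hdfs://ns1:8020/tmp/hadoop-yarn/staging/"
                    + "supertool/.staging/job_1374564764970_0001");

            FileSystem fs = stagingDir.getFileSystem(new Configuration());

            // If this prints false after the RM was killed, the MR client/AM
            // has already removed the staging dir, and a restarted RM cannot
            // recover the job -- the symptom described in this issue.
            System.out.println("staging dir exists: " + fs.exists(stagingDir));
        }
    }

If the directory survives until the RM is restarted, recovery of the second AM attempt should be able to localize appTokens and the other staged files.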