[jira] [Commented] (YARN-993) job can not recovery after restart resourcemanager

2013-08-13 Thread prophy Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739219#comment-13739219
 ] 

prophy Yan commented on YARN-993:
-

Jian He, I have tried the patch file from the YARN-513 list, but some errors 
occur when I apply the patch. My test version is hadoop-2.0.5-alpha, so can 
this patch work with this version? Thank you.
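
(For reference, a typical way to apply a JIRA patch to a source tree; the 
attachment id and -p level here are illustrative and depend on how the patch 
was generated:

  # from the root of the hadoop-2.0.5-alpha source tree
  wget https://issues.apache.org/jira/secure/attachment/<id>/YARN-513.patch
  patch -p0 --dry-run < YARN-513.patch   # check for rejected hunks first
  patch -p0 < YARN-513.patch
  mvn clean install -DskipTests          # rebuild the patched modules

If hunks fail to apply, the patch was likely generated against trunk or 
branch-2 rather than the 2.0.5-alpha sources.)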

 job can not recovery after restart resourcemanager
 --

 Key: YARN-993
 URL: https://issues.apache.org/jira/browse/YARN-993
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
 Environment: CentOS5.3 JDK1.7.0_11
Reporter: prophy Yan
Priority: Critical

 Recently, I tested the job-recovery function of the YARN framework, but it 
 failed.
 First, I ran the wordcount example program, and then I kill -9'ed the 
 resourcemanager process on the server when the wordcount job was at map 100%.
 The job exited with an error within a few minutes.
 Second, I restarted the resourcemanager on the server with the 
 'start-yarn.sh' command, but the failed job (wordcount) could not continue.
 The YARN log says the file does not exist! (See the shell sketch after the 
 log excerpt below.)
 Here is the YARN log:
 2013-07-23 16:05:21,472 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done 
 launching container Container: [ContainerId: 
 container_1374564764970_0001_02_01, NodeId: mv8.mzhen.cn:52117, 
 NodeHttpAddress: mv8.mzhen.cn:8042, Resource: memory:2048, vCores:1, 
 Priority: 0, State: NEW, Token: null, Status: container_id {, app_attempt_id 
 {, application_id {, id: 1, cluster_timestamp: 1374564764970, }, attemptId: 
 2, }, id: 1, }, state: C_NEW, ] for AM appattempt_1374564764970_0001_02
 2013-07-23 16:05:21,473 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1374564764970_0001_02 State change from ALLOCATED to LAUNCHED
 2013-07-23 16:05:21,925 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
 appattempt_1374564764970_0001_02 State change from LAUNCHED to FAILED
 2013-07-23 16:05:21,925 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application 
 application_1374564764970_0001 failed 1 times due to AM Container for 
 appattempt_1374564764970_0001_02 exited with  exitCode: -1000 due to: 
 RemoteTrace:
 java.io.FileNotFoundException: File does not exist: 
 hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:815)
 at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176)
 at 
 org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:51)
 at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:284)
 at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:282)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:280)
 at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:51)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
  at LocalTrace:
 org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: 
 File does not exist: 
 hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
 at 
 org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
 at 
 org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:819)
 at 
 
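
For reference, a minimal shell sketch of the reproduction steps above 
(assuming a single-node test cluster; the examples-jar path and the jps/awk 
pipeline for finding the RM pid are illustrative):

  # 1. run the wordcount example
  hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      wordcount /input /output

  # 2. when the client reports "map 100%", kill the ResourceManager hard
  kill -9 $(jps | awk '/ResourceManager/ {print $1}')

  # 3. restart YARN; expected: the job resumes
  sbin/start-yarn.sh
  # observed instead: the AM attempt fails with FileNotFoundException
  # on .staging/job_.../appTokens (see the log above)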

[jira] [Commented] (YARN-993) job can not recovery after restart resourcemanager

2013-07-30 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723970#comment-13723970
 ] 

Jason Lowe commented on YARN-993:
-

This looks more like a MAPREDUCE issue to me.  The MR AM is removing the 
staging directory when it shouldn't.  As [~jianhe] noted, this is probably 
fixed by YARN-513 / MAPREDUCE-5398 or it could be a duplicate of YARN-917.


[jira] [Commented] (YARN-993) job can not recovery after restart resourcemanager

2013-07-29 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723339#comment-13723339
 ] 

Aaron T. Myers commented on YARN-993:
-

If you're not using RM HA, I wouldn't necessarily expect this to work. But, 
regardless, this sounds like a YARN issue, so moving the JIRA to that project.
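
For context, RM restart/recovery must be explicitly enabled; a sketch of the 
relevant yarn-site.xml settings as they appear in later Hadoop 2.x releases 
(the recovery feature was still incomplete in 2.0.5-alpha, so these property 
names may not all be honored there):

  <!-- enable preservation of RM state across restarts -->
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <!-- where application state is persisted; FileSystemRMStateStore
       keeps it on HDFS at the URI below -->
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.fs.state-store.uri</name>
    <value>hdfs://ns1:8020/rmstore</value>
  </property>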


[jira] [Commented] (YARN-993) job can not recovery after restart resourcemanager

2013-07-29 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723385#comment-13723385
 ] 

Jian He commented on YARN-993:
--

[~prophy999] if you are running 2.0.5-alpha, then to test RM restart you need 
to manually Ctrl-C the command line after you submit the job and see the 
message saying the job is submitted, since MR will clean up the staging dir if 
the RM is not available.
This problem has been fixed in YARN-513.
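
A minimal sketch of the sequence described above (the jar name and paths are 
illustrative; on 2.0.5-alpha the submitting client must be killed before the 
RM goes down, or it deletes the staging dir):

  # 1. submit the job from the command line (foreground)
  hadoop jar hadoop-mapreduce-examples-*.jar wordcount /input /output

  # 2. as soon as the client prints "Submitted application application_...",
  #    press Ctrl-C to kill the submitting client so it cannot clean up
  #    the staging dir once the RM becomes unreachable

  # 3. kill -9 the ResourceManager, restart it, and check recovery
  kill -9 $(jps | awk '/ResourceManager/ {print $1}')
  sbin/start-yarn.sh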
