[ https://issues.apache.org/jira/browse/YARN-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227182#comment-14227182 ]
Rohith commented on YARN-2909: ------------------------------ Hi [~cindy2012], which version of hadoop build are you using? In older version i.e 2.4, RM HA introduced.It was not meant for recovering running jobs. Work preserving restart feature done in YARN-556. If you are already using latest version then configuring peroperty {{yarn.resourcemanager.work-preserving-recovery.enabled}} enables work preserving for running jobs. > RM failover lead to failure to recognize previously running application > attempt > ------------------------------------------------------------------------------- > > Key: YARN-2909 > URL: https://issues.apache.org/jira/browse/YARN-2909 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Cindy Li > > Seeing this from client side (Hive job): > 2014-11-25 10:00:50,179 Stage-3 map = 100%, reduce = 99%, Cumulative CPU > 136560.72 sec > 2014-11-25 10:00:54,776 Stage-3 map = 100%, reduce = 0%, Cumulative CPU > 42627.6 sec > 2014-11-25 10:01:20,097 Stage-3 map = 0%, reduce = 0% > 2014-11-25 10:02:20,348 Stage-3 map = 0%, reduce = 0% > 2014-11-25 10:02:30,702 Stage-3 map = 1%, reduce = 0%, Cumulative CPU > 1511.98 sec > Seeing this on resource manager (rm2): > [14:16]:[hadoop@phxaishdc20en0008-be:logs]# grep > container_1416845430616_0009_01_000001 * > hadoop-hadoop-resourcemanager.log.2014-11-25-10:2014-11-25 10:01:09,757 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: > received container statuses on node manager register :[ContainerStatus: > [ContainerId: container_1416845430616_0009_01_000001, State: COMPLETE, > Diagnostics: , ExitStatus: 0, ], ContainerStatus: [ContainerId: > container_1416845430616_0014_01_000241, State: COMPLETE, Diagnostics: > Container Killed by ResourceManager > Seeing this in container log (Logs for container_1416845430616_0009_01_000001) > 2014-11-25 10:59:17,839 INFO [IPC Server handler 2 on 36552] > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt > attempt_1416845430616_0009_r_000000_0 is : 0.9893525 > 2014-11-25 10:59:47,905 INFO [IPC Server handler 26 on 36552] > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt > attempt_1416845430616_0009_r_000000_0 is : 0.9893525 > 2014-11-25 11:00:01,672 INFO [RMCommunicator Allocator] > org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > 2014-11-25 11:00:17,977 INFO [IPC Server handler 5 on 36552] > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt > attempt_1416845430616_0009_r_000000_0 is : 0.9893525 > 2014-11-25 11:00:36,882 INFO [RMCommunicator Allocator] > org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over > to rm1 > 2014-11-25 11:00:48,044 INFO [IPC Server handler 2 on 36552] > org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt > attempt_1416845430616_0009_r_000000_0 is : 0.9893525 > 2014-11-25 11:00:52,901 INFO [RMCommunicator Allocator] > org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > 2014-11-25 11:00:52,912 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Error > communicating with RM: Resource Manager doesn't recognize AttemptId: > application_1416845430616_0009 > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Resource Manager > doesn't recognize AttemptId: application_1416845430616_0009 > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:580) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:220) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:268) > at java.lang.Thread.run(Thread.java:745) > 2014-11-25 11:00:52,915 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: > job_1416845430616_0009Job Transitioned from RUNNING to REBOOT > 2014-11-25 11:00:52,916 INFO [Thread-3928] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify RMCommunicator > isAMLastRetry: false > 2014-11-25 11:00:52,916 INFO [Thread-3928] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: RMCommunicator > notified that shouldUnregistered is: false > 2014-11-25 11:00:52,916 INFO [Thread-3928] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Notify JHEH isAMLastRetry: > false > 2014-11-25 11:00:52,916 INFO [Thread-3928] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: > JobHistoryEventHandler notified that forceJobCompletion is false > 2014-11-25 11:00:52,916 INFO [Thread-3928] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Calling stop for all the > services > 2014-11-25 11:00:52,917 INFO [Thread-3928] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping > JobHistoryEventHandler. Size of the outstanding queue size is 0 > 2014-11-25 11:00:52,946 INFO [Thread-3928] > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopped > JobHistoryEventHandler. super.stop() > 2014-11-25 11:00:52,947 INFO [Thread-3928] > org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: KILLING > attempt_1416845430616_0009_r_000000_0 > 2014-11-25 11:00:52,955 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1416845430616_0009_r_000000_0 TaskAttempt Transitioned from RUNNING > to KILLED > 2014-11-25 11:00:52,955 INFO [AsyncDispatcher event handler] > org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: > attempt_1416845430616_0009_r_000000_1 TaskAttempt Transitioned from NEW to > UNASSIGNED > 2014-11-25 11:00:52,958 INFO [Thread-3928] > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Final Stats: > PendingReds:0 ScheduledMaps:0 ScheduledReds:0 AssignedMaps:0 AssignedReds:1 > CompletedMaps:414 CompletedReds:0 ContAlloc:827 ContRel:411 HostLocal:298 > RackLocal:108 > 2014-11-25 11:00:52,961 INFO [Thread-3928] > org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Skipping cleaning up the > staging dir. assuming AM will be retried. > 2014-11-25 11:00:52,962 INFO [Thread-3928] org.apache.hadoop.ipc.Server: > Stopping server on 36552 > 2014-11-25 11:00:52,968 INFO [IPC Server listener on 36552] > org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 36552 > 2014-11-25 11:00:52,968 INFO [IPC Server Responder] > org.apache.hadoop.ipc.Server: Stopping IPC Server Responder > 2014-11-25 11:00:52,968 INFO [TaskHeartbeatHandler PingChecker] > org.apache.hadoop.mapreduce.v2.app.TaskHeartbeatHandler: TaskHeartbeatHandler > thread interrupted -- This message was sent by Atlassian JIRA (v6.3.4#6332)