[ 
https://issues.apache.org/jira/browse/YARN-9064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zbigniew Kostrzewa updated YARN-9064:
-------------------------------------
    Description: 
I have a 31-node Hadoop 2.6.0 cluster. The cluster is secured with Kerberos 
and configured in HA. The first 3 nodes hold both slave and master services:
 * Node-1: NameNode, ResourceManager, JournalNode, ZKFC, MapRed Job History 
Server, DataNode, NodeManager, ZooKeeper and Kerberos
 * Node-2: NameNode, ResourceManager, JournalNode, ZKFC, DataNode, NodeManager, 
ZooKeeper and Kerberos
 * Node-3: JournalNode, DataNode, NodeManager and ZooKeeper
 * Node-4..Node-31: DataNode and NodeManager

At one point there was a problem with the switch the nodes were connected to, 
and all the services started losing connectivity.
# At first, Kerberos stopped granting any tickets.
# This broke the cluster, as Hadoop services could not authenticate to each 
other.
# At some point the ZooKeeper cluster lost its leader and started a re-election.
# This resulted in multiple ZooKeeper-related errors and warnings in the 
ResourceManager and ZKFC logs.
# After a while, when the issue with the switch was resolved, most of the 
services recovered automatically.
# "Most" except YARN:
## both ResourceManagers were stuck in standby mode
## all NodeManagers were shut down
# I managed to recover YARN, but it required a manual restart of both 
ResourceManagers (and starting all NodeManagers).

I have all the logs from the incident, but the most important seem to be these:
{noformat}
2018-11-16 03:21:16,420 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
Unregistering app attempt : appattempt_1539778834071_0622_000001
2018-11-16 03:21:16,424 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: 
Application finished, removing password for appattempt_1539778834071_0622_000001
2018-11-16 03:21:16,424 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_1539778834071_0622 State change from NEW to ACCEPTED on event = 
RECOVER
2018-11-16 03:21:16,424 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully 
recovered 622 out of 622 applications
2018-11-16 03:21:16,424 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of 
failed attempts is 0. The max attempts is 1
2018-11-16 03:21:16,424 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery ended
2018-11-16 03:21:16,425 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
Registering app attempt : appattempt_1539778834071_0622_000002
2018-11-16 03:21:16,426 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1539778834071_0622_000002 State change from NEW to SUBMITTED on 
event = START
2018-11-16 03:21:16,427 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager:
 Rolling master-key for container-tokens
2018-11-16 03:21:16,427 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM:
 Rolling master-key for nm-tokens
2018-11-16 03:21:16,427 INFO 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Updating the current master key for generating delegation tokens
2018-11-16 03:21:16,427 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager:
 storing master key with keyID 32
2018-11-16 03:21:16,427 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing 
RMDTMasterKey.
2018-11-16 03:21:16,440 INFO 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Starting expired delegation token remover thread, tokenRemoverScanInterval=60 
min(s)
2018-11-16 03:21:16,441 INFO 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Updating the current master key for generating delegation tokens
2018-11-16 03:21:16,444 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager:
 storing master key with keyID 33
2018-11-16 03:21:16,445 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing 
RMDTMasterKey.
2018-11-16 03:21:16,458 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: 
Application application_1539778834071_0622 requests cleared
2018-11-16 03:21:16,459 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: 
Added Application Attempt appattempt_1539778834071_0622_000002 to scheduler 
from user packer
2018-11-16 03:21:16,459 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1539778834071_0622_000002 State change from SUBMITTED to SCHEDULED 
on event = ATTEMPT_ADDED
2018-11-16 03:21:16,459 INFO org.apache.hadoop.ipc.CallQueueManager: Using 
callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 5000
2018-11-16 03:21:16,460 INFO org.apache.hadoop.service.AbstractService: Service 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService failed in 
state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.io.IOException: Failed on local exception: java.net.SocketException: 
Unresolved address; Host Details : local host is: "node-2.mydomain.com"; 
destination host is: (unknown):0; 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: 
Failed on local exception: java.net.SocketException: Unresolved address; Host 
Details : local host is: "node-2.mydomain.com"; destination host is: 
(unknown):0; 
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
                at 
org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
                at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
                at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
                at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
                at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:422)
                at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
                at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
                at 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.io.IOException: Failed on local exception: 
java.net.SocketException: Unresolved address; Host Details : local host is: 
"node-2.mydomain.com"; destination host is: (unknown):0; 
                at 
org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
                at org.apache.hadoop.ipc.Server.bind(Server.java:522)
                at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
                at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
                at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
                at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
                at 
org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
                at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
                ... 20 more
Caused by: java.net.SocketException: Unresolved address
                at sun.nio.ch.Net.translateToSocketException(Net.java:131)
                at sun.nio.ch.Net.translateException(Net.java:157)
                at sun.nio.ch.Net.translateException(Net.java:163)
                at 
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
                at org.apache.hadoop.ipc.Server.bind(Server.java:505)
                ... 28 more
Caused by: java.nio.channels.UnresolvedAddressException
                at sun.nio.ch.Net.checkAddress(Net.java:101)
                at 
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
                at 
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
                ... 29 more
2018-11-16 03:21:16,464 INFO org.apache.hadoop.service.AbstractService: Service 
RMActiveServices failed in state STARTED; cause: 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: 
Failed on local exception: java.net.SocketException: Unresolved address; Host 
Details : local host is: "node-2.mydomain.com"; destination host is: 
(unknown):0; 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: 
Failed on local exception: java.net.SocketException: Unresolved address; Host 
Details : local host is: "node-2.mydomain.com"; destination host is: 
(unknown):0; 
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
                at 
org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
                at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
                at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
                at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
                at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:422)
                at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
                at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
                at 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.io.IOException: Failed on local exception: 
java.net.SocketException: Unresolved address; Host Details : local host is: 
"node-2.mydomain.com"; destination host is: (unknown):0; 
                at 
org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
                at org.apache.hadoop.ipc.Server.bind(Server.java:522)
                at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
                at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
                at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
                at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
                at 
org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
                at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
                ... 20 more
Caused by: java.net.SocketException: Unresolved address
                at sun.nio.ch.Net.translateToSocketException(Net.java:131)
                at sun.nio.ch.Net.translateException(Net.java:157)
                at sun.nio.ch.Net.translateException(Net.java:163)
                at 
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
                at org.apache.hadoop.ipc.Server.bind(Server.java:505)
                ... 28 more
Caused by: java.nio.channels.UnresolvedAddressException
                at sun.nio.ch.Net.checkAddress(Net.java:101)
                at 
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
                at 
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
                ... 29 more
2018-11-16 03:21:16,470 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Returning, 
interrupted : java.lang.InterruptedException
2018-11-16 03:21:16,471 INFO 
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer
 thread interrupted
2018-11-16 03:21:16,471 INFO 
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor 
thread interrupted
2018-11-16 03:21:16,471 INFO 
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: NMLivelinessMonitor 
thread interrupted
2018-11-16 03:21:16,472 INFO 
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor 
thread interrupted
2018-11-16 03:21:16,472 ERROR 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2018-11-16 03:21:16,473 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
Stopping ResourceManager metrics system...
2018-11-16 03:21:16,475 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
ResourceManager metrics system stopped.
2018-11-16 03:21:16,475 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
ResourceManager metrics system shutdown complete.
2018-11-16 03:21:16,475 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
AsyncDispatcher is draining to stop, igonring any new events.
2018-11-16 03:21:16,477 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread
 thread interrupted! Exiting!
2018-11-16 03:21:16,487 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x3671a89731f0000 closed
2018-11-16 03:21:16,488 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2018-11-16 03:21:16,489 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
2018-11-16 03:21:16,490 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM:
 NMTokenKeyRollingInterval: 86400000ms and NMTokenKeyActivationDelay: 900000ms
2018-11-16 03:21:16,490 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager:
 ContainerTokenKeyRollingInterval: 86400000ms and 
ContainerTokenKeyActivationDelay: 900000ms
2018-11-16 03:21:16,490 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: 
AMRMTokenKeyRollingInterval: 86400000ms and AMRMTokenKeyActivationDelay: 900000 
ms
2018-11-16 03:21:16,490 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreFactory: 
Using RMStateStore implementation - class 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreEventType 
for class 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEventType for 
class org.apache.hadoop.yarn.server.resourcemanager.NodesListManager
2018-11-16 03:21:16,491 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Using Scheduler: 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType
 for class 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType for class 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEventType
 for class 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType for class 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher
2018-11-16 03:21:16,492 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: 
loaded properties from hadoop-metrics2.properties
2018-11-16 03:21:16,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
Scheduled snapshot period at 10 second(s).
2018-11-16 03:21:16,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
ResourceManager metrics system started
2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEventType for class 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager
2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType 
for class 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher
2018-11-16 03:21:16,494 WARN org.apache.hadoop.metrics2.util.MBeans: Failed to 
register MBean "Hadoop:service=ResourceManager,name=RMNMInfo": Instance already 
exists.
2018-11-16 03:21:16,494 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMNMInfo: Registered RMNMInfo 
MBean
2018-11-16 03:21:16,494 INFO 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: 
YARN system metrics publishing service is not enabled
2018-11-16 03:21:16,494 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2018-11-16 03:21:16,496 WARN 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   
OPERATION=transitionToActive    TARGET=RMHAProtocolService      RESULT=FAILURE  
DESCRIPTION=Exception transitioning to active   PERMISSIONS=
2018-11-16 03:21:16,497 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
                at 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:134)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
transitioning to Active mode
                at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:311)
                at 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
                ... 4 more
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.io.IOException: Failed on local exception: java.net.SocketException: 
Unresolved address; Host Details : local host is: "node-2.mydomain.com"; 
destination host is: (unknown):0; 
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
                at 
org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
                at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
                at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
                at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
                at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:422)
                at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
                at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
                ... 5 more
Caused by: java.io.IOException: Failed on local exception: 
java.net.SocketException: Unresolved address; Host Details : local host is: 
"node-2.mydomain.com"; destination host is: (unknown):0; 
                at 
org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
                at org.apache.hadoop.ipc.Server.bind(Server.java:522)
                at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
                at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
                at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
                at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
                at 
org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
                at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
                ... 20 more
Caused by: java.net.SocketException: Unresolved address
                at sun.nio.ch.Net.translateToSocketException(Net.java:131)
                at sun.nio.ch.Net.translateException(Net.java:157)
                at sun.nio.ch.Net.translateException(Net.java:163)
                at 
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
                at org.apache.hadoop.ipc.Server.bind(Server.java:505)
                ... 28 more
Caused by: java.nio.channels.UnresolvedAddressException
                at sun.nio.ch.Net.checkAddress(Net.java:101)
                at 
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
                at 
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
                ... 29 more
2018-11-16 03:21:16,497 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying 
to re-establish ZK session
2018-11-16 03:21:16,511 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x36681eb8c720002 closed
2018-11-16 03:21:17,513 INFO org.apache.zookeeper.ZooKeeper: Initiating client 
connection, 
connectString=node-1.mydomain.com:2181,node-1.mydomain.com:2181,node-1.mydomain.com:2181
 sessionTimeout=10000 
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@655d597b
2018-11-16 03:21:17,513 ERROR org.apache.zookeeper.client.StaticHostProvider: 
Unable to connect to server: node-2.mydomain.com:2181
java.net.UnknownHostException: node-2.mydomain.com
                at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
                at java.net.InetAddress.getAllByName(InetAddress.java:1192)
                at java.net.InetAddress.getAllByName(InetAddress.java:1126)
                at 
org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
                at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
                at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:630)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:774)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.reEstablishSession(ActiveStandbyElector.java:749)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.joinElectionInternal(ActiveStandbyElector.java:660)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.reJoinElection(ActiveStandbyElector.java:688)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.reJoinElectionAfterFailureToBecomeActive(ActiveStandbyElector.java:530)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:484)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2018-11-16 03:21:17,559 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server node-3.mydomain.com/10.242.1.106:2181. Will not attempt to 
authenticate using SASL (unknown error)
2018-11-16 03:21:17,560 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established, initiating session, client: /10.242.1.105:46773, server: 
node-3.mydomain.com/10.242.1.106:2181
2018-11-16 03:21:17,573 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server node-3.mydomain.com/10.242.1.106:2181, 
sessionid = 0x3671a89731f0003, negotiated timeout = 10000
2018-11-16 03:21:17,575 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
connected.
2018-11-16 03:21:17,575 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x36681eb8c720002
2018-11-16 03:21:17,575 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2018-11-16 03:21:17,585 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/hadoop-2.6.0-cdh5.14.0/etc/hadoop/yarn-site.xml
2018-11-16 03:21:17,588 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   
OPERATION=refreshAdminAcls      TARGET=AdminService     RESULT=SUCCESS
2018-11-16 03:21:17,588 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Already in 
standby state
2018-11-16 03:21:17,588 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   
OPERATION=transitionToStandby   TARGET=RMHAProtocolService      RESULT=SUCCESS
2018-11-16 03:30:57,669 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: 
Release request cache is cleaned up
2018-11-16 03:31:16,496 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: 
Release request cache is cleaned up
2018-11-19 13:35:36,554 WARN org.apache.hadoop.security.UserGroupInformation: 
PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM 
(auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed 
[Caused by GSSException: No valid credentials provided (Mechanism level: Failed 
to find any Kerberos tgt)]
2018-11-19 13:35:39,353 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
2018-11-19 13:35:39,357 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for packer/node-2.mydomain.com@SA_REALM 
(auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
2018-11-19 13:35:45,785 WARN 
org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
AuthenticationToken ignored: 
org.apache.hadoop.security.authentication.util.SignerException: Invalid 
signature
2018-11-21 08:29:19,995 WARN 
org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
AuthenticationToken ignored: 
org.apache.hadoop.security.authentication.util.SignerException: Invalid 
signature
2018-11-21 08:29:20,001 WARN org.apache.hadoop.security.UserGroupInformation: 
PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM 
(auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed 
[Caused by GSSException: No valid credentials provided (Mechanism level: Failed 
to find any Kerberos tgt)]
2018-11-21 08:29:23,662 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
2018-11-21 08:29:23,666 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for packer/node-2.mydomain.com@SA_REALM 
(auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
2018-11-21 08:31:37,254 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
2018-11-21 08:31:37,258 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for packer/node-2.mydomain.com@SA_REALM 
(auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
{noformat}
I have found a few tickets about race conditions in YARN that surface when 
connections to ZooKeeper fail, but either they should already have been fixed 
in 2.6.0 or their logs don't match mine.
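
For reference, the java.nio.channels.UnresolvedAddressException at the bottom of the "Unresolved address" stack traces in the log can be reproduced outside Hadoop. The following is only a minimal Java sketch, not Hadoop code: the class name UnresolvedBindDemo and the helper bindFailsAsUnresolved are made up for illustration, and InetSocketAddress.createUnresolved is used here merely to simulate a hostname lookup that had failed when the address object was created. It shows that binding a server channel to such an address is rejected by the address check, which seems consistent with ResourceTrackerService failing to start while name resolution was unavailable.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.UnresolvedAddressException;

// Hypothetical demo class, not part of Hadoop.
public class UnresolvedBindDemo {

    // Returns true if binding a server channel to addr is rejected
    // because the address is unresolved, false if the bind succeeds.
    public static boolean bindFailsAsUnresolved(InetSocketAddress addr) {
        try (ServerSocketChannel channel = ServerSocketChannel.open()) {
            channel.bind(addr);
            return false;
        } catch (UnresolvedAddressException e) {
            // Same exception type as the innermost "Caused by" in the log.
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumption: simulate a lookup that failed earlier by creating
        // the address without any DNS resolution (port 0 = any free port).
        InetSocketAddress addr =
                InetSocketAddress.createUnresolved("node-2.mydomain.com", 0);
        System.out.println("unresolved: " + addr.isUnresolved());
        System.out.println("bind rejected: " + bindFailsAsUnresolved(addr));
    }
}
```

An InetSocketAddress is immutable, so an instance created while the lookup was failing stays unresolved even after DNS recovers. Whether the ResourceManager reuses such an instance across transitionToActive attempts is something I have not verified.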

  was:
I have a 31-node Hadoop 2.6.0 cluster. The cluster is secured with Kerberos 
and configured in HA. The first 3 nodes hold both slave and master services:
 * Node-1: NameNode, ResourceManager, JournalNode, ZKFC, MapRed Job History 
Server, DataNode, NodeManager, ZooKeeper and Kerberos
 * Node-2: NameNode, ResourceManager, JournalNode, ZKFC, DataNode, NodeManager, 
ZooKeeper and Kerberos
 * Node-3: JournalNode, DataNode, NodeManager and ZooKeeper
 * Node-4..Node-31: DataNode and NodeManager

At one point there was a problem with the switch the nodes were connected to 
and all the services started losing connectivity.
 1. At first Kerberos stopped granting any tickets.
 2. This broke the cluster as Hadoop services could not authenticate to each 
other.
 3. At some point the ZooKeeper cluster lost its leader and started re-election.
 4. This resulted in multiple ZooKeeper-related errors and warnings in 
ResourceManager and ZKFC logs.
 5. After a while, when the issue with the switch was resolved, most of the 
services recovered automatically.
 6. "Most" except YARN:
 a. both ResourceManagers were stuck in standby mode
 b. all NodeManagers were shut down
 7. I managed to recover YARN, but it required a manual restart of both 
ResourceManagers (and starting all NodeManagers).

I have all the logs from the incident, but the most important ones seem to be these:
{noformat}
2018-11-16 03:21:16,420 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
Unregistering app attempt : appattempt_1539778834071_0622_000001
2018-11-16 03:21:16,424 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: 
Application finished, removing password for appattempt_1539778834071_0622_000001
2018-11-16 03:21:16,424 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
application_1539778834071_0622 State change from NEW to ACCEPTED on event = 
RECOVER
2018-11-16 03:21:16,424 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully 
recovered 622 out of 622 applications
2018-11-16 03:21:16,424 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of 
failed attempts is 0. The max attempts is 1
2018-11-16 03:21:16,424 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery ended
2018-11-16 03:21:16,425 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
Registering app attempt : appattempt_1539778834071_0622_000002
2018-11-16 03:21:16,426 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1539778834071_0622_000002 State change from NEW to SUBMITTED on 
event = START
2018-11-16 03:21:16,427 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager:
 Rolling master-key for container-tokens
2018-11-16 03:21:16,427 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM:
 Rolling master-key for nm-tokens
2018-11-16 03:21:16,427 INFO 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Updating the current master key for generating delegation tokens
2018-11-16 03:21:16,427 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager:
 storing master key with keyID 32
2018-11-16 03:21:16,427 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing 
RMDTMasterKey.
2018-11-16 03:21:16,440 INFO 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Starting expired delegation token remover thread, tokenRemoverScanInterval=60 
min(s)
2018-11-16 03:21:16,441 INFO 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Updating the current master key for generating delegation tokens
2018-11-16 03:21:16,444 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager:
 storing master key with keyID 33
2018-11-16 03:21:16,445 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing 
RMDTMasterKey.
2018-11-16 03:21:16,458 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: 
Application application_1539778834071_0622 requests cleared
2018-11-16 03:21:16,459 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: 
Added Application Attempt appattempt_1539778834071_0622_000002 to scheduler 
from user packer
2018-11-16 03:21:16,459 INFO 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
appattempt_1539778834071_0622_000002 State change from SUBMITTED to SCHEDULED 
on event = ATTEMPT_ADDED
2018-11-16 03:21:16,459 INFO org.apache.hadoop.ipc.CallQueueManager: Using 
callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 5000
2018-11-16 03:21:16,460 INFO org.apache.hadoop.service.AbstractService: Service 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService failed in 
state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.io.IOException: Failed on local exception: java.net.SocketException: 
Unresolved address; Host Details : local host is: "node-2.mydomain.com"; 
destination host is: (unknown):0; 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: 
Failed on local exception: java.net.SocketException: Unresolved address; Host 
Details : local host is: "node-2.mydomain.com"; destination host is: 
(unknown):0; 
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
                at 
org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
                at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
                at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
                at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
                at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:422)
                at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
                at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
                at 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.io.IOException: Failed on local exception: 
java.net.SocketException: Unresolved address; Host Details : local host is: 
"node-2.mydomain.com"; destination host is: (unknown):0; 
                at 
org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
                at org.apache.hadoop.ipc.Server.bind(Server.java:522)
                at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
                at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
                at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
                at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
                at 
org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
                at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
                ... 20 more
Caused by: java.net.SocketException: Unresolved address
                at sun.nio.ch.Net.translateToSocketException(Net.java:131)
                at sun.nio.ch.Net.translateException(Net.java:157)
                at sun.nio.ch.Net.translateException(Net.java:163)
                at 
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
                at org.apache.hadoop.ipc.Server.bind(Server.java:505)
                ... 28 more
Caused by: java.nio.channels.UnresolvedAddressException
                at sun.nio.ch.Net.checkAddress(Net.java:101)
                at 
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
                at 
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
                ... 29 more
2018-11-16 03:21:16,464 INFO org.apache.hadoop.service.AbstractService: Service 
RMActiveServices failed in state STARTED; cause: 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: 
Failed on local exception: java.net.SocketException: Unresolved address; Host 
Details : local host is: "node-2.mydomain.com"; destination host is: 
(unknown):0; 
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: 
Failed on local exception: java.net.SocketException: Unresolved address; Host 
Details : local host is: "node-2.mydomain.com"; destination host is: 
(unknown):0; 
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
                at 
org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
                at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
                at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
                at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
                at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:422)
                at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
                at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
                at 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.io.IOException: Failed on local exception: 
java.net.SocketException: Unresolved address; Host Details : local host is: 
"node-2.mydomain.com"; destination host is: (unknown):0; 
                at 
org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
                at org.apache.hadoop.ipc.Server.bind(Server.java:522)
                at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
                at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
                at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
                at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
                at 
org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
                at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
                ... 20 more
Caused by: java.net.SocketException: Unresolved address
                at sun.nio.ch.Net.translateToSocketException(Net.java:131)
                at sun.nio.ch.Net.translateException(Net.java:157)
                at sun.nio.ch.Net.translateException(Net.java:163)
                at 
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
                at org.apache.hadoop.ipc.Server.bind(Server.java:505)
                ... 28 more
Caused by: java.nio.channels.UnresolvedAddressException
                at sun.nio.ch.Net.checkAddress(Net.java:101)
                at 
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
                at 
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
                ... 29 more
2018-11-16 03:21:16,470 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Returning, 
interrupted : java.lang.InterruptedException
2018-11-16 03:21:16,471 INFO 
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer
 thread interrupted
2018-11-16 03:21:16,471 INFO 
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor 
thread interrupted
2018-11-16 03:21:16,471 INFO 
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: NMLivelinessMonitor 
thread interrupted
2018-11-16 03:21:16,472 INFO 
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor 
thread interrupted
2018-11-16 03:21:16,472 ERROR 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2018-11-16 03:21:16,473 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
Stopping ResourceManager metrics system...
2018-11-16 03:21:16,475 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
ResourceManager metrics system stopped.
2018-11-16 03:21:16,475 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
ResourceManager metrics system shutdown complete.
2018-11-16 03:21:16,475 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
AsyncDispatcher is draining to stop, igonring any new events.
2018-11-16 03:21:16,477 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread
 thread interrupted! Exiting!
2018-11-16 03:21:16,487 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x3671a89731f0000 closed
2018-11-16 03:21:16,488 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2018-11-16 03:21:16,489 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
2018-11-16 03:21:16,490 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM:
 NMTokenKeyRollingInterval: 86400000ms and NMTokenKeyActivationDelay: 900000ms
2018-11-16 03:21:16,490 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager:
 ContainerTokenKeyRollingInterval: 86400000ms and 
ContainerTokenKeyActivationDelay: 900000ms
2018-11-16 03:21:16,490 INFO 
org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: 
AMRMTokenKeyRollingInterval: 86400000ms and AMRMTokenKeyActivationDelay: 900000 
ms
2018-11-16 03:21:16,490 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreFactory: 
Using RMStateStore implementation - class 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreEventType 
for class 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEventType for 
class org.apache.hadoop.yarn.server.resourcemanager.NodesListManager
2018-11-16 03:21:16,491 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Using Scheduler: 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType
 for class 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType for class 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEventType
 for class 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher
2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType for class 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher
2018-11-16 03:21:16,492 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: 
loaded properties from hadoop-metrics2.properties
2018-11-16 03:21:16,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
Scheduled snapshot period at 10 second(s).
2018-11-16 03:21:16,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: 
ResourceManager metrics system started
2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEventType for class 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager
2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
Registering class 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType 
for class 
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher
2018-11-16 03:21:16,494 WARN org.apache.hadoop.metrics2.util.MBeans: Failed to 
register MBean "Hadoop:service=ResourceManager,name=RMNMInfo": Instance already 
exists.
2018-11-16 03:21:16,494 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMNMInfo: Registered RMNMInfo 
MBean
2018-11-16 03:21:16,494 INFO 
org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: 
YARN system metrics publishing service is not enabled
2018-11-16 03:21:16,494 INFO org.apache.hadoop.util.HostsFileReader: Refreshing 
hosts (include/exclude) list
2018-11-16 03:21:16,496 WARN 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   
OPERATION=transitionToActive    TARGET=RMHAProtocolService      RESULT=FAILURE  
DESCRIPTION=Exception transitioning to active   PERMISSIONS=
2018-11-16 03:21:16,497 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
                at 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:134)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
transitioning to Active mode
                at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:311)
                at 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
                ... 4 more
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.io.IOException: Failed on local exception: java.net.SocketException: 
Unresolved address; Host Details : local host is: "node-2.mydomain.com"; 
destination host is: (unknown):0; 
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
                at 
org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
                at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
                at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
                at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
                at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:422)
                at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
                at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
                at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
                ... 5 more
Caused by: java.io.IOException: Failed on local exception: 
java.net.SocketException: Unresolved address; Host Details : local host is: 
"node-2.mydomain.com"; destination host is: (unknown):0; 
                at 
org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
                at org.apache.hadoop.ipc.Server.bind(Server.java:522)
                at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
                at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
                at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
                at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
                at 
org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
                at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
                at 
org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
                ... 20 more
Caused by: java.net.SocketException: Unresolved address
                at sun.nio.ch.Net.translateToSocketException(Net.java:131)
                at sun.nio.ch.Net.translateException(Net.java:157)
                at sun.nio.ch.Net.translateException(Net.java:163)
                at 
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
                at org.apache.hadoop.ipc.Server.bind(Server.java:505)
                ... 28 more
Caused by: java.nio.channels.UnresolvedAddressException
                at sun.nio.ch.Net.checkAddress(Net.java:101)
                at 
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
                at 
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
                ... 29 more
2018-11-16 03:21:16,497 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying 
to re-establish ZK session
2018-11-16 03:21:16,511 INFO org.apache.zookeeper.ZooKeeper: Session: 
0x36681eb8c720002 closed
2018-11-16 03:21:17,513 INFO org.apache.zookeeper.ZooKeeper: Initiating client 
connection, 
connectString=node-1.mydomain.com:2181,node-1.mydomain.com:2181,node-1.mydomain.com:2181
 sessionTimeout=10000 
watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@655d597b
2018-11-16 03:21:17,513 ERROR org.apache.zookeeper.client.StaticHostProvider: 
Unable to connect to server: node-2.mydomain.com:2181
java.net.UnknownHostException: node-2.mydomain.com
                at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
                at java.net.InetAddress.getAllByName(InetAddress.java:1192)
                at java.net.InetAddress.getAllByName(InetAddress.java:1126)
                at 
org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
                at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
                at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:630)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:774)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.reEstablishSession(ActiveStandbyElector.java:749)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.joinElectionInternal(ActiveStandbyElector.java:660)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.reJoinElection(ActiveStandbyElector.java:688)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.reJoinElectionAfterFailureToBecomeActive(ActiveStandbyElector.java:530)
                at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:484)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
                at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2018-11-16 03:21:17,559 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server node-3.mydomain.com/10.242.1.106:2181. Will not attempt to 
authenticate using SASL (unknown error)
2018-11-16 03:21:17,560 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established, initiating session, client: /10.242.1.105:46773, server: 
node-3.mydomain.com/10.242.1.106:2181
2018-11-16 03:21:17,573 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server node-3.mydomain.com/10.242.1.106:2181, 
sessionid = 0x3671a89731f0003, negotiated timeout = 10000
2018-11-16 03:21:17,575 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session 
connected.
2018-11-16 03:21:17,575 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
Ignoring stale result from old client with sessionId 0x36681eb8c720002
2018-11-16 03:21:17,575 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
2018-11-16 03:21:17,585 INFO org.apache.hadoop.conf.Configuration: found 
resource yarn-site.xml at file:/hadoop-2.6.0-cdh5.14.0/etc/hadoop/yarn-site.xml
2018-11-16 03:21:17,588 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   
OPERATION=refreshAdminAcls      TARGET=AdminService     RESULT=SUCCESS
2018-11-16 03:21:17,588 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Already in 
standby state
2018-11-16 03:21:17,588 INFO 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   
OPERATION=transitionToStandby   TARGET=RMHAProtocolService      RESULT=SUCCESS
2018-11-16 03:30:57,669 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: 
Release request cache is cleaned up
2018-11-16 03:31:16,496 INFO 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: 
Release request cache is cleaned up
2018-11-19 13:35:36,554 WARN org.apache.hadoop.security.UserGroupInformation: 
PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM 
(auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed 
[Caused by GSSException: No valid credentials provided (Mechanism level: Failed 
to find any Kerberos tgt)]
2018-11-19 13:35:39,353 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
2018-11-19 13:35:39,357 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for packer/node-2.mydomain.com@SA_REALM 
(auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
2018-11-19 13:35:45,785 WARN 
org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
AuthenticationToken ignored: 
org.apache.hadoop.security.authentication.util.SignerException: Invalid 
signature
2018-11-21 08:29:19,995 WARN 
org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
AuthenticationToken ignored: 
org.apache.hadoop.security.authentication.util.SignerException: Invalid 
signature
2018-11-21 08:29:20,001 WARN org.apache.hadoop.security.UserGroupInformation: 
PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM 
(auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed 
[Caused by GSSException: No valid credentials provided (Mechanism level: Failed 
to find any Kerberos tgt)]
2018-11-21 08:29:23,662 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
2018-11-21 08:29:23,666 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for packer/node-2.mydomain.com@SA_REALM 
(auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
2018-11-21 08:31:37,254 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth 
successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
2018-11-21 08:31:37,258 INFO 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
 Authorization successful for packer/node-2.mydomain.com@SA_REALM 
(auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
{noformat}
I have found a few tickets about race conditions in YARN that surface when 
connections to ZooKeeper fail, but either they should already have been fixed 
in 2.6.0 or the logs don't match.
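
For illustration, the innermost exception in the logs 
(java.nio.channels.UnresolvedAddressException thrown from 
ServerSocketChannelImpl.bind) can be reproduced outside Hadoop. This is only a 
minimal sketch, not Hadoop code: the class and method names are hypothetical, 
and InetSocketAddress.createUnresolved stands in for a hostname lookup that 
failed, which is an assumption about what happened on node-2 while the network 
was down.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.UnresolvedAddressException;

// Hypothetical demo class; not part of Hadoop.
public class UnresolvedBindDemo {

    /** Returns true if binding a server channel to the address fails because it is unresolved. */
    public static boolean bindFailsAsUnresolved(InetSocketAddress address) throws IOException {
        try (ServerSocketChannel channel = ServerSocketChannel.open()) {
            // The address is validated before any socket-level bind is attempted.
            channel.bind(address);
            return false;
        } catch (UnresolvedAddressException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // createUnresolved skips the DNS lookup, simulating a lookup that failed.
        InetSocketAddress unresolved =
                InetSocketAddress.createUnresolved("node-2.mydomain.com", 0);
        System.out.println("unresolved: " + unresolved.isUnresolved());
        System.out.println("bind fails: " + bindFailsAsUnresolved(unresolved));
    }
}
```

If this is what happened, the bind failure is a consequence of the local 
hostname not resolving at the moment the ResourceManager tried to become 
active, rather than of a port conflict.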


> Both Resource Managers stay in standby after connection to ZooKeeper was 
> recovered
> ----------------------------------------------------------------------------------
>
>                 Key: YARN-9064
>                 URL: https://issues.apache.org/jira/browse/YARN-9064
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, yarn
>    Affects Versions: 2.6.0
>         Environment: * cluster of 31 nodes
> * each node is a VM with 60GB of RAM and 8 vcpus
> * each VM is running CentOS 7.2 with Hadoop 2.6.0
> * Hadoop cluster is secured with Kerberos
> * Hadoop cluster is configured with HA
>            Reporter: Zbigniew Kostrzewa
>            Priority: Major
>
> I have a 31-node Hadoop 2.6.0 cluster. The cluster is secured with Kerberos 
> and configured in HA. The first 3 nodes hold both slave and master services:
>  * Node-1: NameNode, ResourceManager, JournalNode, ZKFC, MapRed Job History 
> Server, DataNode, NodeManager, ZooKeeper and Kerberos
>  * Node-2: NameNode, ResourceManager, JournalNode, ZKFC, DataNode, 
> NodeManager, ZooKeeper and Kerberos
>  * Node-3: JournalNode, DataNode, NodeManager and ZooKeeper
>  * Node-4..Node-31: DataNode and NodeManager
> At one point there was a problem with the switch the nodes were connected to 
> and all the services started losing connectivity.
> # At first Kerberos stopped granting any tickets.
> # This broke the cluster, as Hadoop services could not authenticate to each 
> other.
> # At some point the ZooKeeper cluster lost its leader and started a 
> re-election.
> # This resulted in multiple ZooKeeper-related errors and warnings in the 
> ResourceManager and ZKFC logs.
> # After a while, when the issue with the switch was resolved, most of the 
> services recovered automatically.
> # "Most" except YARN:
> ## both ResourceManagers were stuck in standby mode
> ## all NodeManagers were shut down
> # I managed to recover YARN, but it required a manual restart of both 
> ResourceManagers (and starting all NodeManagers).
> I have all the logs from the incident, but the most important seem to be 
> these:
> {noformat}
> 2018-11-16 03:21:16,420 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Unregistering app attempt : appattempt_1539778834071_0622_000001
> 2018-11-16 03:21:16,424 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
>  Application finished, removing password for 
> appattempt_1539778834071_0622_000001
> 2018-11-16 03:21:16,424 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1539778834071_0622 State change from NEW to ACCEPTED on event = 
> RECOVER
> 2018-11-16 03:21:16,424 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully 
> recovered 622 out of 622 applications
> 2018-11-16 03:21:16,424 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of 
> failed attempts is 0. The max attempts is 1
> 2018-11-16 03:21:16,424 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery ended
> 2018-11-16 03:21:16,425 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> Registering app attempt : appattempt_1539778834071_0622_000002
> 2018-11-16 03:21:16,426 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1539778834071_0622_000002 State change from NEW to SUBMITTED on 
> event = START
> 2018-11-16 03:21:16,427 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager:
>  Rolling master-key for container-tokens
> 2018-11-16 03:21:16,427 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM:
>  Rolling master-key for nm-tokens
> 2018-11-16 03:21:16,427 INFO 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  Updating the current master key for generating delegation tokens
> 2018-11-16 03:21:16,427 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager:
>  storing master key with keyID 32
> 2018-11-16 03:21:16,427 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing 
> RMDTMasterKey.
> 2018-11-16 03:21:16,440 INFO 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  Starting expired delegation token remover thread, 
> tokenRemoverScanInterval=60 min(s)
> 2018-11-16 03:21:16,441 INFO 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  Updating the current master key for generating delegation tokens
> 2018-11-16 03:21:16,444 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager:
>  storing master key with keyID 33
> 2018-11-16 03:21:16,445 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing 
> RMDTMasterKey.
> 2018-11-16 03:21:16,458 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: 
> Application application_1539778834071_0622 requests cleared
> 2018-11-16 03:21:16,459 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: 
> Added Application Attempt appattempt_1539778834071_0622_000002 to scheduler 
> from user packer
> 2018-11-16 03:21:16,459 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1539778834071_0622_000002 State change from SUBMITTED to SCHEDULED 
> on event = ATTEMPT_ADDED
> 2018-11-16 03:21:16,459 INFO org.apache.hadoop.ipc.CallQueueManager: Using 
> callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 5000
> 2018-11-16 03:21:16,460 INFO org.apache.hadoop.service.AbstractService: 
> Service org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService 
> failed in state STARTED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: 
> Failed on local exception: java.net.SocketException: Unresolved address; Host 
> Details : local host is: "node-2.mydomain.com"; destination host is: 
> (unknown):0; 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: 
> Failed on local exception: java.net.SocketException: Unresolved address; Host 
> Details : local host is: "node-2.mydomain.com"; destination host is: 
> (unknown):0; 
>               at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
>               at 
> org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
>               at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
>               at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>               at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
>               at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
>               at java.security.AccessController.doPrivileged(Native Method)
>               at javax.security.auth.Subject.doAs(Subject.java:422)
>               at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
>               at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
>               at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
>               at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
>               at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: java.io.IOException: Failed on local exception: 
> java.net.SocketException: Unresolved address; Host Details : local host is: 
> "node-2.mydomain.com"; destination host is: (unknown):0; 
>               at 
> org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>               at org.apache.hadoop.ipc.Server.bind(Server.java:522)
>               at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
>               at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
>               at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
>               at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
>               at 
> org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
>               at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
>               at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
>               at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
>               ... 20 more
> Caused by: java.net.SocketException: Unresolved address
>               at sun.nio.ch.Net.translateToSocketException(Net.java:131)
>               at sun.nio.ch.Net.translateException(Net.java:157)
>               at sun.nio.ch.Net.translateException(Net.java:163)
>               at 
> sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
>               at org.apache.hadoop.ipc.Server.bind(Server.java:505)
>               ... 28 more
> Caused by: java.nio.channels.UnresolvedAddressException
>               at sun.nio.ch.Net.checkAddress(Net.java:101)
>               at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
>               at 
> sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>               ... 29 more
> 2018-11-16 03:21:16,464 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state STARTED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: 
> Failed on local exception: java.net.SocketException: Unresolved address; Host 
> Details : local host is: "node-2.mydomain.com"; destination host is: 
> (unknown):0; 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: 
> Failed on local exception: java.net.SocketException: Unresolved address; Host 
> Details : local host is: "node-2.mydomain.com"; destination host is: 
> (unknown):0; 
>               at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
>               at 
> org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
>               at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
>               at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>               at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
>               at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
>               at java.security.AccessController.doPrivileged(Native Method)
>               at javax.security.auth.Subject.doAs(Subject.java:422)
>               at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
>               at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
>               at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
>               at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
>               at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: java.io.IOException: Failed on local exception: 
> java.net.SocketException: Unresolved address; Host Details : local host is: 
> "node-2.mydomain.com"; destination host is: (unknown):0; 
>               at 
> org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>               at org.apache.hadoop.ipc.Server.bind(Server.java:522)
>               at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
>               at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
>               at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
>               at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
>               at 
> org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
>               at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
>               at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
>               at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
>               ... 20 more
> Caused by: java.net.SocketException: Unresolved address
>               at sun.nio.ch.Net.translateToSocketException(Net.java:131)
>               at sun.nio.ch.Net.translateException(Net.java:157)
>               at sun.nio.ch.Net.translateException(Net.java:163)
>               at 
> sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
>               at org.apache.hadoop.ipc.Server.bind(Server.java:505)
>               ... 28 more
> Caused by: java.nio.channels.UnresolvedAddressException
>               at sun.nio.ch.Net.checkAddress(Net.java:101)
>               at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
>               at 
> sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>               ... 29 more
> 2018-11-16 03:21:16,470 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Returning, 
> interrupted : java.lang.InterruptedException
> 2018-11-16 03:21:16,471 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer
>  thread interrupted
> 2018-11-16 03:21:16,471 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor 
> thread interrupted
> 2018-11-16 03:21:16,471 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: NMLivelinessMonitor 
> thread interrupted
> 2018-11-16 03:21:16,472 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor 
> thread interrupted
> 2018-11-16 03:21:16,472 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted
> 2018-11-16 03:21:16,473 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager 
> metrics system...
> 2018-11-16 03:21:16,475 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics 
> system stopped.
> 2018-11-16 03:21:16,475 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics 
> system shutdown complete.
> 2018-11-16 03:21:16,475 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> AsyncDispatcher is draining to stop, igonring any new events.
> 2018-11-16 03:21:16,477 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread
>  thread interrupted! Exiting!
> 2018-11-16 03:21:16,487 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x3671a89731f0000 closed
> 2018-11-16 03:21:16,488 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> 2018-11-16 03:21:16,489 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher
> 2018-11-16 03:21:16,490 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM:
>  NMTokenKeyRollingInterval: 86400000ms and NMTokenKeyActivationDelay: 900000ms
> 2018-11-16 03:21:16,490 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager:
>  ContainerTokenKeyRollingInterval: 86400000ms and 
> ContainerTokenKeyActivationDelay: 900000ms
> 2018-11-16 03:21:16,490 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
>  AMRMTokenKeyRollingInterval: 86400000ms and AMRMTokenKeyActivationDelay: 
> 900000 ms
> 2018-11-16 03:21:16,490 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreFactory: 
> Using RMStateStore implementation - class 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
> 2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreEventType 
> for class 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler
> 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEventType for 
> class org.apache.hadoop.yarn.server.resourcemanager.NodesListManager
> 2018-11-16 03:21:16,491 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Using 
> Scheduler: 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler
> 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType
>  for class 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher
> 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType for class 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher
> 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEventType
>  for class 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher
> 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType for 
> class 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher
> 2018-11-16 03:21:16,492 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: 
> loaded properties from hadoop-metrics2.properties
> 2018-11-16 03:21:16,493 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period 
> at 10 second(s).
> 2018-11-16 03:21:16,493 INFO 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics 
> system started
> 2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEventType for class 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager
> 2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Registering class 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType 
> for class 
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher
> 2018-11-16 03:21:16,494 WARN org.apache.hadoop.metrics2.util.MBeans: Failed 
> to register MBean "Hadoop:service=ResourceManager,name=RMNMInfo": Instance 
> already exists.
> 2018-11-16 03:21:16,494 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMNMInfo: Registered RMNMInfo 
> MBean
> 2018-11-16 03:21:16,494 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: 
> YARN system metrics publishing service is not enabled
> 2018-11-16 03:21:16,494 INFO org.apache.hadoop.util.HostsFileReader: 
> Refreshing hosts (include/exclude) list
> 2018-11-16 03:21:16,496 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   
> OPERATION=transitionToActive    TARGET=RMHAProtocolService      
> RESULT=FAILURE  DESCRIPTION=Exception transitioning to active   PERMISSIONS=
> 2018-11-16 03:21:16,497 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:134)
>               at 
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812)
>               at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483)
>               at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
>               at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when 
> transitioning to Active mode
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:311)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132)
>               ... 4 more
> Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.io.IOException: Failed on local exception: java.net.SocketException: 
> Unresolved address; Host Details : local host is: "node-2.mydomain.com"; 
> destination host is: (unknown):0; 
>               at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139)
>               at 
> org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65)
>               at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163)
>               at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>               at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611)
>               at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128)
>               at java.security.AccessController.doPrivileged(Native Method)
>               at javax.security.auth.Subject.doAs(Subject.java:422)
>               at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128)
>               at 
> org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306)
>               ... 5 more
> Caused by: java.io.IOException: Failed on local exception: 
> java.net.SocketException: Unresolved address; Host Details : local host is: 
> "node-2.mydomain.com"; destination host is: (unknown):0; 
>               at 
> org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>               at org.apache.hadoop.ipc.Server.bind(Server.java:522)
>               at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728)
>               at org.apache.hadoop.ipc.Server.<init>(Server.java:2449)
>               at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042)
>               at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535)
>               at 
> org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510)
>               at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887)
>               at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169)
>               at 
> org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132)
>               ... 20 more
> Caused by: java.net.SocketException: Unresolved address
>               at sun.nio.ch.Net.translateToSocketException(Net.java:131)
>               at sun.nio.ch.Net.translateException(Net.java:157)
>               at sun.nio.ch.Net.translateException(Net.java:163)
>               at 
> sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
>               at org.apache.hadoop.ipc.Server.bind(Server.java:505)
>               ... 28 more
> Caused by: java.nio.channels.UnresolvedAddressException
>               at sun.nio.ch.Net.checkAddress(Net.java:101)
>               at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
>               at 
> sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>               ... 29 more
> 2018-11-16 03:21:16,497 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Trying to re-establish ZK session
> 2018-11-16 03:21:16,511 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x36681eb8c720002 closed
> 2018-11-16 03:21:17,513 INFO org.apache.zookeeper.ZooKeeper: Initiating 
> client connection, 
> connectString=node-1.mydomain.com:2181,node-1.mydomain.com:2181,node-1.mydomain.com:2181
>  sessionTimeout=10000 
> watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@655d597b
> 2018-11-16 03:21:17,513 ERROR org.apache.zookeeper.client.StaticHostProvider: 
> Unable to connect to server: node-2.mydomain.com:2181
> java.net.UnknownHostException: node-2.mydomain.com
>               at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
>               at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>               at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>               at 
> org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60)
>               at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
>               at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
>               at 
> org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:630)
>               at 
> org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:774)
>               at 
> org.apache.hadoop.ha.ActiveStandbyElector.reEstablishSession(ActiveStandbyElector.java:749)
>               at 
> org.apache.hadoop.ha.ActiveStandbyElector.joinElectionInternal(ActiveStandbyElector.java:660)
>               at 
> org.apache.hadoop.ha.ActiveStandbyElector.reJoinElection(ActiveStandbyElector.java:688)
>               at 
> org.apache.hadoop.ha.ActiveStandbyElector.reJoinElectionAfterFailureToBecomeActive(ActiveStandbyElector.java:530)
>               at 
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:484)
>               at 
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546)
>               at 
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2018-11-16 03:21:17,559 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server node-3.mydomain.com/10.242.1.106:2181. Will not attempt 
> to authenticate using SASL (unknown error)
> 2018-11-16 03:21:17,560 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established, initiating session, client: /10.242.1.105:46773, 
> server: node-3.mydomain.com/10.242.1.106:2181
> 2018-11-16 03:21:17,573 INFO org.apache.zookeeper.ClientCnxn: Session 
> establishment complete on server node-3.mydomain.com/10.242.1.106:2181, 
> sessionid = 0x3671a89731f0003, negotiated timeout = 10000
> 2018-11-16 03:21:17,575 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Session connected.
> 2018-11-16 03:21:17,575 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x36681eb8c720002
> 2018-11-16 03:21:17,575 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> 2018-11-16 03:21:17,585 INFO org.apache.hadoop.conf.Configuration: found 
> resource yarn-site.xml at 
> file:/hadoop-2.6.0-cdh5.14.0/etc/hadoop/yarn-site.xml
> 2018-11-16 03:21:17,588 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   
> OPERATION=refreshAdminAcls      TARGET=AdminService     RESULT=SUCCESS
> 2018-11-16 03:21:17,588 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Already in 
> standby state
> 2018-11-16 03:21:17,588 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer   
> OPERATION=transitionToStandby   TARGET=RMHAProtocolService      RESULT=SUCCESS
> 2018-11-16 03:30:57,669 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Release request cache is cleaned up
> 2018-11-16 03:31:16,496 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Release request cache is cleaned up
> 2018-11-19 13:35:36,554 WARN org.apache.hadoop.security.UserGroupInformation: 
> PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM 
> (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos tgt)]
> 2018-11-19 13:35:39,353 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
> 2018-11-19 13:35:39,357 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for packer/node-2.mydomain.com@SA_REALM 
> (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
> 2018-11-19 13:35:45,785 WARN 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
> AuthenticationToken ignored: 
> org.apache.hadoop.security.authentication.util.SignerException: Invalid 
> signature
> 2018-11-21 08:29:19,995 WARN 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter: 
> AuthenticationToken ignored: 
> org.apache.hadoop.security.authentication.util.SignerException: Invalid 
> signature
> 2018-11-21 08:29:20,001 WARN org.apache.hadoop.security.UserGroupInformation: 
> PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM 
> (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed 
> [Caused by GSSException: No valid credentials provided (Mechanism level: 
> Failed to find any Kerberos tgt)]
> 2018-11-21 08:29:23,662 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
> 2018-11-21 08:29:23,666 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for packer/node-2.mydomain.com@SA_REALM 
> (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
> 2018-11-21 08:31:37,254 INFO SecurityLogger.org.apache.hadoop.ipc.Server: 
> Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS)
> 2018-11-21 08:31:37,258 INFO 
> SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager:
>  Authorization successful for packer/node-2.mydomain.com@SA_REALM 
> (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
> {noformat}
> I have found a few tickets about race conditions in YARN that show up 
> when connectivity to ZooKeeper is lost, but either they should already have 
> been fixed in 2.6.0 or the logs don't match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


