[ https://issues.apache.org/jira/browse/YARN-9064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zbigniew Kostrzewa updated YARN-9064:
-------------------------------------
Description:
I have a 31-node Hadoop 2.6.0 cluster. The cluster is secured with Kerberos and configured in HA. The first 3 nodes hold both slave and master services:
* Node-1: NameNode, ResourceManager, JournalNode, ZKFC, MapRed Job History Server, DataNode, NodeManager, ZooKeeper and Kerberos
* Node-2: NameNode, ResourceManager, JournalNode, ZKFC, DataNode, NodeManager, ZooKeeper and Kerberos
* Node-3: JournalNode, DataNode, NodeManager and ZooKeeper
* Node-4..Node-31: DataNode and NodeManager

At one point there was a problem with the switch the nodes were connected to, and all the services started losing connectivity.
# At first, Kerberos stopped granting any tickets.
# This broke the cluster, as Hadoop services could not authenticate to each other.
# At some point the ZooKeeper cluster lost its leader and started a re-election.
# This resulted in multiple ZooKeeper-related errors and warnings in the ResourceManager and ZKFC logs.
# After a while, when the issue with the switch was resolved, most of the services recovered automatically.
# "Most" except YARN:
## both ResourceManagers were stuck in standby mode
## all NodeManagers were shut down
# I managed to recover YARN; however, it required a manual restart of both ResourceManagers (and starting all NodeManagers).

I have all the logs from the incident, but the most important seem to be these:
{noformat}
2018-11-16 03:21:16,420 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1539778834071_0622_000001 2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1539778834071_0622_000001 2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1539778834071_0622 State change from NEW to ACCEPTED on event = RECOVER 2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully recovered 622 out of 622 applications 2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of failed attempts is 0.
The max attempts is 1 2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery ended 2018-11-16 03:21:16,425 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1539778834071_0622_000002 2018-11-16 03:21:16,426 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1539778834071_0622_000002 State change from NEW to SUBMITTED on event = START 2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: Rolling master-key for container-tokens 2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Rolling master-key for nm-tokens 2018-11-16 03:21:16,427 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: storing master key with keyID 32 2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing RMDTMasterKey. 2018-11-16 03:21:16,440 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s) 2018-11-16 03:21:16,441 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2018-11-16 03:21:16,444 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: storing master key with keyID 33 2018-11-16 03:21:16,445 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing RMDTMasterKey. 
2018-11-16 03:21:16,458 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1539778834071_0622 requests cleared 2018-11-16 03:21:16,459 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: Added Application Attempt appattempt_1539778834071_0622_000002 to scheduler from user packer 2018-11-16 03:21:16,459 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1539778834071_0622_000002 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED 2018-11-16 03:21:16,459 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 5000 2018-11-16 03:21:16,460 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Server.bind(Server.java:522) at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728) at org.apache.hadoop.ipc.Server.<init>(Server.java:2449) at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535) at 
org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510) at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) ... 20 more Caused by: java.net.SocketException: Unresolved address at sun.nio.ch.Net.translateToSocketException(Net.java:131) at sun.nio.ch.Net.translateException(Net.java:157) at sun.nio.ch.Net.translateException(Net.java:163) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76) at org.apache.hadoop.ipc.Server.bind(Server.java:505) ... 28 more Caused by: java.nio.channels.UnresolvedAddressException at sun.nio.ch.Net.checkAddress(Net.java:101) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) ... 29 more 2018-11-16 03:21:16,464 INFO org.apache.hadoop.service.AbstractService: Service RMActiveServices failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163) at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Server.bind(Server.java:522) at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728) at org.apache.hadoop.ipc.Server.<init>(Server.java:2449) at 
org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535) at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510) at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) ... 20 more Caused by: java.net.SocketException: Unresolved address at sun.nio.ch.Net.translateToSocketException(Net.java:131) at sun.nio.ch.Net.translateException(Net.java:157) at sun.nio.ch.Net.translateException(Net.java:163) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76) at org.apache.hadoop.ipc.Server.bind(Server.java:505) ... 28 more Caused by: java.nio.channels.UnresolvedAddressException at sun.nio.ch.Net.checkAddress(Net.java:101) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) ... 
29 more 2018-11-16 03:21:16,470 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Returning, interrupted : java.lang.InterruptedException 2018-11-16 03:21:16,471 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer thread interrupted 2018-11-16 03:21:16,471 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted 2018-11-16 03:21:16,471 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: NMLivelinessMonitor thread interrupted 2018-11-16 03:21:16,472 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted 2018-11-16 03:21:16,472 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted 2018-11-16 03:21:16,473 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system... 2018-11-16 03:21:16,475 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped. 2018-11-16 03:21:16,475 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete. 2018-11-16 03:21:16,475 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events. 2018-11-16 03:21:16,477 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread thread interrupted! Exiting! 
2018-11-16 03:21:16,487 INFO org.apache.zookeeper.ZooKeeper: Session: 0x3671a89731f0000 closed 2018-11-16 03:21:16,488 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2018-11-16 03:21:16,489 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher 2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: NMTokenKeyRollingInterval: 86400000ms and NMTokenKeyActivationDelay: 900000ms 2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: ContainerTokenKeyRollingInterval: 86400000ms and ContainerTokenKeyActivationDelay: 900000ms 2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: AMRMTokenKeyRollingInterval: 86400000ms and AMRMTokenKeyActivationDelay: 900000 ms 2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreFactory: Using RMStateStore implementation - class org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore 2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreEventType for class org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEventType for class org.apache.hadoop.yarn.server.resourcemanager.NodesListManager 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Using Scheduler: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler 2018-11-16 03:21:16,491 INFO 
org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher 2018-11-16 03:21:16,492 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2018-11-16 03:21:16,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 
2018-11-16 03:21:16,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system started 2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEventType for class org.apache.hadoop.yarn.server.resourcemanager.RMAppManager 2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType for class org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher 2018-11-16 03:21:16,494 WARN org.apache.hadoop.metrics2.util.MBeans: Failed to register MBean "Hadoop:service=ResourceManager,name=RMNMInfo": Instance already exists. 2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.server.resourcemanager.RMNMInfo: Registered RMNMInfo MBean 2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: YARN system metrics publishing service is not enabled 2018-11-16 03:21:16,494 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list 2018-11-16 03:21:16,496 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer OPERATION=transitionToActive TARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS= 2018-11-16 03:21:16,497 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:134) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546) at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:311) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306) ... 5 more Caused by: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Server.bind(Server.java:522) at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728) at org.apache.hadoop.ipc.Server.<init>(Server.java:2449) at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535) at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510) at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) ... 20 more Caused by: java.net.SocketException: Unresolved address at sun.nio.ch.Net.translateToSocketException(Net.java:131) at sun.nio.ch.Net.translateException(Net.java:157) at sun.nio.ch.Net.translateException(Net.java:163) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76) at org.apache.hadoop.ipc.Server.bind(Server.java:505) ... 28 more Caused by: java.nio.channels.UnresolvedAddressException at sun.nio.ch.Net.checkAddress(Net.java:101) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) ... 
29 more 2018-11-16 03:21:16,497 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2018-11-16 03:21:16,511 INFO org.apache.zookeeper.ZooKeeper: Session: 0x36681eb8c720002 closed 2018-11-16 03:21:17,513 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=node-1.mydomain.com:2181,node-1.mydomain.com:2181,node-1.mydomain.com:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@655d597b 2018-11-16 03:21:17,513 ERROR org.apache.zookeeper.client.StaticHostProvider: Unable to connect to server: node-2.mydomain.com:2181 java.net.UnknownHostException: node-2.mydomain.com at java.net.InetAddress.getAllByName0(InetAddress.java:1280) at java.net.InetAddress.getAllByName(InetAddress.java:1192) at java.net.InetAddress.getAllByName(InetAddress.java:1126) at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60) at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445) at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380) at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:630) at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:774) at org.apache.hadoop.ha.ActiveStandbyElector.reEstablishSession(ActiveStandbyElector.java:749) at org.apache.hadoop.ha.ActiveStandbyElector.joinElectionInternal(ActiveStandbyElector.java:660) at org.apache.hadoop.ha.ActiveStandbyElector.reJoinElection(ActiveStandbyElector.java:688) at org.apache.hadoop.ha.ActiveStandbyElector.reJoinElectionAfterFailureToBecomeActive(ActiveStandbyElector.java:530) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:484) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 2018-11-16 03:21:17,559 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 
node-3.mydomain.com/10.242.1.106:2181. Will not attempt to authenticate using SASL (unknown error) 2018-11-16 03:21:17,560 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.242.1.105:46773, server: node-3.mydomain.com/10.242.1.106:2181 2018-11-16 03:21:17,573 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-3.mydomain.com/10.242.1.106:2181, sessionid = 0x3671a89731f0003, negotiated timeout = 10000 2018-11-16 03:21:17,575 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected. 2018-11-16 03:21:17,575 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x36681eb8c720002 2018-11-16 03:21:17,575 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2018-11-16 03:21:17,585 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/hadoop-2.6.0-cdh5.14.0/etc/hadoop/yarn-site.xml 2018-11-16 03:21:17,588 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer OPERATION=refreshAdminAcls TARGET=AdminService RESULT=SUCCESS 2018-11-16 03:21:17,588 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Already in standby state 2018-11-16 03:21:17,588 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer OPERATION=transitionToStandby TARGET=RMHAProtocolService RESULT=SUCCESS 2018-11-16 03:30:57,669 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Release request cache is cleaned up 2018-11-16 03:31:16,496 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Release request cache is cleaned up 2018-11-19 13:35:36,554 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: 
Failed to find any Kerberos tgt)] 2018-11-19 13:35:39,353 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) 2018-11-19 13:35:39,357 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol 2018-11-19 13:35:45,785 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: AuthenticationToken ignored: org.apache.hadoop.security.authentication.util.SignerException: Invalid signature 2018-11-21 08:29:19,995 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: AuthenticationToken ignored: org.apache.hadoop.security.authentication.util.SignerException: Invalid signature 2018-11-21 08:29:20,001 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] 2018-11-21 08:29:23,662 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) 2018-11-21 08:29:23,666 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol 2018-11-21 08:31:37,254 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) 2018-11-21 08:31:37,258 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol {noformat} I have 
found a few tickets about race conditions in YARN that surface when connections to ZooKeeper fail, but either they should have been fixed in 2.6.0 or the logs don't match.

was:
I have a 31-node Hadoop 2.6.0 cluster. The cluster is secured with Kerberos and configured in HA. The first 3 nodes hold both slave and master services:
* Node-1: NameNode, ResourceManager, JournalNode, ZKFC, MapRed Job History Server, DataNode, NodeManager, ZooKeeper and Kerberos
* Node-2: NameNode, ResourceManager, JournalNode, ZKFC, DataNode, NodeManager, ZooKeeper and Kerberos
* Node-3: JournalNode, DataNode, NodeManager and ZooKeeper
* Node-4..Node-31: DataNode and NodeManager

At one point there was a problem with the switch the nodes were connected to, and all the services started losing connectivity.
1. At first, Kerberos stopped granting any tickets.
2. This broke the cluster, as Hadoop services could not authenticate to each other.
3. At some point the ZooKeeper cluster lost its leader and started a re-election.
4. This resulted in multiple ZooKeeper-related errors and warnings in the ResourceManager and ZKFC logs.
5. After a while, when the issue with the switch was resolved, most of the services recovered automatically.
6. "Most" except YARN:
a. both ResourceManagers were stuck in standby mode
b. all NodeManagers were shut down
7.
I have managed to recover YARN, however it required manual restart of both ResourceManagers (and starting all NodeManagers) I have all the logs from the incident but the most important seem to be those: {noformat} 2018-11-16 03:21:16,420 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Unregistering app attempt : appattempt_1539778834071_0622_000001 2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Application finished, removing password for appattempt_1539778834071_0622_000001 2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1539778834071_0622 State change from NEW to ACCEPTED on event = RECOVER 2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully recovered 622 out of 622 applications 2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of failed attempts is 0. 
The max attempts is 1 2018-11-16 03:21:16,424 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery ended 2018-11-16 03:21:16,425 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1539778834071_0622_000002 2018-11-16 03:21:16,426 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1539778834071_0622_000002 State change from NEW to SUBMITTED on event = START 2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: Rolling master-key for container-tokens 2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: Rolling master-key for nm-tokens 2018-11-16 03:21:16,427 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: storing master key with keyID 32 2018-11-16 03:21:16,427 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing RMDTMasterKey. 2018-11-16 03:21:16,440 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s) 2018-11-16 03:21:16,441 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens 2018-11-16 03:21:16,444 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: storing master key with keyID 33 2018-11-16 03:21:16,445 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing RMDTMasterKey. 
2018-11-16 03:21:16,458 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: Application application_1539778834071_0622 requests cleared 2018-11-16 03:21:16,459 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: Added Application Attempt appattempt_1539778834071_0622_000002 to scheduler from user packer 2018-11-16 03:21:16,459 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1539778834071_0622_000002 State change from SUBMITTED to SCHEDULED on event = ATTEMPT_ADDED 2018-11-16 03:21:16,459 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 5000 2018-11-16 03:21:16,460 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Server.bind(Server.java:522) at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728) at org.apache.hadoop.ipc.Server.<init>(Server.java:2449) at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535) at 
org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510) at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) ... 20 more Caused by: java.net.SocketException: Unresolved address at sun.nio.ch.Net.translateToSocketException(Net.java:131) at sun.nio.ch.Net.translateException(Net.java:157) at sun.nio.ch.Net.translateException(Net.java:163) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76) at org.apache.hadoop.ipc.Server.bind(Server.java:505) ... 28 more Caused by: java.nio.channels.UnresolvedAddressException at sun.nio.ch.Net.checkAddress(Net.java:101) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) ... 29 more 2018-11-16 03:21:16,464 INFO org.apache.hadoop.service.AbstractService: Service RMActiveServices failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163) at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Server.bind(Server.java:522) at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728) at org.apache.hadoop.ipc.Server.<init>(Server.java:2449) at 
org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535) at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510) at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) ... 20 more Caused by: java.net.SocketException: Unresolved address at sun.nio.ch.Net.translateToSocketException(Net.java:131) at sun.nio.ch.Net.translateException(Net.java:157) at sun.nio.ch.Net.translateException(Net.java:163) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76) at org.apache.hadoop.ipc.Server.bind(Server.java:505) ... 28 more Caused by: java.nio.channels.UnresolvedAddressException at sun.nio.ch.Net.checkAddress(Net.java:101) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) ... 
29 more 2018-11-16 03:21:16,470 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Returning, interrupted : java.lang.InterruptedException 2018-11-16 03:21:16,471 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer thread interrupted 2018-11-16 03:21:16,471 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted 2018-11-16 03:21:16,471 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: NMLivelinessMonitor thread interrupted 2018-11-16 03:21:16,472 INFO org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor thread interrupted 2018-11-16 03:21:16,472 ERROR org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted 2018-11-16 03:21:16,473 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager metrics system... 2018-11-16 03:21:16,475 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system stopped. 2018-11-16 03:21:16,475 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system shutdown complete. 2018-11-16 03:21:16,475 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher is draining to stop, igonring any new events. 2018-11-16 03:21:16,477 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread thread interrupted! Exiting! 
2018-11-16 03:21:16,487 INFO org.apache.zookeeper.ZooKeeper: Session: 0x3671a89731f0000 closed 2018-11-16 03:21:16,488 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2018-11-16 03:21:16,489 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher 2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: NMTokenKeyRollingInterval: 86400000ms and NMTokenKeyActivationDelay: 900000ms 2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: ContainerTokenKeyRollingInterval: 86400000ms and ContainerTokenKeyActivationDelay: 900000ms 2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: AMRMTokenKeyRollingInterval: 86400000ms and AMRMTokenKeyActivationDelay: 900000 ms 2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreFactory: Using RMStateStore implementation - class org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore 2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreEventType for class org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEventType for class org.apache.hadoop.yarn.server.resourcemanager.NodesListManager 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Using Scheduler: org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler 2018-11-16 03:21:16,491 INFO 
org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType for class org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher 2018-11-16 03:21:16,492 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2018-11-16 03:21:16,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 
2018-11-16 03:21:16,493 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics system started 2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEventType for class org.apache.hadoop.yarn.server.resourcemanager.RMAppManager 2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType for class org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher 2018-11-16 03:21:16,494 WARN org.apache.hadoop.metrics2.util.MBeans: Failed to register MBean "Hadoop:service=ResourceManager,name=RMNMInfo": Instance already exists. 2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.server.resourcemanager.RMNMInfo: Registered RMNMInfo MBean 2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: YARN system metrics publishing service is not enabled 2018-11-16 03:21:16,494 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list 2018-11-16 03:21:16,496 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer OPERATION=transitionToActive TARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS= 2018-11-16 03:21:16,497 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:134) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546) at 
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:311) at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132) ... 4 more Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917) at 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306) ... 5 more Caused by: java.io.IOException: Failed on local exception: java.net.SocketException: Unresolved address; Host Details : local host is: "node-2.mydomain.com"; destination host is: (unknown):0; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) at org.apache.hadoop.ipc.Server.bind(Server.java:522) at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728) at org.apache.hadoop.ipc.Server.<init>(Server.java:2449) at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535) at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510) at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) ... 20 more Caused by: java.net.SocketException: Unresolved address at sun.nio.ch.Net.translateToSocketException(Net.java:131) at sun.nio.ch.Net.translateException(Net.java:157) at sun.nio.ch.Net.translateException(Net.java:163) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76) at org.apache.hadoop.ipc.Server.bind(Server.java:505) ... 28 more Caused by: java.nio.channels.UnresolvedAddressException at sun.nio.ch.Net.checkAddress(Net.java:101) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) ... 
29 more 2018-11-16 03:21:16,497 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session 2018-11-16 03:21:16,511 INFO org.apache.zookeeper.ZooKeeper: Session: 0x36681eb8c720002 closed 2018-11-16 03:21:17,513 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=node-1.mydomain.com:2181,node-1.mydomain.com:2181,node-1.mydomain.com:2181 sessionTimeout=10000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@655d597b 2018-11-16 03:21:17,513 ERROR org.apache.zookeeper.client.StaticHostProvider: Unable to connect to server: node-2.mydomain.com:2181 java.net.UnknownHostException: node-2.mydomain.com at java.net.InetAddress.getAllByName0(InetAddress.java:1280) at java.net.InetAddress.getAllByName(InetAddress.java:1192) at java.net.InetAddress.getAllByName(InetAddress.java:1126) at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60) at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445) at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380) at org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:630) at org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:774) at org.apache.hadoop.ha.ActiveStandbyElector.reEstablishSession(ActiveStandbyElector.java:749) at org.apache.hadoop.ha.ActiveStandbyElector.joinElectionInternal(ActiveStandbyElector.java:660) at org.apache.hadoop.ha.ActiveStandbyElector.reJoinElection(ActiveStandbyElector.java:688) at org.apache.hadoop.ha.ActiveStandbyElector.reJoinElectionAfterFailureToBecomeActive(ActiveStandbyElector.java:530) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:484) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) 2018-11-16 03:21:17,559 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 
node-3.mydomain.com/10.242.1.106:2181. Will not attempt to authenticate using SASL (unknown error) 2018-11-16 03:21:17,560 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.242.1.105:46773, server: node-3.mydomain.com/10.242.1.106:2181 2018-11-16 03:21:17,573 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server node-3.mydomain.com/10.242.1.106:2181, sessionid = 0x3671a89731f0003, negotiated timeout = 10000 2018-11-16 03:21:17,575 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected. 2018-11-16 03:21:17,575 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x36681eb8c720002 2018-11-16 03:21:17,575 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2018-11-16 03:21:17,585 INFO org.apache.hadoop.conf.Configuration: found resource yarn-site.xml at file:/hadoop-2.6.0-cdh5.14.0/etc/hadoop/yarn-site.xml 2018-11-16 03:21:17,588 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer OPERATION=refreshAdminAcls TARGET=AdminService RESULT=SUCCESS 2018-11-16 03:21:17,588 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Already in standby state 2018-11-16 03:21:17,588 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer OPERATION=transitionToStandby TARGET=RMHAProtocolService RESULT=SUCCESS 2018-11-16 03:30:57,669 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Release request cache is cleaned up 2018-11-16 03:31:16,496 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Release request cache is cleaned up 2018-11-19 13:35:36,554 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: 
Failed to find any Kerberos tgt)] 2018-11-19 13:35:39,353 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) 2018-11-19 13:35:39,357 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol 2018-11-19 13:35:45,785 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: AuthenticationToken ignored: org.apache.hadoop.security.authentication.util.SignerException: Invalid signature 2018-11-21 08:29:19,995 WARN org.apache.hadoop.security.authentication.server.AuthenticationFilter: AuthenticationToken ignored: org.apache.hadoop.security.authentication.util.SignerException: Invalid signature 2018-11-21 08:29:20,001 WARN org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] 2018-11-21 08:29:23,662 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) 2018-11-21 08:29:23,666 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol 2018-11-21 08:31:37,254 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) 2018-11-21 08:31:37,258 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol {noformat} I have 
found a few tickets about race conditions in YARN that surface when connections to ZooKeeper fail, but either they should have been fixed in 2.6.0 or the logs don't match. > Both Resource Managers stay in standby after connection to ZooKeeper was > recovered > ---------------------------------------------------------------------------------- > > Key: YARN-9064 > URL: https://issues.apache.org/jira/browse/YARN-9064 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, yarn > Affects Versions: 2.6.0 > Environment: * cluster of 31 nodes > * each node is a VM with 60GB of RAM and 8 vcpus > * each VM is running CentOS 7.2 with Hadoop 2.6.0 > * Hadoop cluster is secured with Kerberos > * Hadoop cluster is configured with HA > Reporter: Zbigniew Kostrzewa > Priority: Major > > I have a 31-node Hadoop 2.6.0 cluster. The cluster is secured with Kerberos > and configured in HA. The first 3 nodes hold both slave and master services: > * Node-1: NameNode, ResourceManager, JournalNode, ZKFC, MapRed Job History > Server, DataNode, NodeManager, ZooKeeper and Kerberos > * Node-2: NameNode, ResourceManager, JournalNode, ZKFC, DataNode, > NodeManager, ZooKeeper and Kerberos > * Node-3: JournalNode, DataNode, NodeManager and ZooKeeper > * Node-4..Node-31: DataNode and NodeManager > At one point there was a problem with the switch the nodes were connected to, > and all the services started losing connectivity. > # At first Kerberos stopped granting any tickets. > # This broke the cluster, as Hadoop services could not authenticate to each > other. > # At some point the ZooKeeper cluster lost its leader and started re-election. > # This resulted in multiple ZooKeeper-related errors and warnings in the > ResourceManager and ZKFC logs.
> # After a while, when the issue with the switch was resolved, most of the services > recovered automatically. > # "Most" except YARN: > ## both ResourceManagers were stuck in standby mode > ## all NodeManagers were shut down > # I managed to recover YARN; however, it required a manual restart of both > ResourceManagers (and starting all NodeManagers). > I have all the logs from the incident, but the most important seem to be these: > {noformat} > 2018-11-16 03:21:16,420 INFO > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: > Unregistering app attempt : appattempt_1539778834071_0622_000001 > 2018-11-16 03:21:16,424 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: > Application finished, removing password for > appattempt_1539778834071_0622_000001 > 2018-11-16 03:21:16,424 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: > application_1539778834071_0622 State change from NEW to ACCEPTED on event = > RECOVER > 2018-11-16 03:21:16,424 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully > recovered 622 out of 622 applications > 2018-11-16 03:21:16,424 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The number of > failed attempts is 0.
The max attempts is 1 > 2018-11-16 03:21:16,424 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery ended > 2018-11-16 03:21:16,425 INFO > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: > Registering app attempt : appattempt_1539778834071_0622_000002 > 2018-11-16 03:21:16,426 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > appattempt_1539778834071_0622_000002 State change from NEW to SUBMITTED on > event = START > 2018-11-16 03:21:16,427 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: > Rolling master-key for container-tokens > 2018-11-16 03:21:16,427 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: > Rolling master-key for nm-tokens > 2018-11-16 03:21:16,427 INFO > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: > Updating the current master key for generating delegation tokens > 2018-11-16 03:21:16,427 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: > storing master key with keyID 32 > 2018-11-16 03:21:16,427 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing > RMDTMasterKey. > 2018-11-16 03:21:16,440 INFO > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: > Starting expired delegation token remover thread, > tokenRemoverScanInterval=60 min(s) > 2018-11-16 03:21:16,441 INFO > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: > Updating the current master key for generating delegation tokens > 2018-11-16 03:21:16,444 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager: > storing master key with keyID 33 > 2018-11-16 03:21:16,445 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing > RMDTMasterKey. 
> 2018-11-16 03:21:16,458 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo: > Application application_1539778834071_0622 requests cleared > 2018-11-16 03:21:16,459 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: > Added Application Attempt appattempt_1539778834071_0622_000002 to scheduler > from user packer > 2018-11-16 03:21:16,459 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: > appattempt_1539778834071_0622_000002 State change from SUBMITTED to SCHEDULED > on event = ATTEMPT_ADDED > 2018-11-16 03:21:16,459 INFO org.apache.hadoop.ipc.CallQueueManager: Using > callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 5000 > 2018-11-16 03:21:16,460 INFO org.apache.hadoop.service.AbstractService: > Service org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService > failed in state STARTED; cause: > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: > Failed on local exception: java.net.SocketException: Unresolved address; Host > Details : local host is: "node-2.mydomain.com"; destination host is: > (unknown):0; > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: > Failed on local exception: java.net.SocketException: Unresolved address; Host > Details : local host is: "node-2.mydomain.com"; destination host is: > (unknown):0; > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139) > at > org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) > at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) > at > 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306) > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132) > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812) > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546) > at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: java.io.IOException: Failed on local exception: > java.net.SocketException: Unresolved address; Host Details : local host is: > "node-2.mydomain.com"; destination host is: (unknown):0; > at > org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Server.bind(Server.java:522) > at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728) > at org.apache.hadoop.ipc.Server.<init>(Server.java:2449) > at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042) > at > 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535) > at > org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510) > at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887) > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) > ... 20 more > Caused by: java.net.SocketException: Unresolved address > at sun.nio.ch.Net.translateToSocketException(Net.java:131) > at sun.nio.ch.Net.translateException(Net.java:157) > at sun.nio.ch.Net.translateException(Net.java:163) > at > sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76) > at org.apache.hadoop.ipc.Server.bind(Server.java:505) > ... 28 more > Caused by: java.nio.channels.UnresolvedAddressException > at sun.nio.ch.Net.checkAddress(Net.java:101) > at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218) > at > sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > ... 
29 more > 2018-11-16 03:21:16,464 INFO org.apache.hadoop.service.AbstractService: > Service RMActiveServices failed in state STARTED; cause: > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: > Failed on local exception: java.net.SocketException: Unresolved address; Host > Details : local host is: "node-2.mydomain.com"; destination host is: > (unknown):0; > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: > Failed on local exception: java.net.SocketException: Unresolved address; Host > Details : local host is: "node-2.mydomain.com"; destination host is: > (unknown):0; > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139) > at > org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) > at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917) > at > 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306) > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132) > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812) > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546) > at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: java.io.IOException: Failed on local exception: > java.net.SocketException: Unresolved address; Host Details : local host is: > "node-2.mydomain.com"; destination host is: (unknown):0; > at > org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Server.bind(Server.java:522) > at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728) > at org.apache.hadoop.ipc.Server.<init>(Server.java:2449) > at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535) > at > org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510) > at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887) > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) > ... 
20 more > Caused by: java.net.SocketException: Unresolved address > at sun.nio.ch.Net.translateToSocketException(Net.java:131) > at sun.nio.ch.Net.translateException(Net.java:157) > at sun.nio.ch.Net.translateException(Net.java:163) > at > sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76) > at org.apache.hadoop.ipc.Server.bind(Server.java:505) > ... 28 more > Caused by: java.nio.channels.UnresolvedAddressException > at sun.nio.ch.Net.checkAddress(Net.java:101) > at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218) > at > sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > ... 29 more > 2018-11-16 03:21:16,470 ERROR > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Returning, > interrupted : java.lang.InterruptedException > 2018-11-16 03:21:16,471 INFO > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.ContainerAllocationExpirer > thread interrupted > 2018-11-16 03:21:16,471 INFO > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor > thread interrupted > 2018-11-16 03:21:16,471 INFO > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: NMLivelinessMonitor > thread interrupted > 2018-11-16 03:21:16,472 INFO > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: AMLivelinessMonitor > thread interrupted > 2018-11-16 03:21:16,472 ERROR > org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: > ExpiredTokenRemover received java.lang.InterruptedException: sleep > interrupted > 2018-11-16 03:21:16,473 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping ResourceManager > metrics system... > 2018-11-16 03:21:16,475 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics > system stopped. > 2018-11-16 03:21:16,475 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics > system shutdown complete. 
> 2018-11-16 03:21:16,475 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: > AsyncDispatcher is draining to stop, igonring any new events. > 2018-11-16 03:21:16,477 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread > thread interrupted! Exiting! > 2018-11-16 03:21:16,487 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x3671a89731f0000 closed > 2018-11-16 03:21:16,488 INFO org.apache.zookeeper.ClientCnxn: EventThread > shut down > 2018-11-16 03:21:16,489 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: > Registering class > org.apache.hadoop.yarn.server.resourcemanager.RMFatalEventType for class > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher > 2018-11-16 03:21:16,490 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.NMTokenSecretManagerInRM: > NMTokenKeyRollingInterval: 86400000ms and NMTokenKeyActivationDelay: 900000ms > 2018-11-16 03:21:16,490 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager: > ContainerTokenKeyRollingInterval: 86400000ms and > ContainerTokenKeyActivationDelay: 900000ms > 2018-11-16 03:21:16,490 INFO > org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: > AMRMTokenKeyRollingInterval: 86400000ms and AMRMTokenKeyActivationDelay: > 900000 ms > 2018-11-16 03:21:16,490 INFO > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreFactory: > Using RMStateStore implementation - class > org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore > 2018-11-16 03:21:16,490 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: > Registering class > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStoreEventType > for class > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler > 2018-11-16 03:21:16,491 INFO 
org.apache.hadoop.yarn.event.AsyncDispatcher: > Registering class > org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEventType for > class org.apache.hadoop.yarn.server.resourcemanager.NodesListManager > 2018-11-16 03:21:16,491 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Using > Scheduler: > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler > 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: > Registering class > org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType > for class > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher > 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: > Registering class > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppEventType for class > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher > 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: > Registering class > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEventType > for class > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher > 2018-11-16 03:21:16,491 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: > Registering class > org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeEventType for > class > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$NodeEventDispatcher > 2018-11-16 03:21:16,492 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: > loaded properties from hadoop-metrics2.properties > 2018-11-16 03:21:16,493 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period > at 10 second(s). 
> 2018-11-16 03:21:16,493 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ResourceManager metrics > system started > 2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: > Registering class > org.apache.hadoop.yarn.server.resourcemanager.RMAppManagerEventType for class > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager > 2018-11-16 03:21:16,494 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: > Registering class > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncherEventType > for class > org.apache.hadoop.yarn.server.resourcemanager.amlauncher.ApplicationMasterLauncher > 2018-11-16 03:21:16,494 WARN org.apache.hadoop.metrics2.util.MBeans: Failed > to register MBean "Hadoop:service=ResourceManager,name=RMNMInfo": Instance > already exists. > 2018-11-16 03:21:16,494 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMNMInfo: Registered RMNMInfo > MBean > 2018-11-16 03:21:16,494 INFO > org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher: > YARN system metrics publishing service is not enabled > 2018-11-16 03:21:16,494 INFO org.apache.hadoop.util.HostsFileReader: > Refreshing hosts (include/exclude) list > 2018-11-16 03:21:16,496 WARN > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer > OPERATION=transitionToActive TARGET=RMHAProtocolService > RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS= > 2018-11-16 03:21:16,497 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:134) > at > org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:812) > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:483) > at > 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546) > at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when > transitioning to Active mode > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:311) > at > org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:132) > ... 4 more > Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > java.io.IOException: Failed on local exception: java.net.SocketException: > Unresolved address; Host Details : local host is: "node-2.mydomain.com"; > destination host is: (unknown):0; > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:139) > at > org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) > at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService.serviceStart(ResourceTrackerService.java:163) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:611) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1091) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1132) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1128) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1128) > at > org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:306) > ... 5 more > Caused by: java.io.IOException: Failed on local exception: > java.net.SocketException: Unresolved address; Host Details : local host is: > "node-2.mydomain.com"; destination host is: (unknown):0; > at > org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Server.bind(Server.java:522) > at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:728) > at org.apache.hadoop.ipc.Server.<init>(Server.java:2449) > at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1042) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:535) > at > org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:510) > at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:887) > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) > ... 20 more > Caused by: java.net.SocketException: Unresolved address > at sun.nio.ch.Net.translateToSocketException(Net.java:131) > at sun.nio.ch.Net.translateException(Net.java:157) > at sun.nio.ch.Net.translateException(Net.java:163) > at > sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76) > at org.apache.hadoop.ipc.Server.bind(Server.java:505) > ... 28 more > Caused by: java.nio.channels.UnresolvedAddressException > at sun.nio.ch.Net.checkAddress(Net.java:101) > at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218) > at > sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > ... 
29 more > 2018-11-16 03:21:16,497 INFO org.apache.hadoop.ha.ActiveStandbyElector: > Trying to re-establish ZK session > 2018-11-16 03:21:16,511 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x36681eb8c720002 closed > 2018-11-16 03:21:17,513 INFO org.apache.zookeeper.ZooKeeper: Initiating > client connection, > connectString=node-1.mydomain.com:2181,node-1.mydomain.com:2181,node-1.mydomain.com:2181 > sessionTimeout=10000 > watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@655d597b > 2018-11-16 03:21:17,513 ERROR org.apache.zookeeper.client.StaticHostProvider: > Unable to connect to server: node-2.mydomain.com:2181 > java.net.UnknownHostException: node-2.mydomain.com > at java.net.InetAddress.getAllByName0(InetAddress.java:1280) > at java.net.InetAddress.getAllByName(InetAddress.java:1192) > at java.net.InetAddress.getAllByName(InetAddress.java:1126) > at > org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60) > at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445) > at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380) > at > org.apache.hadoop.ha.ActiveStandbyElector.getNewZooKeeper(ActiveStandbyElector.java:630) > at > org.apache.hadoop.ha.ActiveStandbyElector.createConnection(ActiveStandbyElector.java:774) > at > org.apache.hadoop.ha.ActiveStandbyElector.reEstablishSession(ActiveStandbyElector.java:749) > at > org.apache.hadoop.ha.ActiveStandbyElector.joinElectionInternal(ActiveStandbyElector.java:660) > at > org.apache.hadoop.ha.ActiveStandbyElector.reJoinElection(ActiveStandbyElector.java:688) > at > org.apache.hadoop.ha.ActiveStandbyElector.reJoinElectionAfterFailureToBecomeActive(ActiveStandbyElector.java:530) > at > org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:484) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:546) > at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) > 2018-11-16 03:21:17,559 INFO 
org.apache.zookeeper.ClientCnxn: Opening socket > connection to server node-3.mydomain.com/10.242.1.106:2181. Will not attempt > to authenticate using SASL (unknown error) > 2018-11-16 03:21:17,560 INFO org.apache.zookeeper.ClientCnxn: Socket > connection established, initiating session, client: /10.242.1.105:46773, > server: node-3.mydomain.com/10.242.1.106:2181 > 2018-11-16 03:21:17,573 INFO org.apache.zookeeper.ClientCnxn: Session > establishment complete on server node-3.mydomain.com/10.242.1.106:2181, > sessionid = 0x3671a89731f0003, negotiated timeout = 10000 > 2018-11-16 03:21:17,575 INFO org.apache.hadoop.ha.ActiveStandbyElector: > Session connected. > 2018-11-16 03:21:17,575 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Ignoring stale result from old client with sessionId 0x36681eb8c720002 > 2018-11-16 03:21:17,575 INFO org.apache.zookeeper.ClientCnxn: EventThread > shut down > 2018-11-16 03:21:17,585 INFO org.apache.hadoop.conf.Configuration: found > resource yarn-site.xml at > file:/hadoop-2.6.0-cdh5.14.0/etc/hadoop/yarn-site.xml > 2018-11-16 03:21:17,588 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer > OPERATION=refreshAdminAcls TARGET=AdminService RESULT=SUCCESS > 2018-11-16 03:21:17,588 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Already in > standby state > 2018-11-16 03:21:17,588 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=packer > OPERATION=transitionToStandby TARGET=RMHAProtocolService RESULT=SUCCESS > 2018-11-16 03:30:57,669 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: > Release request cache is cleaned up > 2018-11-16 03:31:16,496 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: > Release request cache is cleaned up > 2018-11-19 13:35:36,554 WARN org.apache.hadoop.security.UserGroupInformation: > PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM > (auth:KERBEROS) 
cause:javax.security.sasl.SaslException: GSS initiate failed > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos tgt)] > 2018-11-19 13:35:39,353 INFO SecurityLogger.org.apache.hadoop.ipc.Server: > Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) > 2018-11-19 13:35:39,357 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for packer/node-2.mydomain.com@SA_REALM > (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol > 2018-11-19 13:35:45,785 WARN > org.apache.hadoop.security.authentication.server.AuthenticationFilter: > AuthenticationToken ignored: > org.apache.hadoop.security.authentication.util.SignerException: Invalid > signature > 2018-11-21 08:29:19,995 WARN > org.apache.hadoop.security.authentication.server.AuthenticationFilter: > AuthenticationToken ignored: > org.apache.hadoop.security.authentication.util.SignerException: Invalid > signature > 2018-11-21 08:29:20,001 WARN org.apache.hadoop.security.UserGroupInformation: > PriviledgedActionException as:packer/node-2.mydomain.com@SA_REALM > (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed > [Caused by GSSException: No valid credentials provided (Mechanism level: > Failed to find any Kerberos tgt)] > 2018-11-21 08:29:23,662 INFO SecurityLogger.org.apache.hadoop.ipc.Server: > Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) > 2018-11-21 08:29:23,666 INFO > SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for packer/node-2.mydomain.com@SA_REALM > (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol > 2018-11-21 08:31:37,254 INFO SecurityLogger.org.apache.hadoop.ipc.Server: > Auth successful for packer/node-2.mydomain.com@SA_REALM (auth:KERBEROS) > 2018-11-21 08:31:37,258 INFO > 
SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: > Authorization successful for packer/node-2.mydomain.com@SA_REALM > (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
> {noformat}
> I have found a few tickets about race conditions in YARN that surface when connections to ZooKeeper fail, but either they should have been fixed in 2.6.0 or the logs don't match.
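The innermost cause in the traces is `java.nio.channels.UnresolvedAddressException`, thrown from `ServerSocketChannelImpl.bind`. The following is a minimal sketch of how that exception can arise in plain Java, assuming (the logs do not confirm this) that the bind address was built while the hostname could not be resolved. The class `UnresolvedBindDemo` and its helper method are hypothetical illustrations, not Hadoop code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.UnresolvedAddressException;

// Hypothetical demo class; not part of Hadoop or YARN.
class UnresolvedBindDemo {

    /** Returns true if binding a server channel to the address fails because it is unresolved. */
    static boolean bindFailsAsUnresolved(InetSocketAddress address) {
        try (ServerSocketChannel channel = ServerSocketChannel.open()) {
            channel.bind(address);
            return false;
        } catch (UnresolvedAddressException e) {
            // Same exception type as the innermost "Caused by" in the log.
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // createUnresolved performs no DNS lookup, which stands in for a
        // lookup that failed while the network was down (an assumption).
        InetSocketAddress address =
                InetSocketAddress.createUnresolved("node-2.mydomain.com", 0);
        System.out.println("unresolved: " + address.isUnresolved());
        System.out.println("bind failed as unresolved: " + bindFailsAsUnresolved(address));
    }
}
```

If this reading applies, the failure would depend on name resolution at the moment the address object was created, not at the moment of the bind, which may be worth checking against the timing of the `UnknownHostException` logged at 03:21:17.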