Hi folks, I am facing a strange issue. I have a Kerberized 5-node cluster with two HDFS NameNodes running in HA, say master-1 and master-2. Master-2 also hosts the YARN ResourceManager, the History Server, etc. Last week we shut down all the machines. One of them, the master node master-2, is not starting again, for as yet unknown reasons.
Today, on the edge node, I tried issuing an HDFS command (hadoop fs -ls /). It could not list anything and returned only exceptions: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB

Moreover, I see that the NameNode running on master-1 is also shutting down automatically. I started it again, but it goes down again. This is kind of strange. Here is the output of the HDFS command and a few logs. I would be grateful if someone could help with this.

Thanks,
DP

*[root@edgenode ~]# hadoop fs -ls /*
15/12/02 16:52:32 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over <Host-NameNodeHost-2>/<IP_NameNode_Host-2>:8020 after 1 fail over attempts. Trying to fail over after sleeping for 933ms.
org.apache.hadoop.net.ConnectTimeoutException: Call From <Host-NameNodeHost-1> /<NameNodeHost1-IP> to <Host-NameNodeHost-2>:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=<Host-NameNodeHost-2>/<IP_NameNode_Host-2>:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    ..........
^C15/12/02 16:52:38 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over <Host-NameNodeHost-1> /IP-Namenode-1:8020 after 2 fail over attempts. Trying to fail over after sleeping for 1527ms.
java.net.ConnectException: Call From <Host-NameNodeHost-1>/<NameNodeHost1-IP> to <Host-NameNodeHost-1> :8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

*vi /var/log/hadoop/hdfs/hadoop-hdfs-namenode-HostNameNode-1.log*
com/IP-Namenode-2:8485. Already tried 35 time(s); maxRetries=45
2015-12-02 16:27:08,639 INFO ipc.Server (Server.java:saslProcess(1383)) - Auth successful for nn/<Hostname>@KDCRealm (auth:KERBEROS)
2015-12-02 16:27:08,703 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(135)) - Authorization successful for nn/<Host-NameNodeHost-1> @KDCRealm (auth:KERBEROS) for protocol=interface org.apache.hadoop.ha.HAServiceProtocol
2015-12-02 16:27:15,223 INFO ipc.Client (Client.java:handleConnectionTimeout(835)) - Retrying connect to server: <Host-NameNodeHost-2>/<IP_NameNode_Host-2>:8485. Already tried 36 time(s); maxRetries=45
2015-12-02 16:27:17,201 INFO queue.AuditFileSpool (AuditFileSpool.java:runDoAs(780)) - Destination is down. sleeping for 30000 milli seconds. indexQueue=0, queueName=hdfs.async.summary.multi_dest.batch, consumer=hdfs.async.summary.multi_dest.batch.hdfs
2015-12-02 16:27:17,230 INFO queue.AuditFileSpool (AuditFileSpool.java:runDoAs(780)) - Destination is down. sleeping for 30000 milli seconds. indexQueue=0, queueName=hdfs.async.summary.multi_dest.batch, consumer=hdfs.async.summary.multi_dest.batch.db
2015-12-02 16:27:21,996 INFO ipc.Server (Server.java:saslProcess(1383)) - Auth successful for admin/admin@KDCRealm (auth:KERBEROS)

*vi /var/log/hadoop/hdfs/hadoop-hdfs-zkfc-HostNameNode-1.log*
2015-12-02 16:18:40,726 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(211)) - Transport-level exception trying to monitor health of NameNode at <HostNameNode-1>/<IPofNameNode-1>:8020: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<IPofNameNode-1>:58279 remote=<HostNameNode-1>/<IPofNameNode-1>:8020] Call From <HostNameNode-1>/<IPofNameNode-1> to <HostNameNode-1>:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 45000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<IPofNameNode-1>:58279 remote=<HostNameNode-1>/<IPofNameNode-1>:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
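
The logs above show two different failure modes: a connect timeout towards master-2 (ports 8020 and 8485, the latter being the default JournalNode RPC port) and a connection refused on master-1:8020. In case it helps anyone narrow this down, below is a minimal sketch of a TCP probe that tells these cases apart from the edge node. It is not part of Hadoop; the function names `probe` and `report` are made up for illustration, and it only checks plain TCP reachability, not Kerberos or RPC health.

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify one TCP connect attempt.

    Returns 'open', 'refused', 'timeout', or 'error: <details>'.
    Hypothetical helper, not a Hadoop API.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host is reachable but nothing is listening on the port
        # (matches the "Connection refused" seen for master-1:8020).
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout
        # (matches the ConnectTimeoutException seen for master-2).
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar problems.
        return f"error: {exc}"


def report(targets, timeout=5.0):
    """Probe each (host, port) pair and return a list of result lines."""
    lines = []
    for host, port in targets:
        lines.append(f"{host}:{port} -> {probe(host, port, timeout)}")
    return lines
```

To use it, call `report` with the real host names in place of the anonymised ones, for example `print("\n".join(report([("<master-1 host>", 8020), ("<master-2 host>", 8020), ("<master-2 host>", 8485)])))`. A "timeout" result points at the machine or network being down, while "refused" points at a host that is up but has no process listening on that port.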