Good to hear the problem is resolved and that you are able to continue.

Regards,
Rakesh
On Tue, Jul 19, 2016 at 10:31 PM, Alexandr Porunov <alexandr.poru...@gmail.com> wrote:

> Rakesh,
>
> Thank you very much. I missed it. I didn't have the "fuser" command on my
> nodes. I've just installed it, and ZKFC now works properly!
>
> Best regards,
> Alexandr
>
> On Tue, Jul 19, 2016 at 5:29 PM, Rakesh Radhakrishnan <rake...@apache.org> wrote:
>
>> Hi Alexandr,
>>
>> I can see the following warning message in your logs; it is the reason
>> for the unsuccessful fencing. Could you please check that the 'fuser'
>> command executes correctly on your system?
>>
>> 2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort: PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser: command not found
>> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc: 127
>> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Disconnecting from hadoopActiveMaster port 22
>>
>> Also, I'd suggest visiting
>> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
>> to learn more about the fencing logic. On that page, search for the
>> "*dfs.ha.fencing.methods*" configuration.
>>
>> Regards,
>> Rakesh
>>
>> On Tue, Jul 19, 2016 at 7:22 PM, Alexandr Porunov <alexandr.poru...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I have a problem with ZKFC.
>>> I have configured High Availability for Hadoop with QJM.
>>> The problem is that when I turn off the active master node (or kill the
>>> namenode process), the standby node does not change its status from
>>> standby to active; it remains the standby node.
>>>
>>> I was watching the ZKFC log file when I turned off the active node.
>>> The ZKFC started trying to connect to the active node (which had already
>>> died) in order to change its status from active to standby. But since the
>>> active node was dead, connecting to it was impossible.
>>> Then I turned the active master node back on.
>>> After that, my standby node connected to the old active master node,
>>> changed the old node's status from active to standby, and changed its
>>> own status from standby to active.
>>>
>>> It is really strange. After the crash of the active node, the ZKFC
>>> tries to connect to the dead node, and until that connection is
>>> established it refuses to change the status of the standby node to
>>> active.
>>>
>>> Why does this happen?
>>>
>>> Here is my log from ZKFC (I cut it because it repeats all the time;
>>> after this part the logger writes the same thing):
>>>
>>> 2016-07-19 14:43:21,943 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
>>> 2016-07-19 14:43:21,957 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a0a68612d636c757374657212036e6e311a12686164 6f6f704163746976654d617374657220d43e28d33e
>>> 2016-07-19 14:43:21,978 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at hadoopActiveMaster/192.168.0.80:8020
>>> 2016-07-19 14:43:22,995 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoopActiveMaster/192.168.0.80:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
>>> 2016-07-19 14:43:23,001 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at hadoopActiveMaster/192.168.0.80:8020 standby (unable to connect)
>>> java.net.ConnectException: Call From hadoopStandby/192.168.0.81 to hadoopActiveMaster:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
>>>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>>>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1479)
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1412)
>>>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>>>     at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
>>>     at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
>>>     at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
>>>     at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
>>>     at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>>>     at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>>>     at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>>>     at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>>>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>>>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>>> Caused by: java.net.ConnectException: Connection refused
>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>>>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
>>>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>>>     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
>>>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
>>>     at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
>>>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
>>>     at org.apache.hadoop.ipc.Client.call(Client.java:1451)
>>>     ... 14 more
>>> 2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
>>> 2016-07-19 14:43:23,007 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
>>> 2016-07-19 14:43:23,064 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to hadoopActiveMaster...
>>> 2016-07-19 14:43:23,066 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to hadoopActiveMaster port 22
>>> 2016-07-19 14:43:23,073 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connection established
>>> 2016-07-19 14:43:23,088 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Remote version string: SSH-2.0-OpenSSH_6.6.1
>>> 2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Local version string: SSH-2.0-JSCH-0.1.42
>>> 2016-07-19 14:43:23,089 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: CheckCiphers: aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256
>>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-ctr is not available.
>>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-ctr is not available.
>>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-cbc is not available.
>>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-cbc is not available.
>>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: arcfour256 is not available.
>>> 2016-07-19 14:43:23,445 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT sent
>>> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT received
>>> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: server->client aes128-ctr hmac-md5 none
>>> 2016-07-19 14:43:23,446 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: client->server aes128-ctr hmac-md5 none
>>> 2016-07-19 14:43:23,478 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXDH_INIT sent
>>> 2016-07-19 14:43:23,479 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: expecting SSH_MSG_KEXDH_REPLY
>>> 2016-07-19 14:43:23,493 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: ssh_rsa_verify: signature true
>>> 2016-07-19 14:43:23,495 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Permanently added 'hadoopActiveMaster' (RSA) to the list of known hosts.
>>> 2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_NEWKEYS sent
>>> 2016-07-19 14:43:23,495 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_NEWKEYS received
>>> 2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_REQUEST sent
>>> 2016-07-19 14:43:23,519 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_ACCEPT received
>>> 2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentications that can continue: gssapi-with-mic,publickey,keyboard-interactive,password
>>> 2016-07-19 14:43:23,524 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next authentication method: gssapi-with-mic
>>> 2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentications that can continue: publickey,keyboard-interactive,password
>>> 2016-07-19 14:43:23,527 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next authentication method: publickey
>>> 2016-07-19 14:43:23,617 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentication succeeded (publickey).
>>> 2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connected to hadoopActiveMaster
>>> 2016-07-19 14:43:23,624 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Looking for process running on port 8020
>>> 2016-07-19 14:43:23,705 WARN org.apache.hadoop.ha.SshFenceByTcpPort: PATH=$PATH:/sbin:/usr/sbin fuser -v -k -n tcp 8020 via ssh: bash: fuser: command not found
>>> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort: rc: 127
>>> 2016-07-19 14:43:23,706 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Disconnecting from hadoopActiveMaster port 22
>>> 2016-07-19 14:43:23,717 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
>>> 2016-07-19 14:43:23,718 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
>>> 2016-07-19 14:43:23,719 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
>>> java.lang.RuntimeException: Unable to fence NameNode at hadoopActiveMaster/192.168.0.80:8020
>>>     at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
>>>     at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
>>>     at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
>>>     at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
>>>     at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
>>>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
>>>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>>> 2016-07-19 14:43:23,719 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
>>> 2016-07-19 14:43:23,725 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Caught an exception, leaving main loop due to Socket closed
>>> 2016-07-19 14:43:23,746 INFO org.apache.zookeeper.ZooKeeper: Session: 0x35602bbb71e0002 closed
>>> 2016-07-19 14:43:24,750 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hadoopActiveMaster:2181,hadoopStandby:2181,hadoopSlave1:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@6a02f3d6
>>> 2016-07-19 14:43:24,760 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoopActiveMaster/192.168.0.80:2181. Will not attempt to authenticate using SASL (unknown error)
>>> 2016-07-19 14:43:24,762 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoopActiveMaster/192.168.0.80:2181, initiating session
>>> 2016-07-19 14:43:24,773 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoopActiveMaster/192.168.0.80:2181, sessionid = 0x15602bba9e00003, negotiated timeout = 5000
>>> 2016-07-19 14:43:24,778 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
>>> 2016-07-19 14:43:24,782 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
>>>
>>> Please help me solve this problem with my Hadoop HA configuration.
>>>
>>> Sincerely,
>>> Alexandr
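For reference, the fencing setup Rakesh points to lives in hdfs-site.xml. Here is a sketch of a typical configuration; the property names are the real Hadoop keys documented in the linked HA guide, but the private-key path and the `shell(/bin/true)` fallback are illustrative assumptions, not values taken from this thread:

```xml
<!-- hdfs-site.xml (sketch): sshfence ssh-es into the old active and runs
     'fuser -v -k -n tcp <port>' there, so passwordless ssh AND the fuser
     binary must both be available on every NameNode host. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- Methods are newline-separated and tried in order. shell(/bin/true)
       is an illustrative last resort that treats fencing as successful
       when sshfence fails. -->
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <!-- assumption: adjust to the key the ZKFC's user actually holds -->
  <value>/home/hadoop/.ssh/id_rsa</value>
</property>
```

Note that the `shell(/bin/true)` fallback effectively skips real fencing when ssh fails, trading split-brain safety for availability; whether that trade-off is acceptable depends on your deployment.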
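For anyone else who hits this: the root cause in the thread is that the sshfence method ssh-es into the old active NameNode and runs `fuser -v -k -n tcp 8020` there, and `fuser` was not installed (the `rc: 127` in the log is bash's "command not found" exit code). A minimal check/fix sketch, assuming a yum-based host (`fuser` ships in the `psmisc` package on most distributions; on Debian/Ubuntu substitute apt-get):

```shell
# SshFenceByTcpPort runs the fence command with this PATH prepended,
# so check with the same PATH the ZKFC will use.
PATH="$PATH:/sbin:/usr/sbin"

if command -v fuser >/dev/null 2>&1; then
    echo "fuser present: $(command -v fuser)"
else
    # fuser is in the psmisc package on most distributions.
    echo "fuser missing - install psmisc, e.g.: sudo yum install -y psmisc"
fi
```

Run this on every NameNode host: the ZKFC on the standby connects over ssh to the failed active and executes `fuser` there, so the binary must exist on the node being fenced, not just on the node doing the fencing.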