Hi Pavan,
I'm using sshfence.

------core-site.xml-----------------
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://lklcluster</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp2</value>
  </property>
</configuration>

-------hdfs-site.xml-------------
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/namedir2</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/datadir2</value>
  </property>
  <property>
    <name>dfs.nameservices</name>
    <value>lklcluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.lklcluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.lklcluster.nn1</name>
    <value>hadoop2:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.lklcluster.nn2</name>
    <value>hadoop3:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.lklcluster.nn1</name>
    <value>hadoop2:50070</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.lklcluster.nn2</name>
    <value>hadoop3:50070</value>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://hadoop1:8485;hadoop2:8485;hadoop3:8485/lklcluster</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.lklcluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hadoop/.ssh/id_rsa</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.connect-timeout</name>
    <value>5000</value>
  </property>
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/home/hadoop/journal/data</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
  </property>
</configuration>

2013/12/2 Pavan Kumar Polineni <smartsunny...@gmail.com>

> Post your config files and the method you are following for automatic
> failover.
>
> On Mon, Dec 2, 2013 at 5:34 PM, YouPeng Yang <yypvsxf19870...@gmail.com> wrote:
>
>> Hi,
>> I'm testing HA auto-failover with hadoop-2.2.0.
>>
>> The cluster fails over correctly when triggered manually, but automatic
>> failover does not work.
>> I set up HA according to
>> http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html
>>
>> To test automatic failover, I killed the active NN with kill -9 <Pid-nn>,
>> but the standby NameNode did not change to the active state.
>> The DFSZKFailoverController logged the output in [1].
>>
>> Please help me; any suggestion will be appreciated.
>>
>> Regards.
>>
>> zkfc log [1] ----------------------------------------------------------------------------------------------------
>>
>> 2013-12-02 19:49:28,588 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
>> 2013-12-02 19:49:28,588 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
>> 2013-12-02 19:49:28,590 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to hadoop3...
>> 2013-12-02 19:49:28,590 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to hadoop3 port 22
>> 2013-12-02 19:49:28,592 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connection established
>> 2013-12-02 19:49:28,603 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Remote version string: SSH-2.0-OpenSSH_5.3
>> 2013-12-02 19:49:28,603 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Local version string: SSH-2.0-JSCH-0.1.42
>> 2013-12-02 19:49:28,603 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: CheckCiphers: aes256-ctr,aes192-ctr,aes128-ctr,aes256-cbc,aes192-cbc,aes128-cbc,3des-ctr,arcfour,arcfour128,arcfour256
>> 2013-12-02 19:49:28,608 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-ctr is not available.
>> 2013-12-02 19:49:28,608 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-ctr is not available.
>> 2013-12-02 19:49:28,608 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes256-cbc is not available.
>> 2013-12-02 19:49:28,608 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: aes192-cbc is not available.
>> 2013-12-02 19:49:28,609 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: arcfour256 is not available.
>> 2013-12-02 19:49:28,610 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT sent
>> 2013-12-02 19:49:28,610 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXINIT received
>> 2013-12-02 19:49:28,610 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: server->client aes128-ctr hmac-md5 none
>> 2013-12-02 19:49:28,610 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: kex: client->server aes128-ctr hmac-md5 none
>> 2013-12-02 19:49:28,617 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_KEXDH_INIT sent
>> 2013-12-02 19:49:28,617 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: expecting SSH_MSG_KEXDH_REPLY
>> 2013-12-02 19:49:28,634 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: ssh_rsa_verify: signature true
>> 2013-12-02 19:49:28,635 WARN org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Permanently added 'hadoop3' (RSA) to the list of known hosts.
>> 2013-12-02 19:49:28,635 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_NEWKEYS sent
>> 2013-12-02 19:49:28,635 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_NEWKEYS received
>> 2013-12-02 19:49:28,636 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_REQUEST sent
>> 2013-12-02 19:49:28,637 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: SSH_MSG_SERVICE_ACCEPT received
>> 2013-12-02 19:49:28,638 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentications that can continue: gssapi-with-mic,publickey,keyboard-interactive,password
>> 2013-12-02 19:49:28,639 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next authentication method: gssapi-with-mic
>> 2013-12-02 19:49:28,642 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Authentications that can continue: publickey,keyboard-interactive,password
>> 2013-12-02 19:49:28,642 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Next authentication method: publickey
>> 2013-12-02 19:49:28,644 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Disconnecting from hadoop3 port 22
>> 2013-12-02 19:49:28,644 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to connect to hadoop3 as user hadoop
>> com.jcraft.jsch.JSchException: Auth fail
>>         at com.jcraft.jsch.Session.connect(Session.java:452)
>>         at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
>>         at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
>>         at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
>>         at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>         at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>         at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>         at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>         at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>         at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>> 2013-12-02 19:49:28,644 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
>> 2013-12-02 19:49:28,645 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
>> 2013-12-02 19:49:28,645 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at hadoop2/10.7.23.125:8020 entered state: SERVICE_NOT_RESPONDING
>> 2013-12-02 19:49:28,646 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
>> java.lang.RuntimeException: Unable to fence NameNode at hadoop3/10.7.23.124:8020
>>         at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:522)
>>         at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>         at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>         at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>         at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>         at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>         at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>> 2013-12-02 19:49:28,646 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
>> 2013-12-02 19:49:28,669 INFO org.apache.zookeeper.ZooKeeper: Session: 0x2429313c808025b closed
>> 2013-12-02 19:49:29,672 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop1:2181,hadoop2:2181,hadoop3:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@3545fe3b
>> 2013-12-02 19:49:29,675 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop3/10.7.23.124:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
>> 2013-12-02 19:49:29,675 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop3/10.7.23.124:2181, initiating session
>> 2013-12-02 19:49:29,699 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoop3/10.7.23.124:2181, sessionid = 0x3429312ba330262, negotiated timeout = 5000
>> 2013-12-02 19:49:29,702 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
>> 2013-12-02 19:49:29,706 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
>> 2013-12-02 19:49:29,706 INFO org.apache.hadoop.ha.ZKFailoverController: Quitting master election for NameNode at hadoop2/10.7.23.125:8020 and marking that fencing is necessary
>> 2013-12-02 19:49:29,706 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
>> 2013-12-02 19:49:29,727 INFO org.apache.zookeeper.ZooKeeper: Session: 0x3429312ba330262 closed
>> 2013-12-02 19:49:29,728 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x3429312ba330262
>> 2013-12-02 19:49:29,728 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
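[Editor's note] The "Auth fail" in the log shows the ZKFC on hadoop2 did reach sshd on hadoop3 (the SSH handshake completed), but public-key authentication as user hadoop was rejected, so the key named in dfs.ha.fencing.ssh.private-key-files is not accepted by hadoop3. The usual checks are that /home/hadoop/.ssh/id_rsa's public half is present in hadoop3's ~hadoop/.ssh/authorized_keys and that the key and .ssh directory have strict permissions (600/700). Separately, dfs.ha.fencing.methods accepts a newline-separated list of methods tried in order, so a shell fallback can let failover proceed even when sshfence cannot log in. A hedged sketch, not part of the thread's original hdfs-site.xml:

```xml
<!-- Sketch only: try sshfence first, then fall back to a no-op shell fence.
     shell(/bin/true) always reports success, so use it only if you are
     confident the failed NameNode cannot still be accepting writes
     (e.g. the whole machine is down), otherwise you risk split-brain. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
```

After changing the fencing configuration, the ZKFC daemons on both NameNode hosts need to be restarted for it to take effect.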