Hi Yu

I think when the NIC is unplugged, the SSH connection cannot get through because there is no route to the failed active NN. If that is the case, sshfence will always fail. Am I right?
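That is exactly what the zkfc log below shows: sshfence cannot reach the dead host, so fencing never succeeds and the standby refuses to become active. The usual remedy is to list a second fencing method after sshfence; dfs.ha.fencing.methods accepts a newline-separated list of methods that are tried in order. A minimal sketch (note that a shell(/bin/true) fallback simply assumes an unreachable node really is down, which is only acceptable if you can tolerate the split-brain risk when the old active is alive but unreachable):

<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- Methods are tried top to bottom. shell(/bin/true) always reports
       success, letting failover proceed when the old active host cannot
       be reached over SSH. -->
  <value>sshfence
shell(/bin/true)</value>
</property>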
2013/12/3 YouPeng Yang <yypvsxf19870...@gmail.com>

> Hi Yu
>
> Thanks for your response.
> I'm sure my SSH setup is good. SSH from the active NN to the standby NN
> needs no password.
>
> I attached my config.
>
> ------ core-site.xml ------
>
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://lklcluster</value>
>     <final>true</final>
>   </property>
>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/home/hadoop/tmp2</value>
>   </property>
> </configuration>
>
> ------ hdfs-site.xml ------
>
> <configuration>
>   <property>
>     <name>dfs.namenode.name.dir</name>
>     <value>/home/hadoop/namedir2</value>
>   </property>
>
>   <property>
>     <name>dfs.datanode.data.dir</name>
>     <value>/home/hadoop/datadir2</value>
>   </property>
>
>   <property>
>     <name>dfs.nameservices</name>
>     <value>lklcluster</value>
>   </property>
>
>   <property>
>     <name>dfs.ha.namenodes.lklcluster</name>
>     <value>nn1,nn2</value>
>   </property>
>
>   <property>
>     <name>dfs.namenode.rpc-address.lklcluster.nn1</name>
>     <value>hadoop2:8020</value>
>   </property>
>
>   <property>
>     <name>dfs.namenode.rpc-address.lklcluster.nn2</name>
>     <value>hadoop3:8020</value>
>   </property>
>
>   <property>
>     <name>dfs.namenode.http-address.lklcluster.nn1</name>
>     <value>hadoop2:50070</value>
>   </property>
>
>   <property>
>     <name>dfs.namenode.http-address.lklcluster.nn2</name>
>     <value>hadoop3:50070</value>
>   </property>
>
>   <property>
>     <name>dfs.namenode.shared.edits.dir</name>
>     <value>qjournal://hadoop1:8485;hadoop2:8485;hadoop3:8485/lklcluster</value>
>   </property>
>
>   <property>
>     <name>dfs.client.failover.proxy.provider.lklcluster</name>
>     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>   </property>
>
>   <property>
>     <name>dfs.ha.fencing.methods</name>
>     <value>sshfence</value>
>   </property>
>
>   <property>
>     <name>dfs.ha.fencing.ssh.private-key-files</name>
>     <value>/home/hadoop/.ssh/id_rsa</value>
>   </property>
>
>   <property>
>     <name>dfs.ha.fencing.ssh.connect-timeout</name>
>     <value>5000</value>
>   </property>
>
>   <property>
>     <name>dfs.journalnode.edits.dir</name>
>     <value>/home/hadoop/journal/data</value>
>   </property>
>
>   <property>
>     <name>dfs.ha.automatic-failover.enabled</name>
>     <value>true</value>
>   </property>
>
>   <property>
>     <name>ha.zookeeper.quorum</name>
>     <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
>   </property>
> </configuration>
>
> 2013/12/3 Azuryy Yu <azury...@gmail.com>
>
>> This is still because your fence method is configured improperly.
>> Please paste your fence configuration, and double-check that you can
>> SSH from the active NN to the standby NN without a password.
>>
>> On Tue, Dec 3, 2013 at 10:23 AM, YouPeng Yang <yypvsxf19870...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Another auto-failover testing problem:
>>>
>>> My HA setup can auto-failover after I kill the active NN. But when I
>>> unplug the network interface to simulate a hardware failure, the
>>> auto-failover does not seem to work, even after waiting for some time.
>>> The zkfc logs are in [1].
>>>
>>> I'm using the default sshfence.
>>>
>>> [1] zkfc logs ----------------------------------------------------------------
>>> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
>>> 2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
>>> 2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to hadoop3...
>>> 2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to hadoop3 port 22
>>> 2013-12-03 10:05:59,648 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to connect to hadoop3 as user hadoop
>>> com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route to host
>>>         at com.jcraft.jsch.Util.createSocket(Util.java:386)
>>>         at com.jcraft.jsch.Session.connect(Session.java:182)
>>>         at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
>>>         at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
>>>         at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
>>>         at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>>         at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>>         at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>>         at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>>         at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>>         at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 2013-12-03 10:05:59,649 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
>>> 2013-12-03 10:05:59,649 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
>>> 2013-12-03 10:05:59,650 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
>>> java.lang.RuntimeException: Unable to fence NameNode at hadoop3/10.7.23.124:8020
>>>         at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:522)
>>>         at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>>         at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>>         at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>>         at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
>>>         at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
>>>         at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
>>>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 2013-12-03 10:05:59,650 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
>>> 2013-12-03 10:05:59,676 INFO org.apache.zookeeper.ZooKeeper: Session: 0x142931031810260 closed
>>> 2013-12-03 10:06:00,678 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop1:2181,hadoop2:2181,hadoop3:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5ce2acea
>>> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop1/10.7.23.122:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
>>> 2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop1/10.7.23.122:2181, initiating session
>>> 2013-12-03 10:06:00,709 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoop1/10.7.23.122:2181, sessionid = 0x142931031810261, negotiated timeout = 5000
>>> 2013-12-03 10:06:00,711 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
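For anyone debugging a similar setup, the SSH path that sshfence uses can be checked by hand from the standby NN before relying on automatic failover. A sketch, using the user, key, timeout, and hostnames from the configuration in this thread (adjust to your environment):

# Run on the standby NN (hadoop2). This is roughly the connection that
# SshFenceByTcpPort attempts; with hadoop3's NIC unplugged it fails with
# "No route to host", matching the zkfc log above.
ssh -i /home/hadoop/.ssh/id_rsa -o ConnectTimeout=5 hadoop@hadoop3 true \
  && echo "fencing SSH path OK"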