Can not auto-failover when unplug network interface
Hi,

Another auto-failover testing problem: my HA setup fails over automatically when I kill the active NN. But when I unplug the network interface to simulate a hardware failure, auto-failover does not seem to work, even after waiting for some time. The zkfc logs are in [1]. I'm using the default sshfence.

[1] zkfc logs

2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: == Beginning Service Fencing Process... ==
2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to hadoop3...
2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to hadoop3 port 22
2013-12-03 10:05:59,648 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to connect to hadoop3 as user hadoop
com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route to host
    at com.jcraft.jsch.Util.createSocket(Util.java:386)
    at com.jcraft.jsch.Session.connect(Session.java:182)
    at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
    at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
    at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
    at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
    at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
    at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
    at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2013-12-03 10:05:59,649 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2013-12-03 10:05:59,649 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
2013-12-03 10:05:59,650 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at hadoop3/10.7.23.124:8020
    at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:522)
    at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
    at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
    at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
    at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2013-12-03 10:05:59,650 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2013-12-03 10:05:59,676 INFO org.apache.zookeeper.ZooKeeper: Session: 0x142931031810260 closed
2013-12-03 10:06:00,678 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop1:2181,hadoop2:2181,hadoop3:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5ce2acea
2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop1/10.7.23.122:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop1/10.7.23.122:2181, initiating session
2013-12-03 10:06:00,709 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoop1/10.7.23.122:2181, sessionid = 0x142931031810261, negotiated timeout = 5000
2013-12-03 10:06:00,711 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
Re: Can not auto-failover when unplug network interface
This is still because your fencing method is configured improperly. Please paste your fencing configuration, and double-check that you can ssh from the active NN to the standby NN without a password.

On Tue, Dec 3, 2013 at 10:23 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote:
> My HA can auto-failover after I kill the active NN. When it comes to unplugging the network interface to simulate a hardware failure, auto-failover does not seem to work. I'm using the default sshfence.
Re: Can not auto-failover when unplug network interface
Hi Yu,

Thanks for your response. I'm sure my ssh setup is good: ssh from the active NN to the standby NN needs no password. My configuration is below.

--- core-site.xml ---

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://lklcluster</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp2</value>
  </property>
</configuration>

--- hdfs-site.xml ---

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/namedir2</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/datadir2</value>
  </property>
  <property>
    <name>dfs.nameservices</name>
    <value>lklcluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.lklcluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.lklcluster.nn1</name>
    <value>hadoop2:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.lklcluster.nn2</name>
    <value>hadoop3:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.lklcluster.nn1</name>
    <value>hadoop2:50070</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.lklcluster.nn2</name>
    <value>hadoop3:50070</value>
  </property>
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://hadoop1:8485;hadoop2:8485;hadoop3:8485/lklcluster</value>
  </property>
  <property>
    <name>dfs.client.failover.proxy.provider.lklcluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hadoop/.ssh/id_rsa</value>
  </property>
  <property>
    <name>dfs.ha.fencing.ssh.connect-timeout</name>
    <value>5000</value>
  </property>
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/home/hadoop/journal/data</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
  </property>
</configuration>

2013/12/3 Azuryy Yu azury...@gmail.com:
> This is still because your fencing method is configured improperly. Please paste your fencing configuration, and double-check that you can ssh from the active NN to the standby NN without a password.
Re: Can not auto-failover when unplug network interface
Hi Yu,

I think that when the NIC is unplugged, ssh cannot get through, because it cannot connect to the failed active NN. In that case sshfence will fail. Am I right?

2013/12/3 YouPeng Yang yypvsxf19870...@gmail.com:
> Thanks for your response. I'm sure my ssh setup is good: ssh from the active NN to the standby NN needs no password. My configuration is below.
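
The zkfc log is consistent with that reading: NodeFencer tries "method 1/1", the ssh connection to hadoop3 fails with NoRouteToHostException, and the standby gives up with "Unable to fence service by any configured method". A possible way around it, sketched here as an assumption and not something proposed in this thread, is to list a second fencing method after sshfence. Hadoop's dfs.ha.fencing.methods accepts several methods, one per line, tried in order, and shell(...) runs an arbitrary command. Using /bin/true as that command is an illustrative choice: it always reports success, so it does not actually stop the old NameNode and instead relies on the QJM setup in the hdfs-site.xml above letting only one NameNode write to the JournalNodes.

```xml
<!-- hdfs-site.xml fragment: illustrative sketch only, not taken from the thread -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- One method per line, tried in order; fencing stops at the first success. -->
  <!-- shell(/bin/true) is an assumed fallback that always succeeds without -->
  <!-- touching the old active NN, so it only makes sense with QJM shared edits. -->
  <value>sshfence
shell(/bin/true)</value>
</property>
```

With a fallback like this, an unreachable host no longer blocks the standby from becoming active, but the old active may keep answering stale read requests until it notices it can no longer write to the JournalNodes. A real fencing script (for example one that powers the host off through a management interface) would be the stricter choice where that matters.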