Can not auto-failover when unplug network interface

2013-12-02 Thread YouPeng Yang
Hi,
   Another auto-failover testing problem:

   My HA setup fails over automatically when I kill the active NN. But when I
unplug the network interface to simulate a hardware failure, auto-failover
does not seem to work, even after waiting for a while. The zkfc logs are in [1].

   I'm using the default sshfence.






[1] zkfc logs
2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: == Beginning Service Fencing Process... ==
2013-12-03 10:05:56,650 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to hadoop3...
2013-12-03 10:05:56,651 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to hadoop3 port 22
2013-12-03 10:05:59,648 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to connect to hadoop3 as user hadoop
com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route to host
    at com.jcraft.jsch.Util.createSocket(Util.java:386)
    at com.jcraft.jsch.Session.connect(Session.java:182)
    at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
    at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
    at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:521)
    at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
    at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
    at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
    at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2013-12-03 10:05:59,649 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2013-12-03 10:05:59,649 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
2013-12-03 10:05:59,650 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at hadoop3/10.7.23.124:8020
    at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:522)
    at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
    at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
    at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
    at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:900)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:799)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:596)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
2013-12-03 10:05:59,650 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2013-12-03 10:05:59,676 INFO org.apache.zookeeper.ZooKeeper: Session: 0x142931031810260 closed
2013-12-03 10:06:00,678 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hadoop1:2181,hadoop2:2181,hadoop3:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@5ce2acea
2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop1/10.7.23.122:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2013-12-03 10:06:00,681 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop1/10.7.23.122:2181, initiating session
2013-12-03 10:06:00,709 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoop1/10.7.23.122:2181, sessionid = 0x142931031810261, negotiated timeout = 5000
2013-12-03 10:06:00,711 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down


Re: Can not auto-failover when unplug network interface

2013-12-02 Thread Azuryy Yu
This is still because your fence method is configured improperly.
Please paste your fence configuration, and double-check that you can ssh from
the active NN to the standby NN without a password.



Re: Can not auto-failover when unplug network interface

2013-12-02 Thread YouPeng Yang
Hi Yu,

  Thanks for your response.
  I'm sure my ssh setup is good. Ssh from the active NN to the standby NN
needs no password.

I have attached my config:
---core-site.xml---

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://lklcluster</value>
    <final>true</final>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp2</value>
  </property>
</configuration>


---hdfs-site.xml---

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/namedir2</value>
  </property>

  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/datadir2</value>
  </property>

  <property>
    <name>dfs.nameservices</name>
    <value>lklcluster</value>
  </property>

  <property>
    <name>dfs.ha.namenodes.lklcluster</name>
    <value>nn1,nn2</value>
  </property>

  <property>
    <name>dfs.namenode.rpc-address.lklcluster.nn1</name>
    <value>hadoop2:8020</value>
  </property>

  <property>
    <name>dfs.namenode.rpc-address.lklcluster.nn2</name>
    <value>hadoop3:8020</value>
  </property>

  <property>
    <name>dfs.namenode.http-address.lklcluster.nn1</name>
    <value>hadoop2:50070</value>
  </property>

  <property>
    <name>dfs.namenode.http-address.lklcluster.nn2</name>
    <value>hadoop3:50070</value>
  </property>

  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://hadoop1:8485;hadoop2:8485;hadoop3:8485/lklcluster</value>
  </property>

  <property>
    <name>dfs.client.failover.proxy.provider.lklcluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>

  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>

  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/hadoop/.ssh/id_rsa</value>
  </property>

  <property>
    <name>dfs.ha.fencing.ssh.connect-timeout</name>
    <value>5000</value>
  </property>

  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/home/hadoop/journal/data</value>
  </property>

  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>

  <property>
    <name>ha.zookeeper.quorum</name>
    <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
  </property>
</configuration>




Re: Can not auto-failover when unplug network interface

2013-12-02 Thread YouPeng Yang
Hi Yu,

   I think that when the NIC is unplugged, ssh cannot get through because it
cannot connect to the failed active NN.
In that case, sshfence will fail.
   Am I right?
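
For reference, a sketch of one commonly used arrangement for this case; it is not something proposed in this thread. The Hadoop HA documentation advises listing a fencing method that always succeeds, such as shell(/bin/true), as the last entry in dfs.ha.fencing.methods, so that failover can still proceed when the old active NN is unreachable over ssh. The methods are tried in order, one per line. This assumes the shared edits are on the Quorum Journal Manager, as in the hdfs-site.xml above, where the JournalNodes accept only one writer at a time; with other kinds of shared storage the fallback could risk split-brain.

```xml
<!-- hdfs-site.xml fragment (illustrative sketch, assumes QJM shared edits).
     sshfence is tried first; shell(/bin/true) is the always-succeeding fallback. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
```

With a list like this, an unreachable host would presumably still make sshfence fail with the same NoRouteToHostException, but the fencing process would then move on to the second method instead of ending with "Unable to fence service by any configured method."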

