[jira] [Created] (HDFS-13760) improve ZKFC fencing action when network of ZKFC interrupt

He Xiaoqiao (JIRA) Mon, 23 Jul 2018 07:14:15 -0700

He Xiaoqiao created HDFS-13760:
----------------------------------

             Summary: improve ZKFC fencing action when network of ZKFC interrupt
                 Key: HDFS-13760
                 URL: https://issues.apache.org/jira/browse/HDFS-13760
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: ha
            Reporter: He Xiaoqiao



When the host running the Active NameNode and its ZKFC suffers a prolonged 
network fault, HDFS becomes unavailable: the ZKFC on the Standby NameNode host 
can never complete ssh fencing because it cannot ssh to the Active NameNode 
host. In this situation a client cannot connect to the Active NameNode, so it 
fails over to the Standby, which cannot serve READ/WRITE requests.
{code:xml}
2018-07-23 15:57:10,836 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 40 
time(s); maxRetries=45
2018-07-23 15:57:30,856 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 41 
time(s); maxRetries=45
2018-07-23 15:57:50,872 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 42 
time(s); maxRetries=45
2018-07-23 15:58:10,892 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 43 
time(s); maxRetries=45
2018-07-23 15:58:30,912 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rz-data-hdp-nn14.rz.sankuai.com/10.16.70.34:8060. Already tried 44 
time(s); maxRetries=45
2018-07-23 15:58:50,933 INFO org.apache.hadoop.ha.ZKFailoverController: get old 
active state exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 
millis timeout while waiting for channel to be 
ready for connect. ch : java.nio.channels.SocketChannel[connection-pending 
local=/ip:port remote=hostname]
2018-07-23 15:58:50,933 INFO org.apache.hadoop.ha.ActiveStandbyElector: old 
active is not healthy. need to create znode
2018-07-23 15:58:50,933 INFO org.apache.hadoop.ha.ActiveStandbyElector: Elector 
callbacks for NameNode at standbynn start create node, now time: 
45179010079342817
2018-07-23 15:58:50,936 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
CreateNode result: 0 code:OK for path: /hadoop-ha/ns/ActiveStandbyElectorLock 
connectionState: CONNECTED  for elector id=469098346 
appData=0a07727a2d6e6e313312046e6e31331a1f727a2d646174612d6864702d6e6e31332e727a2e73616e6b7561692e636f6d20e83e28d33e
 cb=Elector callbacks for NameNode at standbynamenode
2018-07-23 15:58:50,936 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Checking for any old active which needs to be fenced...
2018-07-23 15:58:50,938 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old 
node exists: 
0a07727a2d6e6e313312046e6e31341a1f727a2d646174612d6864702d6e6e31342e727a2e73616e6b7561692e636f6d20e83e28d33e
2018-07-23 15:58:50,939 INFO org.apache.hadoop.ha.ZKFailoverController: Should 
fence: NameNode at activenamenode
2018-07-23 15:59:10,960 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: activenamenode. Already tried 0 time(s); maxRetries=1
2018-07-23 15:59:30,980 WARN org.apache.hadoop.ha.FailoverController: Unable to 
gracefully make NameNode at activenamenode standby (unable to connect)
org.apache.hadoop.net.ConnectTimeoutException: Call From standbynamenode to 
activenamenode failed on socket timeout exception: 
org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while 
waiting for channel to be ready for connect. ch : 
java.nio.channels.SocketChannel[connection-pending local=ip:port 
remote=activenamenode]; For more details see:  
http://wiki.apache.org/hadoop/SocketTimeout
{code}

I propose that when the Active NameNode host suffers a network fault, its ZKFC 
force that NameNode to become Standby, so that the other ZKFC can hold the 
ZNode for election and transition the other NameNode to Active even when ssh 
fencing fails.
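
To make the proposal concrete, here is a minimal sketch of the decision it 
changes. It is not Hadoop code: the class FencingSketch, the Outcome enum, and 
the oldActiveSelfDemoted flag are hypothetical names used only for 
illustration. How the ZKFC on the Standby side would learn, across a network 
partition, that the old Active has demoted itself is left open here, as it is 
in this issue.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical model of the fencing decision; these are not Hadoop classes. */
class FencingSketch {

    /** Result of one failover attempt by the ZKFC on the Standby host. */
    enum Outcome { BECOME_ACTIVE, STAY_STANDBY }

    /** Tries the configured fencing methods in order; true once one succeeds. */
    static boolean fence(List<BooleanSupplier> fencingMethods) {
        for (BooleanSupplier method : fencingMethods) {
            if (method.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }

    /** Behavior reported above: if every fencing method fails, no failover. */
    static Outcome currentBehavior(List<BooleanSupplier> fencingMethods) {
        return fence(fencingMethods) ? Outcome.BECOME_ACTIVE : Outcome.STAY_STANDBY;
    }

    /**
     * Proposed behavior (assumption): the isolated ZKFC has already forced its
     * local NameNode to Standby, so the peer may proceed even if fencing fails.
     */
    static Outcome proposedBehavior(List<BooleanSupplier> fencingMethods,
                                    boolean oldActiveSelfDemoted) {
        if (fence(fencingMethods) || oldActiveSelfDemoted) {
            return Outcome.BECOME_ACTIVE;
        }
        return Outcome.STAY_STANDBY;
    }
}
```

With a single fencing method that always fails (standing in for an unreachable 
ssh target), currentBehavior returns STAY_STANDBY, while proposedBehavior 
returns BECOME_ACTIVE only when oldActiveSelfDemoted is true.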

There is no patch available yet; suggestions are very welcome.



