[ https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kumar Vavilapalli updated HDFS-7763:
------------------------------------------
    Fix Version/s: 2.6.1

[~sjlee0] backported this to 2.6.1. I just pushed the commit to 2.6.1 after running compilation; the patch applied cleanly.

> fix zkfc hung issue due to not catching exception in a corner case
> ------------------------------------------------------------------
>
>                 Key: HDFS-7763
>                 URL: https://issues.apache.org/jira/browse/HDFS-7763
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.6.0
>            Reporter: Liang Xie
>            Assignee: Liang Xie
>              Labels: 2.6.1-candidate
>             Fix For: 2.7.0, 2.6.1
>
>         Attachments: HDFS-7763-001.txt, HDFS-7763-002.txt, jstack.4936
>
>
> In our production cluster, both zkfc processes hung after a ZooKeeper network outage.
> The zkfc log said:
> {code}
> 2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect
> 2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
> 2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 0x4a61bacdd9dfb2 closed
> 2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
> 2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 11300
> 2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 11300
> {code}
> The thread dump has also been uploaded as an attachment.
> From the dump, we can see that the process did not exit because of some unidentified non-daemon threads (pool-*-thread-*), while the critical threads, such as the health monitor and RPC threads, had already stopped. As a result, our watchdog (supervisord) did not observe that the zkfc process was down or abnormal, and the subsequent namenode failover could not be done as expected.
> There are two possible fixes here:
> 1) Figure out where the threads with unset names, like pool-7-thread-1, come from, and either shut them down or set the daemon property on them. I tried to search but have found nothing so far.
> 2) Catch the exception from ZKFailoverController.run() so that execution can continue to System.exit.
> The attached patch is 2).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
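
For readers following fix 2) in the quoted description, here is a minimal sketch of the general pattern: wrap the controller's run call so that a thrown exception still reaches the exit call, instead of unwinding past it and leaving non-daemon threads to keep the JVM alive. This is not the attached patch. The class `ExitOnFailure`, the method `runAndExit`, and the injectable exit hook are hypothetical names introduced only for illustration; a real entry point would presumably pass `System::exit` as the hook.

```java
import java.util.concurrent.Callable;
import java.util.function.IntConsumer;

public class ExitOnFailure {

    /**
     * Runs the task and always hands an exit code to the exit hook,
     * even when the task throws. The hook is injectable (hypothetical
     * design choice) so the behavior can be checked without killing the JVM.
     */
    public static int runAndExit(Callable<Integer> task, IntConsumer exit) {
        int rc;
        try {
            rc = task.call();
        } catch (Throwable t) {
            // Without this catch, the exception would propagate past the
            // exit call below, and any lingering non-daemon thread would
            // keep the process alive in a half-stopped state.
            System.err.println("Fatal error, exiting: " + t);
            rc = 1;
        }
        exit.accept(rc);
        return rc;
    }

    public static void main(String[] args) {
        // Simulated failure standing in for an exception thrown by run().
        // The recording hook below replaces System::exit for demonstration.
        int[] recorded = {-1};
        runAndExit(() -> {
            throw new IllegalStateException("simulated stat error");
        }, code -> recorded[0] = code);
        System.out.println("exit hook received code " + recorded[0]);
    }
}
```

With `System::exit` as the hook, the process would terminate with a non-zero status on failure, which a watchdog such as supervisord can observe. Fix 1) from the description (marking the pool threads as daemon) is a separate approach and is not shown here.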