[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case
[ https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated HDFS-7763: -- Fix Version/s: 2.6.1 [~sjlee0] backported this to 2.6.1. I just pushed the commit to 2.6.1 after running compilation, the patch applied cleanly. > fix zkfc hung issue due to not catching exception in a corner case > -- > > Key: HDFS-7763 > URL: https://issues.apache.org/jira/browse/HDFS-7763 > Project: Hadoop HDFS > Issue Type: Bug > Components: ha >Affects Versions: 2.6.0 >Reporter: Liang Xie >Assignee: Liang Xie > Labels: 2.6.1-candidate > Fix For: 2.7.0, 2.6.1 > > Attachments: HDFS-7763-001.txt, HDFS-7763-002.txt, jstack.4936 > > > In our product cluster, we hit both the two zkfc process is hung after a zk > network outage. > the zkfc log said: > {code} > 2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session > timed out, have not heard from server in 3334ms for sessionid > 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect > 2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: > Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further > znode monitoring connection errors. > 2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: > 0x4a61bacdd9dfb2 closed > 2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: > Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. > Not retrying further znode monitoring connection errors. > 2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on > 11300 > 2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 > 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 > 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 > 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 > 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 > 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 > 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 > 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 > 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 > 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 > 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 > 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 > 2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread > shut down > 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: > Yielding from election > 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC > Server Responder > 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping > HealthMonitor thread > 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC > Server listener on 11300 > {code} > and the thread dump also be uploaded as attachment. > From the dump, we can see due to the unknown non-daemon > threads(pool-*-thread-*), the process did not exit, but the critical threads, > like health monitor and rpc threads had been stopped, so our > watchdog(supervisord) had not not observed the zkfc process is down or > abnormal. so the following namenode failover could not be done as expected. > there're two possible fixes here, 1) figure out the unset-thread-name, like > pool-7-thread-1, where them came from and close or set daemon property. i > tried to search but got nothing right now. 2) catch the exception from > ZKFailoverController.run() so we can continue to exec the System.exit, the > attached patch is 2). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case
[ https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated HDFS-7763: -- Labels: 2.6.1-candidate (was: ) fix zkfc hung issue due to not catching exception in a corner case -- Key: HDFS-7763 URL: https://issues.apache.org/jira/browse/HDFS-7763 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Liang Xie Assignee: Liang Xie Labels: 2.6.1-candidate Fix For: 2.7.0 Attachments: HDFS-7763-001.txt, HDFS-7763-002.txt, jstack.4936 In our product cluster, we hit both the two zkfc process is hung after a zk network outage. the zkfc log said: {code} 2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect 2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 0x4a61bacdd9dfb2 closed 2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 11300 2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 11300 {code} and the thread dump also be uploaded as attachment. From the dump, we can see due to the unknown non-daemon threads(pool-*-thread-*), the process did not exit, but the critical threads, like health monitor and rpc threads had been stopped, so our watchdog(supervisord) had not not observed the zkfc process is down or abnormal. so the following namenode failover could not be done as expected. there're two possible fixes here, 1) figure out the unset-thread-name, like pool-7-thread-1, where them came from and close or set daemon property. i tried to search but got nothing right now. 2) catch the exception from ZKFailoverController.run() so we can continue to exec the System.exit, the attached patch is 2). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case
[ https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Wang updated HDFS-7763: -- Resolution: Fixed Fix Version/s: 2.7.0 Status: Resolved (was: Patch Available) Committed to trunk and branch-2, thanks for the nice find and fix [~xieliang007]! fix zkfc hung issue due to not catching exception in a corner case -- Key: HDFS-7763 URL: https://issues.apache.org/jira/browse/HDFS-7763 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Liang Xie Assignee: Liang Xie Fix For: 2.7.0 Attachments: HDFS-7763-001.txt, HDFS-7763-002.txt, jstack.4936 In our product cluster, we hit both the two zkfc process is hung after a zk network outage. the zkfc log said: {code} 2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect 2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 0x4a61bacdd9dfb2 closed 2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 11300 2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 11300 {code} and the thread dump also be uploaded as attachment. From the dump, we can see due to the unknown non-daemon threads(pool-*-thread-*), the process did not exit, but the critical threads, like health monitor and rpc threads had been stopped, so our watchdog(supervisord) had not not observed the zkfc process is down or abnormal. so the following namenode failover could not be done as expected. there're two possible fixes here, 1) figure out the unset-thread-name, like pool-7-thread-1, where them came from and close or set daemon property. i tried to search but got nothing right now. 2) catch the exception from ZKFailoverController.run() so we can continue to exec the System.exit, the attached patch is 2). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case
[ https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang Xie updated HDFS-7763: Attachment: HDFS-7763-002.txt How about v2, Colin? fix zkfc hung issue due to not catching exception in a corner case -- Key: HDFS-7763 URL: https://issues.apache.org/jira/browse/HDFS-7763 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Liang Xie Assignee: Liang Xie Attachments: HDFS-7763-001.txt, HDFS-7763-002.txt, jstack.4936 In our product cluster, we hit both the two zkfc process is hung after a zk network outage. the zkfc log said: {code} 2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect 2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 0x4a61bacdd9dfb2 closed 2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 11300 2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 11300 {code} and the thread dump also be uploaded as attachment. From the dump, we can see due to the unknown non-daemon threads(pool-*-thread-*), the process did not exit, but the critical threads, like health monitor and rpc threads had been stopped, so our watchdog(supervisord) had not not observed the zkfc process is down or abnormal. so the following namenode failover could not be done as expected. there're two possible fixes here, 1) figure out the unset-thread-name, like pool-7-thread-1, where them came from and close or set daemon property. i tried to search but got nothing right now. 2) catch the exception from ZKFailoverController.run() so we can continue to exec the System.exit, the attached patch is 2). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case
[ https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang Xie updated HDFS-7763: Attachment: (was: HDFS-7763.txt) fix zkfc hung issue due to not catching exception in a corner case -- Key: HDFS-7763 URL: https://issues.apache.org/jira/browse/HDFS-7763 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Liang Xie Assignee: Liang Xie Attachments: jstack.4936 In our product cluster, we hit both the two zkfc process is hung after a zk network outage. the zkfc log said: {code} 2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect 2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 0x4a61bacdd9dfb2 closed 2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 11300 2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 11300 {code} and the thread dump also be uploaded as attachment. From the dump, we can see due to the unknown non-daemon threads(pool-*-thread-*), the process did not exit, but the critical threads, like health monitor and rpc threads had been stopped, so our watchdog(supervisord) had not not observed the zkfc process is down or abnormal. so the following namenode failover could not be done as expected. there're two possible fixes here, 1) figure out the unset-thread-name, like pool-7-thread-1, where them came from and close or set daemon property. i tried to search but got nothing right now. 2) catch the exception from ZKFailoverController.run() so we can continue to exec the System.exit, the attached patch is 2). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case
[ https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang Xie updated HDFS-7763: Attachment: HDFS-7763-001.txt fix zkfc hung issue due to not catching exception in a corner case -- Key: HDFS-7763 URL: https://issues.apache.org/jira/browse/HDFS-7763 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Liang Xie Assignee: Liang Xie Attachments: HDFS-7763-001.txt, jstack.4936 In our product cluster, we hit both the two zkfc process is hung after a zk network outage. the zkfc log said: {code} 2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect 2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 0x4a61bacdd9dfb2 closed 2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 11300 2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 11300 {code} and the thread dump also be uploaded as attachment. From the dump, we can see due to the unknown non-daemon threads(pool-*-thread-*), the process did not exit, but the critical threads, like health monitor and rpc threads had been stopped, so our watchdog(supervisord) had not not observed the zkfc process is down or abnormal. so the following namenode failover could not be done as expected. there're two possible fixes here, 1) figure out the unset-thread-name, like pool-7-thread-1, where them came from and close or set daemon property. i tried to search but got nothing right now. 2) catch the exception from ZKFailoverController.run() so we can continue to exec the System.exit, the attached patch is 2). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case
[ https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang Xie updated HDFS-7763: Attachment: jstack.4936 HDFS-7763.txt fix zkfc hung issue due to not catching exception in a corner case -- Key: HDFS-7763 URL: https://issues.apache.org/jira/browse/HDFS-7763 Project: Hadoop HDFS Issue Type: Bug Components: ha Affects Versions: 2.6.0 Reporter: Liang Xie Assignee: Liang Xie Attachments: HDFS-7763.txt, jstack.4936 In our product cluster, we hit both the two zkfc process is hung after a zk network outage. the zkfc log said: {code} 2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect 2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 0x4a61bacdd9dfb2 closed 2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors. 2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 11300 2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2 2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 11300 {code} and the thread dump also be uploaded as attachment. From the dump, we can see due to the unknown non-daemon threads(pool-*-thread-*), the process did not exit, but the critical threads, like health monitor and rpc threads had been stopped, so our watchdog(supervisord) had not not observed the zkfc process is down or abnormal. so the following namenode failover could not be done as expected. there're two possible fixes here, 1) figure out the unset-thread-name, like pool-7-thread-1, where them came from and close or set daemon property. i tried to search but got nothing right now. 2) catch the exception from ZKFailoverController.run() so we can continue to exec the System.exit, the attached patch is 2). -- This message was sent by Atlassian JIRA (v6.3.4#6332)