[ 
https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335658#comment-14335658
 ] 

Andrew Wang commented on HDFS-7763:
-----------------------------------

This looks good to me, though one little nit is we could do {{System.exit}} in 
a {{finally}}.

+1, I'll commit shortly.
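
For illustration only, the nit could look roughly like the sketch below. It is not the attached patch: {{ExitInFinallySketch}}, {{Task}} and {{runAndExit}} are hypothetical names, {{Task}} stands in for the call to {{ZKFailoverController.run()}}, and the exit hook is passed in as a parameter only so the behaviour can be exercised without killing the JVM.

```java
import java.util.function.IntConsumer;

class ExitInFinallySketch {
    /** Hypothetical stand-in for the controller's run(); it may throw. */
    interface Task {
        int run() throws Exception;
    }

    /**
     * Runs the task and always invokes the exit hook, even if the task throws.
     * The return value is only reachable when the hook does not terminate the JVM.
     */
    static int runAndExit(Task task, IntConsumer exit) {
        int rc = 1; // assume failure until the task returns normally
        try {
            rc = task.run();
        } catch (Throwable t) {
            // Log and fall through; the finally block still performs the exit.
            System.err.println("Fatal error, exiting: " + t);
        } finally {
            exit.accept(rc);
        }
        return rc;
    }

    public static void main(String[] args) {
        // Simulate the corner case: the task throws instead of returning.
        runAndExit(() -> {
            throw new RuntimeException("simulated ZooKeeper failure");
        }, System::exit);
    }
}
```

With the exit in {{finally}}, an exception that escapes the task cannot skip the exit call, so leftover non-daemon threads would not keep the process alive.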

> fix zkfc hung issue due to not catching exception in a corner case
> ------------------------------------------------------------------
>
>                 Key: HDFS-7763
>                 URL: https://issues.apache.org/jira/browse/HDFS-7763
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.6.0
>            Reporter: Liang Xie
>            Assignee: Liang Xie
>         Attachments: HDFS-7763-001.txt, HDFS-7763-002.txt, jstack.4936
>
>
> In our production cluster, both zkfc processes hung after a zk 
> network outage.
> The zkfc log said:
> {code}
> 2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session 
> timed out, have not heard from server in 3334ms for sessionid 
> 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect
> 2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: 
> Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further 
> znode monitoring connection errors.
> 2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x4a61bacdd9dfb2 closed
> 2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: 
> Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. 
> Not retrying further znode monitoring connection errors.
> 2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 
> 11300
> 2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Yielding from election
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC 
> Server Responder
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping 
> HealthMonitor thread
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC 
> Server listener on 11300
> {code}
> The thread dump has also been uploaded as an attachment.
> From the dump, we can see that the process did not exit because of unknown 
> non-daemon threads (pool-*-thread-*), but the critical threads, such as the 
> health monitor and rpc threads, had been stopped. As a result, our watchdog 
> (supervisord) did not observe that the zkfc process was down or abnormal, so 
> the following namenode failover could not be done as expected.
> There are two possible fixes here: 1) figure out where the threads with 
> unset names, like pool-7-thread-1, come from, and close them or set the 
> daemon property. I tried to search but have found nothing so far. 2) catch 
> the exception from ZKFailoverController.run() so we can continue on to 
> execute System.exit. The attached patch is 2).
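
Option 1) in the description above can be sketched as follows, assuming the owner of the pool-*-thread-* threads were found and its executor could be given a custom thread factory. {{DaemonPoolSketch}}, {{namedDaemonFactory}} and the "zkfc-worker" prefix are hypothetical names used for illustration; nothing here is taken from the Hadoop code or the attached patch.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

class DaemonPoolSketch {
    /** Builds threads that are named and marked daemon, so they cannot block JVM exit. */
    static ThreadFactory namedDaemonFactory(String prefix) {
        AtomicInteger seq = new AtomicInteger(1);
        return runnable -> {
            Thread t = new Thread(runnable, prefix + "-" + seq.getAndIncrement());
            t.setDaemon(true);
            return t;
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool =
            Executors.newSingleThreadExecutor(namedDaemonFactory("zkfc-worker"));
        // Print the worker's name and daemon flag from inside the pool.
        pool.submit(() -> System.out.println(
            Thread.currentThread().getName()
                + " daemon=" + Thread.currentThread().isDaemon())).get();
        // No shutdown() call: the JVM can still exit because the worker is a daemon.
    }
}
```

A named thread would also make a future jstack easier to read than the default pool-N-thread-M naming.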



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
