[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case

2015-09-01 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7763:
--
Fix Version/s: 2.6.1

[~sjlee0] backported this to 2.6.1. I just pushed the commit to 2.6.1 after 
running compilation; the patch applied cleanly.

> fix zkfc hung issue due to not catching exception in a corner case
> --
>
> Key: HDFS-7763
> URL: https://issues.apache.org/jira/browse/HDFS-7763
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.6.0
> Reporter: Liang Xie
> Assignee: Liang Xie
> Labels: 2.6.1-candidate
> Fix For: 2.7.0, 2.6.1
>
> Attachments: HDFS-7763-001.txt, HDFS-7763-002.txt, jstack.4936
>
>
> In our production cluster, both zkfc processes hung after a zk 
> network outage.
> The zkfc log said:
> {code}
> 2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session 
> timed out, have not heard from server in 3334ms for sessionid 
> 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect
> 2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: 
> Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further 
> znode monitoring connection errors.
> 2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x4a61bacdd9dfb2 closed
> 2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: 
> Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. 
> Not retrying further znode monitoring connection errors.
> 2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 
> 11300
> 2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: 
> Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
> 2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
> Yielding from election
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC 
> Server Responder
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping 
> HealthMonitor thread
> 2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC 
> Server listener on 11300
> {code}
> The thread dump has also been uploaded as an attachment.
> From the dump, we can see that because of the unknown non-daemon 
> threads (pool-*-thread-*), the process did not exit, but the critical threads, 
> such as the health monitor and rpc threads, had been stopped, so our 
> watchdog (supervisord) did not observe that the zkfc process was down or 
> abnormal, and the subsequent namenode failover could not be done as expected.
> There are two possible fixes here: 1) figure out where the threads with unset 
> names, like pool-7-thread-1, come from, and close them or set the daemon 
> property (I tried to search but have found nothing so far); 2) catch the 
> exception from ZKFailoverController.run() so we can continue on to 
> System.exit. The attached patch is 2).
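
Option 2) above can be sketched roughly as follows. This is a minimal illustration and not the attached patch: the class name ExitOnFailureSketch, the helper runAndExit, and the injectable exit hook are all hypothetical, and the Callable stands in for ZKFailoverController.run(). It assumes the stray threads come from an executor using the JDK default thread factory, which creates non-daemon threads named pool-N-thread-M; the report itself did not identify their origin.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntConsumer;

// Hypothetical sketch: names here are illustrative, not taken from Hadoop.
class ExitOnFailureSketch {

    /**
     * Runs the task and always reaches the exit hook, even if the task
     * throws. Without the catch, an exception would skip the exit call and
     * any live non-daemon thread would keep the JVM running.
     */
    public static int runAndExit(Callable<Integer> task, IntConsumer exit) {
        int rc;
        try {
            rc = task.call();
        } catch (Throwable t) {
            System.err.println("Got a fatal error, exiting now: " + t);
            rc = 1; // assumed convention: non-zero means abnormal exit
        }
        exit.accept(rc);
        return rc;
    }

    public static void main(String[] args) {
        // The default thread factory yields a non-daemon "pool-1-thread-1".
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE); // simulate a thread that never ends
            } catch (InterruptedException ignored) {
                // nothing to clean up in this sketch
            }
        });

        // Stand-in for a run() that fails with an unchecked exception.
        runAndExit(() -> {
            throw new RuntimeException("Received stat error from Zookeeper");
        }, System::exit);
    }
}
```

With System::exit as the hook, the process terminates despite the sleeping pool thread, so a watchdog such as supervisord can see it go down. Passing the hook as a parameter is only a convenience so the behavior can be exercised without ending the JVM.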



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case

2015-07-22 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7763:
--
Labels: 2.6.1-candidate  (was: )

 fix zkfc hung issue due to not catching exception in a corner case
 --

 Key: HDFS-7763
 URL: https://issues.apache.org/jira/browse/HDFS-7763
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.6.0
Reporter: Liang Xie
Assignee: Liang Xie
  Labels: 2.6.1-candidate
 Fix For: 2.7.0

 Attachments: HDFS-7763-001.txt, HDFS-7763-002.txt, jstack.4936




[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case

2015-02-24 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-7763:
--
   Resolution: Fixed
Fix Version/s: 2.7.0
   Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2, thanks for the nice find and fix 
[~xieliang007]!

 fix zkfc hung issue due to not catching exception in a corner case
 --

 Key: HDFS-7763
 URL: https://issues.apache.org/jira/browse/HDFS-7763
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.6.0
Reporter: Liang Xie
Assignee: Liang Xie
 Fix For: 2.7.0

 Attachments: HDFS-7763-001.txt, HDFS-7763-002.txt, jstack.4936




[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case

2015-02-11 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HDFS-7763:

Attachment: HDFS-7763-002.txt

How about v2, Colin?

 fix zkfc hung issue due to not catching exception in a corner case
 --

 Key: HDFS-7763
 URL: https://issues.apache.org/jira/browse/HDFS-7763
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.6.0
Reporter: Liang Xie
Assignee: Liang Xie
 Attachments: HDFS-7763-001.txt, HDFS-7763-002.txt, jstack.4936




[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case

2015-02-10 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HDFS-7763:

Attachment: (was: HDFS-7763.txt)

 fix zkfc hung issue due to not catching exception in a corner case
 --

 Key: HDFS-7763
 URL: https://issues.apache.org/jira/browse/HDFS-7763
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.6.0
Reporter: Liang Xie
Assignee: Liang Xie
 Attachments: jstack.4936




[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case

2015-02-10 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HDFS-7763:

Attachment: HDFS-7763-001.txt

 fix zkfc hung issue due to not catching exception in a corner case
 --

 Key: HDFS-7763
 URL: https://issues.apache.org/jira/browse/HDFS-7763
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.6.0
Reporter: Liang Xie
Assignee: Liang Xie
 Attachments: HDFS-7763-001.txt, jstack.4936




[jira] [Updated] (HDFS-7763) fix zkfc hung issue due to not catching exception in a corner case

2015-02-09 Thread Liang Xie (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Xie updated HDFS-7763:

Attachment: jstack.4936
HDFS-7763.txt

 fix zkfc hung issue due to not catching exception in a corner case
 --

 Key: HDFS-7763
 URL: https://issues.apache.org/jira/browse/HDFS-7763
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ha
Affects Versions: 2.6.0
Reporter: Liang Xie
Assignee: Liang Xie
 Attachments: HDFS-7763.txt, jstack.4936

