Re: Design document
https://cwiki.apache.org/confluence/display/ZOOKEEPER/FAQ
http://web.stanford.edu/class/cs347/reading/zab.pdf

Start with these.

-Jordan

On June 22, 2015 at 3:34:24 PM, sajjad rizvi (sm3ri...@uwaterloo.ca) wrote:

Hi, I am just curious, is there any design document available to help in understanding the ZooKeeper code? In a research project, I have to make some significant changes in the quorum part of the code. Although the code is very elegant and self descriptive, any design document will be very helpful.

Thanks,
Sajjad Rizvi
[jira] [Commented] (ZOOKEEPER-2210) clock_gettime is not available in os x
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595750#comment-14595750 ]

Hudson commented on ZOOKEEPER-2210:

SUCCESS: Integrated in ZooKeeper-trunk #2734 (See [https://builds.apache.org/job/ZooKeeper-trunk/2734/])
ZOOKEEPER-2210: clock_gettime is not available in OS X (Michi Mutsuzaki via rgs) (rgs: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1686767)
* /zookeeper/trunk/CHANGES.txt
* /zookeeper/trunk/src/c/src/zookeeper.c

clock_gettime is not available in os x

Key: ZOOKEEPER-2210
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2210
Project: ZooKeeper
Issue Type: Bug
Components: c client
Reporter: Michi Mutsuzaki
Assignee: Michi Mutsuzaki
Fix For: 3.5.1, 3.6.0
Attachments: ZOOKEEPER-2210.patch, ZOOKEEPER-2210.patch

{noformat}
src/zookeeper.c:286:9: warning: implicit declaration of function 'clock_gettime' is invalid in C99 [-Wimplicit-function-declaration]
  ret = clock_gettime(CLOCK_MONOTONIC, ts);
        ^
src/zookeeper.c:286:23: error: use of undeclared identifier 'CLOCK_MONOTONIC'
  ret = clock_gettime(CLOCK_MONOTONIC, ts);
{noformat}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
Design document
Hi, I am just curious, is there any design document available to help in understanding the ZooKeeper code? In a research project, I have to make some significant changes in the quorum part of the code. Although the code is very elegant and self descriptive, any design document will be very helpful. Thanks, Sajjad Rizvi
Re: Design document
Thank you Jordan, these are good pointers.

On Mon, Jun 22, 2015 at 4:37 PM, Jordan Zimmerman jor...@jordanzimmerman.com wrote:

https://cwiki.apache.org/confluence/display/ZOOKEEPER/FAQ
http://web.stanford.edu/class/cs347/reading/zab.pdf

Start with these.

-Jordan

On June 22, 2015 at 3:34:24 PM, sajjad rizvi (sm3ri...@uwaterloo.ca) wrote:

Hi, I am just curious, is there any design document available to help in understanding the ZooKeeper code? In a research project, I have to make some significant changes in the quorum part of the code. Although the code is very elegant and self descriptive, any design document will be very helpful.

Thanks,
Sajjad Rizvi
[jira] [Commented] (ZOOKEEPER-2218) Close IO Streams in finally block
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597030#comment-14597030 ]

ASF GitHub Bot commented on ZOOKEEPER-2218:

GitHub user sugartxy opened a pull request:

    https://github.com/apache/zookeeper/pull/36

    #ZOOKEEPER-2218 Close IO Streams in finally block

    Place the close method in the finally clause, so we can ensure it always runs regardless of how the method exits.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sugartxy/zookeeper CloseRightly

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/36.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #36

commit 90745d7476504630c1d68772c26546b28639ba91
Author: sugartxy tgt...@163.com
Date: 2015-06-23T02:36:11Z

    #ZOOKEEPER-2218 Close IO Streams in finally block

    Place the close method in the finally clause, so we can ensure it always runs regardless of how the method exits.

Close IO Streams in finally block

Key: ZOOKEEPER-2218
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2218
Project: ZooKeeper
Issue Type: Bug
Reporter: Tang Xinye
Priority: Critical

The problem here is that if an exception is thrown during the read process the method will exit without closing the stream and hence without releasing the file system resources, it may run out of resources before it does run.
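The pattern the pull request describes can be sketched in Java roughly as follows. This is a minimal illustration, not code from the patch: the class `FirstLineReader` and its method `readFirstLine` are hypothetical names chosen for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper for illustration; not taken from the ZooKeeper patch.
class FirstLineReader {
    // Reads the first line and closes the reader whether readLine()
    // returns normally or throws.
    static String readFirstLine(Reader source) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        try {
            return reader.readLine();
        } finally {
            reader.close(); // runs on normal return and on exception
        }
    }
}
```

On Java 7 and later, a try-with-resources statement gives the same guarantee with less code, provided the project's minimum Java version allows it.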
[jira] [Reopened] (ZOOKEEPER-2218) Close IO Streams in finally block
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales reopened ZOOKEEPER-2218:

Close IO Streams in finally block

Key: ZOOKEEPER-2218
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2218
Project: ZooKeeper
Issue Type: Bug
Reporter: Tang Xinye
Priority: Critical

The problem here is that if an exception is thrown during the read process the method will exit without closing the stream and hence without releasing the file system resources, it may run out of resources before it does run.
[jira] [Commented] (ZOOKEEPER-2218) Close IO Streams in finally block
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597125#comment-14597125 ]

Raul Gutierrez Segales commented on ZOOKEEPER-2218:

Thanks for the patch [~tgttxy]! Let's reopen the issue though, since it hasn't been merged yet.

Close IO Streams in finally block

Key: ZOOKEEPER-2218
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2218
Project: ZooKeeper
Issue Type: Bug
Reporter: Tang Xinye
Priority: Critical

The problem here is that if an exception is thrown during the read process the method will exit without closing the stream and hence without releasing the file system resources, it may run out of resources before it does run.
[jira] [Commented] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597155#comment-14597155 ]

Yasuhito Fukuda commented on ZOOKEEPER-2193:

Thank you for your review. I attached a v8 patch based on your comments and posted a new diff on the review board:
https://reviews.apache.org/r/35204/diff/4-5/

reconfig command completes even if parameter is wrong obviously

Key: ZOOKEEPER-2193
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2193
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection, server
Affects Versions: 3.5.0
Environment: CentOS7 + Java7
Reporter: Yasuhito Fukuda
Assignee: Yasuhito Fukuda
Attachments: ZOOKEEPER-2193-v2.patch, ZOOKEEPER-2193-v3.patch, ZOOKEEPER-2193-v4.patch, ZOOKEEPER-2193-v5.patch, ZOOKEEPER-2193-v6.patch, ZOOKEEPER-2193-v7.patch, ZOOKEEPER-2193-v8.patch, ZOOKEEPER-2193.patch

The reconfig command completes even when its parameter is obviously wrong. Refer to the following.

- Ensemble consists of four nodes
{noformat}
[zk: vm-101:2181(CONNECTED) 0] config
server.1=192.168.100.101:2888:3888:participant
server.2=192.168.100.102:2888:3888:participant
server.3=192.168.100.103:2888:3888:participant
server.4=192.168.100.104:2888:3888:participant
version=1
{noformat}

- Add a node with the reconfig command
{noformat}
[zk: vm-101:2181(CONNECTED) 9] reconfig -add server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
Committed new configuration:
server.1=192.168.100.101:2888:3888:participant
server.2=192.168.100.102:2888:3888:participant
server.3=192.168.100.103:2888:3888:participant
server.4=192.168.100.104:2888:3888:participant
server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
version=30007
{noformat}

server.4 and server.5 have the same IP address. In this state, leader election will not work properly, and the ensemble is likely to end up in an undesirable state. I think reconfig needs parameter validation.
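The kind of parameter validation requested in ZOOKEEPER-2193 could look roughly like the Java sketch below. It is not the logic of the attached patches: the class `ReconfigAddressCheck` and its methods are hypothetical, and the parsing assumes the `server.N=host:quorumPort:electionPort[:role][;clientAddress]` shape seen in the transcripts, with IPv4 addresses or host names only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical validation helper; not the code of the ZOOKEEPER-2193 patches.
class ReconfigAddressCheck {
    // Extracts "host:quorumPort:electionPort" from a line such as
    // "server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181".
    // Assumes IPv4 or host names (no IPv6 literals).
    static String quorumAddress(String serverLine) {
        String spec = serverLine.substring(serverLine.indexOf('=') + 1);
        String beforeClient = spec.split(";", 2)[0]; // drop the client address part
        String[] parts = beforeClient.split(":");
        return parts[0] + ":" + parts[1] + ":" + parts[2];
    }

    // Returns the "server.N" key of a configuration line.
    static String serverId(String serverLine) {
        return serverLine.substring(0, serverLine.indexOf('='));
    }

    // Returns the id of an existing server whose quorum address equals
    // that of the joining server, if there is one.
    static Optional<String> findConflict(List<String> currentServers, String joiningServer) {
        Map<String, String> idByAddress = new HashMap<>();
        for (String line : currentServers) {
            idByAddress.put(quorumAddress(line), serverId(line));
        }
        return Optional.ofNullable(idByAddress.get(quorumAddress(joiningServer)));
    }
}
```

With the four-node ensemble from the report, checking the `server.5` line from the `reconfig -add` command would report a conflict with `server.4`, so the request could be rejected before it is committed. A real check would also have to allow a server to be re-added under its own id, which this sketch does not handle.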
[jira] [Created] (ZOOKEEPER-2218) Close IO Streams in finally block
Tang Xinye created ZOOKEEPER-2218:

Summary: Close IO Streams in finally block
Key: ZOOKEEPER-2218
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2218
Project: ZooKeeper
Issue Type: Bug
Reporter: Tang Xinye
Priority: Critical

The problem here is that if an exception is thrown during the read process the method will exit without closing the stream and hence without releasing the file system resources, it may run out of resources before it does run.
[jira] [Resolved] (ZOOKEEPER-2218) Close IO Streams in finally block
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tang Xinye resolved ZOOKEEPER-2218.

Resolution: Fixed
Release Note: Place the close method in the finally clause, so we can ensure it always runs regardless of how the method exits.

Issue resolved by pull request https://github.com/apache/zookeeper/pull/36

Close IO Streams in finally block

Key: ZOOKEEPER-2218
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2218
Project: ZooKeeper
Issue Type: Bug
Reporter: Tang Xinye
Priority: Critical

The problem here is that if an exception is thrown during the read process the method will exit without closing the stream and hence without releasing the file system resources, it may run out of resources before it does run.
[GitHub] zookeeper pull request: #ZOOKEEPER-2218 Close IO Streams in finall...
GitHub user sugartxy opened a pull request:

    https://github.com/apache/zookeeper/pull/36

    #ZOOKEEPER-2218 Close IO Streams in finally block

    Place the close method in the finally clause, so we can ensure it always runs regardless of how the method exits.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sugartxy/zookeeper CloseRightly

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/36.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #36

commit 90745d7476504630c1d68772c26546b28639ba91
Author: sugartxy tgt...@163.com
Date: 2015-06-23T02:36:11Z

    #ZOOKEEPER-2218 Close IO Streams in finally block

    Place the close method in the finally clause, so we can ensure it always runs regardless of how the method exits.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
[jira] [Updated] (ZOOKEEPER-1792) Observers don't need to keep an in-memory copy of last commited proposals
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raul Gutierrez Segales updated ZOOKEEPER-1792:

Summary: Observers don't need to keep an in-memory copy of last commited proposals (was: Observers don't need to keep the an in-memory copy of last commited proposals )

Observers don't need to keep an in-memory copy of last commited proposals

Key: ZOOKEEPER-1792
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1792
Project: ZooKeeper
Issue Type: Improvement
Reporter: Raul Gutierrez Segales
Priority: Minor

In FinalRequestProcessor.java#processRequest we have:

{noformat}
if (request.isQuorum()) {
    zks.getZKDatabase().addCommittedProposal(request);
}
{noformat}

but this is only useful to the leader since committed proposals are only used from LearnerHandler to sync up followers. I presume followers do need it as they might become a leader at any point. But observers have no need for them, so we could probably special case this for them and optimize the path for them.
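One way to picture the special case proposed in ZOOKEEPER-1792 is as a small predicate placed in front of `addCommittedProposal`. The Java sketch below is only an illustration under assumed names: `CommittedProposalPolicy`, its `Role` enum and `shouldKeep` are hypothetical and do not come from the ZooKeeper code base, and the real server would need to find out its own role in whatever way the code already provides.

```java
// Hypothetical policy class for illustration; not part of ZooKeeper.
class CommittedProposalPolicy {
    enum Role { LEADER, FOLLOWER, OBSERVER }

    // Keep committed proposals only for quorum requests on servers that may
    // have to sync learners: the leader, or a follower that could be elected.
    // Observers never take part in that path, so they skip the bookkeeping.
    static boolean shouldKeep(Role role, boolean isQuorumRequest) {
        return isQuorumRequest && role != Role.OBSERVER;
    }
}
```

In terms of the snippet quoted in the issue, the existing `request.isQuorum()` condition would be combined with a role check of this kind.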
Re: Review Request 35204: ZOOKEEPER-2193: reconfig command completes even if parameter is wrong obviously
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35204/

(Updated June 23, 2015, 1:39 p.m.)

Review request for zookeeper.

Bugs: ZOOKEEPER-2193
    https://issues.apache.org/jira/browse/ZOOKEEPER-2193

Repository: zookeeper-git

Description

See ZOOKEEPER-2193

Diffs (updated)

  src/java/main/org/apache/zookeeper/server/PrepRequestProcessor.java eb045de19c9eeb632e5f2b98c5465abcaead7740
  src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java f15f831701f9c8514db5003ebd550cd3880b48c7

Diff: https://reviews.apache.org/r/35204/diff/

Testing

Thanks,

Yasuhito Fukuda
[jira] [Commented] (ZOOKEEPER-2218) Close IO Streams in finally block
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597135#comment-14597135 ]

Tang Xinye commented on ZOOKEEPER-2218:

oops! still learning, sorry for the mistake!

Close IO Streams in finally block

Key: ZOOKEEPER-2218
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2218
Project: ZooKeeper
Issue Type: Bug
Reporter: Tang Xinye
Priority: Critical

The problem here is that if an exception is thrown during the read process the method will exit without closing the stream and hence without releasing the file system resources, it may run out of resources before it does run.
[jira] [Updated] (ZOOKEEPER-2193) reconfig command completes even if parameter is wrong obviously
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yasuhito Fukuda updated ZOOKEEPER-2193:

Attachment: ZOOKEEPER-2193-v8.patch

reconfig command completes even if parameter is wrong obviously

Key: ZOOKEEPER-2193
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2193
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection, server
Affects Versions: 3.5.0
Environment: CentOS7 + Java7
Reporter: Yasuhito Fukuda
Assignee: Yasuhito Fukuda
Attachments: ZOOKEEPER-2193-v2.patch, ZOOKEEPER-2193-v3.patch, ZOOKEEPER-2193-v4.patch, ZOOKEEPER-2193-v5.patch, ZOOKEEPER-2193-v6.patch, ZOOKEEPER-2193-v7.patch, ZOOKEEPER-2193-v8.patch, ZOOKEEPER-2193.patch

The reconfig command completes even when its parameter is obviously wrong. Refer to the following.

- Ensemble consists of four nodes
{noformat}
[zk: vm-101:2181(CONNECTED) 0] config
server.1=192.168.100.101:2888:3888:participant
server.2=192.168.100.102:2888:3888:participant
server.3=192.168.100.103:2888:3888:participant
server.4=192.168.100.104:2888:3888:participant
version=1
{noformat}

- Add a node with the reconfig command
{noformat}
[zk: vm-101:2181(CONNECTED) 9] reconfig -add server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
Committed new configuration:
server.1=192.168.100.101:2888:3888:participant
server.2=192.168.100.102:2888:3888:participant
server.3=192.168.100.103:2888:3888:participant
server.4=192.168.100.104:2888:3888:participant
server.5=192.168.100.104:2888:3888:participant;0.0.0.0:2181
version=30007
{noformat}

server.4 and server.5 have the same IP address. In this state, leader election will not work properly, and the ensemble is likely to end up in an undesirable state. I think reconfig needs parameter validation.
[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597203#comment-14597203 ]

Ziyou Wang commented on ZOOKEEPER-2172:

Thanks for looking into this. I have always suspected that this problem is related to the sync, because I need to wait longer to avoid it when the cluster is running on a slow disk. I uploaded the log files after adding logging to record the quorum packet type.

Cluster crashes when reconfig a new node as a participant

Key: ZOOKEEPER-2172
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection, quorum, server
Affects Versions: 3.5.0
Environment: Ubuntu 12.04 + java 7
Reporter: Ziyou Wang
Priority: Critical
Attachments: node-1.log, node-2.log, node-3.log, zoo-1.log, zoo-2-1.log, zoo-2-2.log, zoo-2-3.log, zoo-2.log, zoo-2212-1.log, zoo-2212-2.log, zoo-2212-3.log, zoo-3-1.log, zoo-3-2.log, zoo-3-3.log, zoo-3.log, zoo-4-1.log, zoo-4-2.log, zoo-4-3.log, zoo.cfg.dynamic.1005d, zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-2.log, zookeeper-3.log

The operations are quite simple: start three zk servers one by one, then reconfig the cluster to add the new one as a participant. When I add the third one, the zk cluster may enter a weird state and cannot recover.

I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1 cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in node-1 log. So the first node received the reconfig cmd at 12:53:48. Later, it logged “2015-04-20 12:53:52,230 [myid:1] - ERROR [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - *** GOODBYE /10.0.0.2:55890 ”. From then on, the first node and second node rejected all client connections and the third node didn’t join the cluster as a participant. The whole cluster was down.

When the problem happened, all three nodes just used the same dynamic config file zoo.cfg.dynamic.1005d which only contained the first two nodes. But there was another unused dynamic config file in node-1 directory zoo.cfg.dynamic.next which already contained three nodes.

When I extended the waiting time between starting the third node and reconfiguring the cluster, the problem didn’t show again. So it should be a race condition problem.
[jira] [Updated] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ziyou Wang updated ZOOKEEPER-2172:

Attachment: zoo-4-3.log
            zoo-4-2.log
            zoo-4-1.log

Added logging to record quorum packet types.

Cluster crashes when reconfig a new node as a participant

Key: ZOOKEEPER-2172
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection, quorum, server
Affects Versions: 3.5.0
Environment: Ubuntu 12.04 + java 7
Reporter: Ziyou Wang
Priority: Critical
Attachments: node-1.log, node-2.log, node-3.log, zoo-1.log, zoo-2-1.log, zoo-2-2.log, zoo-2-3.log, zoo-2.log, zoo-2212-1.log, zoo-2212-2.log, zoo-2212-3.log, zoo-3-1.log, zoo-3-2.log, zoo-3-3.log, zoo-3.log, zoo-4-1.log, zoo-4-2.log, zoo-4-3.log, zoo.cfg.dynamic.1005d, zoo.cfg.dynamic.next, zookeeper-1.log, zookeeper-2.log, zookeeper-3.log

The operations are quite simple: start three zk servers one by one, then reconfig the cluster to add the new one as a participant. When I add the third one, the zk cluster may enter a weird state and cannot recover.

I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1 cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in node-1 log. So the first node received the reconfig cmd at 12:53:48. Later, it logged “2015-04-20 12:53:52,230 [myid:1] - ERROR [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - *** GOODBYE /10.0.0.2:55890 ”. From then on, the first node and second node rejected all client connections and the third node didn’t join the cluster as a participant. The whole cluster was down.

When the problem happened, all three nodes just used the same dynamic config file zoo.cfg.dynamic.1005d which only contained the first two nodes. But there was another unused dynamic config file in node-1 directory zoo.cfg.dynamic.next which already contained three nodes.

When I extended the waiting time between starting the third node and reconfiguring the cluster, the problem didn’t show again. So it should be a race condition problem.
[jira] [Commented] (ZOOKEEPER-1907) Improve Thread handling
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595965#comment-14595965 ]

Rakesh R commented on ZOOKEEPER-1907:

As per the [discussion|https://issues.apache.org/jira/browse/ZOOKEEPER-602?focusedCommentId=14547208&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14547208], I am re-opening this JIRA to backport the changes to {{branch-3.4}}. I will prepare a patch some time later this week.

Improve Thread handling

Key: ZOOKEEPER-1907
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1907
Project: ZooKeeper
Issue Type: Improvement
Components: server
Affects Versions: 3.5.0
Reporter: Rakesh R
Assignee: Rakesh R
Fix For: 3.5.1, 3.6.0
Attachments: ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch

The server has many critical threads running and coordinating with each other, such as the RequestProcessor chains. Going through these threads, most of them have a similar structure:

{code}
public void run() {
    try {
        while (running) {
            // processing logic
        }
    } catch (InterruptedException e) {
        LOG.error("Unexpected interruption", e);
    } catch (Exception e) {
        LOG.error("Unexpected exception", e);
    }
    LOG.info("...exited loop!");
}
{code}

From the design, I can see that a thread could exit silently because the exception is swallowed. If this happens in production, the server would hang forever and would not be able to fulfill its role, and it is hard for a management tool to detect this. The idea of this JIRA is to discuss and improve this.

Reference: [Community discussion thread|http://mail-archives.apache.org/mod_mbox/zookeeper-user/201403.mbox/%3cc2496325850aa74c92aaf83aa9662d26458a1...@szxeml561-mbx.china.huawei.com%3E]
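One possible direction for the improvement discussed in ZOOKEEPER-1907 is to report unexpected exceptions to an owner instead of only logging them. The Java sketch below illustrates that idea under assumed names: `FailureListener` and `ReportingThread` are hypothetical and are not the classes used in the attached patches.

```java
// Hypothetical callback through which a thread reports why it stopped.
interface FailureListener {
    void onThreadFailure(String threadName, Throwable cause);
}

// Hypothetical base class: a sketch of "report instead of swallow".
abstract class ReportingThread extends Thread {
    private final FailureListener listener;

    ReportingThread(String name, FailureListener listener) {
        super(name);
        this.listener = listener;
    }

    // Subclasses put their processing loop here and may let exceptions escape.
    protected abstract void process() throws Exception;

    @Override
    public final void run() {
        try {
            process();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            listener.onThreadFailure(getName(), e);
        } catch (Exception e) {
            listener.onThreadFailure(getName(), e); // surface the failure to the owner
        }
    }
}
```

The listener gives the server one place to decide what to do when a critical thread dies, for example marking itself unhealthy or shutting down, so that a management tool has something to detect.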
[jira] [Reopened] (ZOOKEEPER-1907) Improve Thread handling
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rakesh R reopened ZOOKEEPER-1907:

Improve Thread handling

Key: ZOOKEEPER-1907
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1907
Project: ZooKeeper
Issue Type: Improvement
Components: server
Affects Versions: 3.5.0
Reporter: Rakesh R
Assignee: Rakesh R
Fix For: 3.5.1, 3.6.0
Attachments: ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch, ZOOKEEPER-1907.patch

The server has many critical threads running and coordinating with each other, such as the RequestProcessor chains. Going through these threads, most of them have a similar structure:

{code}
public void run() {
    try {
        while (running) {
            // processing logic
        }
    } catch (InterruptedException e) {
        LOG.error("Unexpected interruption", e);
    } catch (Exception e) {
        LOG.error("Unexpected exception", e);
    }
    LOG.info("...exited loop!");
}
{code}

From the design, I can see that a thread could exit silently because the exception is swallowed. If this happens in production, the server would hang forever and would not be able to fulfill its role, and it is hard for a management tool to detect this. The idea of this JIRA is to discuss and improve this.

Reference: [Community discussion thread|http://mail-archives.apache.org/mod_mbox/zookeeper-user/201403.mbox/%3cc2496325850aa74c92aaf83aa9662d26458a1...@szxeml561-mbx.china.huawei.com%3E]
[jira] [Commented] (ZOOKEEPER-602) log all exceptions not caught by ZK threads
[ https://issues.apache.org/jira/browse/ZOOKEEPER-602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595961#comment-14595961 ]

Rakesh R commented on ZOOKEEPER-602:

Thank you [~rgs] for the reviews and commit. Also, thank you [~fpj], [~hdeng] for the help in reviews.

As per the [discussions in this jira|https://issues.apache.org/jira/browse/ZOOKEEPER-602?focusedCommentId=14547208&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14547208], I will re-open ZOOKEEPER-1907 for backporting it to {{branch-3.4}}.

log all exceptions not caught by ZK threads

Key: ZOOKEEPER-602
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-602
Project: ZooKeeper
Issue Type: Bug
Components: java client, server
Affects Versions: 3.2.1
Reporter: Patrick Hunt
Assignee: Rakesh R
Priority: Blocker
Fix For: 3.4.7, 3.5.0
Attachments: ZOOKEEPER-602-br3-4.patch, ZOOKEEPER-602.patch, ZOOKEEPER-602.patch, ZOOKEEPER-602.patch, ZOOKEEPER-602.patch, ZOOKEEPER-602.patch, ZOOKEEPER-602.patch, ZOOKEEPER-602.patch

the java code should add a ThreadGroup exception handler that logs at ERROR level any uncaught exceptions thrown by Thread run methods.
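The ThreadGroup handler that ZOOKEEPER-602 asks for can be sketched in plain Java as below. This is an illustration rather than the committed fix: `LoggingThreadGroup` is a hypothetical name, and `java.util.logging` with `Level.SEVERE` stands in for the project's own logger and its ERROR level so that the example stays self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical class; shows ThreadGroup.uncaughtException as a logging hook.
class LoggingThreadGroup extends ThreadGroup {
    private static final Logger LOG = Logger.getLogger(LoggingThreadGroup.class.getName());

    // Remembered only so callers (and tests) can inspect the last failure.
    private volatile Throwable lastUncaught;

    LoggingThreadGroup(String name) {
        super(name);
    }

    // Called by the JVM when a thread in this group dies from an uncaught
    // exception and has no handler of its own.
    @Override
    public void uncaughtException(Thread t, Throwable e) {
        lastUncaught = e;
        LOG.log(Level.SEVERE, "Uncaught exception in thread " + t.getName(), e);
    }

    Throwable lastUncaught() {
        return lastUncaught;
    }
}
```

Threads created with `new Thread(group, runnable, name)` inherit this behavior without any change to their `run` methods, which is presumably what makes the thread-group approach attractive for covering every server thread at once.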