[ https://issues.apache.org/jira/browse/ZOOKEEPER-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131644#comment-15131644 ]
Rakesh R commented on ZOOKEEPER-2247:
-------------------------------------

Thanks [~fpj] for the comments.

bq. Can we please use 30s by default?
Agreed.

bq. Shouldn't it be while (zk.isRunning()) instead?
{code}
    public boolean isRunning() {
        return state == State.RUNNING || state == State.ERROR;
    }

    public boolean isStateRunning() {
        return state == State.RUNNING;
    }
{code}

State transitions:
1. At the beginning, the state is {{INITIAL}}.
2. After a successful start, the server state is updated to {{RUNNING}}.
3. When there is an internal error, the server state is updated to {{ERROR}}.
4. On shutdown, the server state is updated to {{SHUTDOWN}}.

Standalone server watch logic:
The newly added watch logic periodically checks for the {{RUNNING}} state and exits the loop as soon as it sees any other state. {{zks.isRunning()}} returns true when the server is in either the {{RUNNING}} or the {{ERROR}} state. So if I use {{isRunning()}}, the loop will never exit in error situations, right?

bq. For the leader and learner, why is it isStateRunning here:
The same applies here. It should exit the {{readPacket}} function if the server is not {{RUNNING}}. With {{zks.isRunning()}}, it would never detect the {{ERROR}} state and would continue reading packets, right?

The {{isRunning}} method reflects two states, {{running}} as well as {{running with an error}}, and I think that causes the confusion. I failed to find a better name for this function.
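To make the difference between the two checks concrete, here is a minimal, self-contained sketch. It is not the patch code: the class {{ServerStateSketch}}, the helper {{pollsUntilExit}}, and the bounded poll count are hypothetical names introduced only for illustration. Only the two predicates and the four state names come from the comment above. The loop is capped at {{maxPolls}} so that the "never exits" case terminates in the example.

```java
import java.util.function.BooleanSupplier;

// Illustrative sketch only; names other than the two predicates are hypothetical.
class ServerStateSketch {
    public enum State { INITIAL, RUNNING, SHUTDOWN, ERROR }

    private volatile State state = State.INITIAL;

    public void setState(State newState) {
        state = newState;
    }

    // Same predicate as quoted in the comment: true for RUNNING and for ERROR.
    public boolean isRunning() {
        return state == State.RUNNING || state == State.ERROR;
    }

    // Same predicate as quoted in the comment: true only for RUNNING.
    public boolean isStateRunning() {
        return state == State.RUNNING;
    }

    // Hypothetical stand-in for a watch loop: polls keepGoing until it returns
    // false or maxPolls is reached, and injects an ERROR at poll number errorAtPoll.
    public static int pollsUntilExit(ServerStateSketch zks, BooleanSupplier keepGoing,
                                     int errorAtPoll, int maxPolls) {
        int polls = 0;
        while (polls < maxPolls && keepGoing.getAsBoolean()) {
            polls++;
            if (polls == errorAtPoll) {
                zks.setState(State.ERROR); // simulate an internal error
            }
        }
        return polls;
    }

    public static void main(String[] args) {
        ServerStateSketch a = new ServerStateSketch();
        a.setState(State.RUNNING);
        // Loop guarded by isStateRunning() stops right after the error is set.
        System.out.println(pollsUntilExit(a, a::isStateRunning, 3, 10)); // prints 3

        ServerStateSketch b = new ServerStateSketch();
        b.setState(State.RUNNING);
        // Loop guarded by isRunning() keeps going until the artificial cap.
        System.out.println(pollsUntilExit(b, b::isRunning, 3, 10)); // prints 10
    }
}
```

In this sketch the loop guarded by {{isStateRunning()}} stops on the first check after the state becomes {{ERROR}}, while the loop guarded by {{isRunning()}} only stops because of the artificial cap.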
> Zookeeper service becomes unavailable when leader fails to write transaction log
> --------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2247
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2247
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.5.0
>            Reporter: Arshad Mohammad
>            Assignee: Arshad Mohammad
>            Priority: Critical
>             Fix For: 3.4.9, 3.5.2, 3.6.0
>
>         Attachments: ZOOKEEPER-2247-01.patch, ZOOKEEPER-2247-02.patch, ZOOKEEPER-2247-03.patch, ZOOKEEPER-2247-04.patch, ZOOKEEPER-2247-05.patch, ZOOKEEPER-2247-06.patch, ZOOKEEPER-2247-07.patch, ZOOKEEPER-2247-09.patch, ZOOKEEPER-2247-10.patch, ZOOKEEPER-2247-11.patch, ZOOKEEPER-2247-12.patch, ZOOKEEPER-2247-b3.5.patch
>
>
> Zookeeper service becomes unavailable when the leader fails to write the transaction log. Below are the exceptions:
> {code}
> 2015-08-14 15:41:18,556 [myid:100] - ERROR [SyncThread:100:ZooKeeperCriticalThread@48] - Severe unrecoverable error, from thread : SyncThread:100
> java.io.IOException: Input/output error
> 	at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
> 	at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
> 	at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:376)
> 	at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:331)
> 	at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:380)
> 	at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:563)
> 	at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:178)
> 	at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:113)
> 2015-08-14 15:41:18,559 [myid:100] - INFO [SyncThread:100:ZooKeeperServer$ZooKeeperServerListenerImpl@500] - Thread SyncThread:100 exits, error code 1
> 2015-08-14 15:41:18,559 [myid:100] - INFO [SyncThread:100:ZooKeeperServer@523] - shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:SessionTrackerImpl@232] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:LeaderRequestProcessor@77] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:PrepRequestProcessor@1035] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:ProposalRequestProcessor@88] - Shutting down
> 2015-08-14 15:41:18,561 [myid:100] - INFO [SyncThread:100:CommitProcessor@356] - Shutting down
> 2015-08-14 15:41:18,561 [myid:100] - INFO [CommitProcessor:100:CommitProcessor@191] - CommitProcessor exited loop!
> 2015-08-14 15:41:18,562 [myid:100] - INFO [SyncThread:100:Leader$ToBeAppliedRequestProcessor@915] - Shutting down
> 2015-08-14 15:41:18,562 [myid:100] - INFO [SyncThread:100:FinalRequestProcessor@646] - shutdown of request processor complete
> 2015-08-14 15:41:18,562 [myid:100] - INFO [SyncThread:100:SyncRequestProcessor@191] - Shutting down
> 2015-08-14 15:41:18,563 [myid:100] - INFO [ProcessThread(sid:100 cport:-1)::PrepRequestProcessor@159] - PrepRequestProcessor exited loop!
> {code}
> After this exception, the leader server still remains leader. After this non-recoverable exception, the leader should go down and let one of the followers become leader.