[
https://issues.apache.org/jira/browse/ZOOKEEPER-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114748#comment-15114748
]
Rakesh R commented on ZOOKEEPER-2247:
-------------------------------------
Thanks Flavio for pointing out the multiple execution paths.
bq. Could anyone explain to me why we aren't simply relying on the finally
blocks?
When there is an uncaught exception thrown by any of the internal critical
threads, QuorumPeer has no mechanism to learn about that internal error
state. It still continues with #readPacket(). For example,
[Follower.java#L88|https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L88]
will keep reading without knowing about the error. For the finally
blocks to execute, there must be a way to stop this reading logic. So as part of
the ZOOKEEPER-1907 design discussions, the idea came up to introduce a
listening mechanism which will take action and gracefully bring down the
QuorumPeer. This introduced another execution path that changes the state of the
server.
bq. If we can do it, I'd much rather have this option implemented rather than
multiple code paths that change the state of the server.
I understand your point. How about introducing a polling mechanism in
QuorumPeer? Presently ZooKeeperServerListener takes the decision to
shut down the server; instead, ZooKeeperServerListener would only mark the
internal error state. Later, while polling, QuorumPeer would see this error
and exit the loop gracefully.
The idea is that the ZooKeeper server would maintain an
{{internalErrorState}} flag, which would then be consulted by QuorumPeer while
reading packets. If QuorumPeer sees the error, it would break out and execute
the finally block. On the other side, all the threads would use
ZooKeeperServerListener, which listens for unexpected errors and notifies the
QuorumPeer about the error by calling {{zk.setInternalErrorState(true)}}.
QuorumPeer would then have logic like:
{code}
while (self.isRunning() && !zk.hasInternalError()) {
readPacket(qp);
processPacket(qp);
}
{code}
A similar polling mechanism would have to be introduced in the standalone server
[ZooKeeperServerMain.java#L149|https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServerMain.java#L149]
as well.
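The mark-then-poll idea above can be sketched as a small self-contained example. All names here ({{internalError}}, {{notifyStopping}}, the class name) are illustrative stand-ins for the proposed flag and for the listener, not the actual ZooKeeper API: the listener only records the error, and the QuorumPeer-style loop polls the flag and exits on its own, so its finally block would run normally.
{code}
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the flag-based shutdown proposed above.
// "internalError" stands in for the proposed internalErrorState, and
// notifyStopping() for the listener that today shuts the server down directly.
public class InternalErrorStateSketch {

    // flag set by the listener thread, polled by the main loop
    static final AtomicBoolean internalError = new AtomicBoolean(false);

    // listener side: only record the error, do not tear anything down
    static void notifyStopping(String threadName, int errorCode) {
        System.out.println("Thread " + threadName + " exits, error code " + errorCode);
        internalError.set(true);
    }

    public static void main(String[] args) throws InterruptedException {
        // simulate a critical thread hitting an unrecoverable error
        Thread syncThread = new Thread(() -> notifyStopping("SyncThread:100", 1));
        syncThread.start();
        syncThread.join();

        // QuorumPeer-style loop: exits by itself once the flag is observed
        while (!internalError.get()) {
            // stands in for readPacket(qp); processPacket(qp);
        }
        System.out.println("loop exited gracefully");
    }
}
{code}
The point of this shape is that there is only one execution path that changes the state of the server: the loop itself, which then falls through to its own finally block.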
I don't think we need to worry about the other internal exceptions which can
occur before the ZK server enters the #readPacket() state
[Follower.java#L88|https://github.com/apache/zookeeper/blob/trunk/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L88].
I expect all of those errors to propagate out and stop the server gracefully.
Please correct me if I'm missing any other cases.
> Zookeeper service becomes unavailable when leader fails to write transaction
> log
> --------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-2247
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2247
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.5.0
> Reporter: Arshad Mohammad
> Assignee: Arshad Mohammad
> Priority: Critical
> Fix For: 3.4.8, 3.5.2
>
> Attachments: ZOOKEEPER-2247-01.patch, ZOOKEEPER-2247-02.patch,
> ZOOKEEPER-2247-03.patch, ZOOKEEPER-2247-04.patch, ZOOKEEPER-2247-05.patch,
> ZOOKEEPER-2247-06.patch
>
>
> Zookeeper service becomes unavailable when leader fails to write transaction
> log. Below are the exceptions:
> {code}
> 2015-08-14 15:41:18,556 [myid:100] - ERROR
> [SyncThread:100:ZooKeeperCriticalThread@48] - Severe unrecoverable error,
> from thread : SyncThread:100
> java.io.IOException: Input/output error
> at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
> at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
> at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:376)
> at
> org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:331)
> at
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:380)
> at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:563)
> at
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:178)
> at
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:113)
> 2015-08-14 15:41:18,559 [myid:100] - INFO
> [SyncThread:100:ZooKeeperServer$ZooKeeperServerListenerImpl@500] - Thread
> SyncThread:100 exits, error code 1
> 2015-08-14 15:41:18,559 [myid:100] - INFO
> [SyncThread:100:ZooKeeperServer@523] - shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO
> [SyncThread:100:SessionTrackerImpl@232] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO
> [SyncThread:100:LeaderRequestProcessor@77] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO
> [SyncThread:100:PrepRequestProcessor@1035] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO
> [SyncThread:100:ProposalRequestProcessor@88] - Shutting down
> 2015-08-14 15:41:18,561 [myid:100] - INFO
> [SyncThread:100:CommitProcessor@356] - Shutting down
> 2015-08-14 15:41:18,561 [myid:100] - INFO
> [CommitProcessor:100:CommitProcessor@191] - CommitProcessor exited loop!
> 2015-08-14 15:41:18,562 [myid:100] - INFO
> [SyncThread:100:Leader$ToBeAppliedRequestProcessor@915] - Shutting down
> 2015-08-14 15:41:18,562 [myid:100] - INFO
> [SyncThread:100:FinalRequestProcessor@646] - shutdown of request processor
> complete
> 2015-08-14 15:41:18,562 [myid:100] - INFO
> [SyncThread:100:SyncRequestProcessor@191] - Shutting down
> 2015-08-14 15:41:18,563 [myid:100] - INFO [ProcessThread(sid:100
> cport:-1)::PrepRequestProcessor@159] - PrepRequestProcessor exited loop!
> {code}
> After this exception the leader server still remains leader. After such a
> non-recoverable exception the leader should go down and let one of the
> followers become leader.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)