[
https://issues.apache.org/jira/browse/ZOOKEEPER-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131328#comment-15131328
]
Flavio Junqueira commented on ZOOKEEPER-2247:
---------------------------------------------
[~rakesh_r] One simple thing I'd like to have changed is the timeout of the
test cases. Can we please use 30s by default?
I also had the same observation about the exception that [~cnauroth] made, and
there are a couple of other things I don't understand. In this loop:
{noformat}
+    while (zkServer.isStateRunning()) {
+        try {
+            Thread.sleep(1000); // watch interval
+        } catch (InterruptedException ie) {
+            LOG.info("Thread interrupted");
+        }
+    }
{noformat}
Shouldn't it be {{while (zk.isRunning()) {}} instead?
For the leader and learner, why is it {{isStateRunning}} here:
{noformat}
+    public boolean isRunning() {
+        return self.isRunning() && zk.isStateRunning();
+    }
{noformat}
and not this:
{noformat}
+    public boolean isRunning() {
+        return self.isRunning() && zk.isRunning();
+    }
{noformat}
The rationale is that we are running only if both the peer and the server are
running, so checking just the server state isn't sufficient.
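
To make the rationale concrete, here is a minimal sketch of the combined check. Everything in it is hypothetical: {{RunningCheckSketch}}, {{Peer}}, {{Server}} and {{Participant}} are stand-ins invented for illustration, not the classes touched by the patch, and the way the real server decides that it is running is reduced here to a single flag.

```java
// Illustrative sketch only; these are NOT the ZooKeeper classes from the patch.
public class RunningCheckSketch {

    /** Hypothetical stand-in for the quorum peer ("self" in the patch). */
    static class Peer {
        private volatile boolean running = true;

        boolean isRunning() {
            return running;
        }

        void stop() {
            running = false;
        }
    }

    /** Hypothetical stand-in for the server ("zk" in the patch). */
    static class Server {
        // Assumption: the server's own notion of "running" is modelled as one flag.
        private volatile boolean running = false;

        boolean isRunning() {
            return running;
        }

        void setRunning(boolean value) {
            running = value;
        }
    }

    /** Hypothetical stand-in for a leader or learner that owns both objects. */
    static class Participant {
        private final Peer self;
        private final Server zk;

        Participant(Peer self, Server zk) {
            this.self = self;
            this.zk = zk;
        }

        /** Running only if the peer AND the server both report running. */
        boolean isRunning() {
            return self.isRunning() && zk.isRunning();
        }
    }

    public static void main(String[] args) {
        Peer self = new Peer();
        Server zk = new Server();
        Participant participant = new Participant(self, zk);

        System.out.println("server not started: " + participant.isRunning());
        zk.setRunning(true);
        System.out.println("both running: " + participant.isRunning());
        zk.setRunning(false);
        System.out.println("server stopped, peer alive: " + participant.isRunning());
    }
}
```

In this sketch the last case mirrors the situation in the issue: the peer is still alive while the server behind it has stopped, so the combined check has to report that we are no longer running.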
> Zookeeper service becomes unavailable when leader fails to write transaction
> log
> --------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-2247
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2247
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.5.0
> Reporter: Arshad Mohammad
> Assignee: Arshad Mohammad
> Priority: Critical
> Fix For: 3.4.9, 3.5.2, 3.6.0
>
> Attachments: ZOOKEEPER-2247-01.patch, ZOOKEEPER-2247-02.patch,
> ZOOKEEPER-2247-03.patch, ZOOKEEPER-2247-04.patch, ZOOKEEPER-2247-05.patch,
> ZOOKEEPER-2247-06.patch, ZOOKEEPER-2247-07.patch, ZOOKEEPER-2247-09.patch,
> ZOOKEEPER-2247-10.patch, ZOOKEEPER-2247-11.patch, ZOOKEEPER-2247-12.patch,
> ZOOKEEPER-2247-b3.5.patch
>
>
> Zookeeper service becomes unavailable when leader fails to write transaction
> log. Below are the exceptions:
> {code}
> 2015-08-14 15:41:18,556 [myid:100] - ERROR [SyncThread:100:ZooKeeperCriticalThread@48] - Severe unrecoverable error, from thread : SyncThread:100
> java.io.IOException: Input/output error
>     at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
>     at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
>     at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:376)
>     at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:331)
>     at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:380)
>     at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:563)
>     at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:178)
>     at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:113)
> 2015-08-14 15:41:18,559 [myid:100] - INFO [SyncThread:100:ZooKeeperServer$ZooKeeperServerListenerImpl@500] - Thread SyncThread:100 exits, error code 1
> 2015-08-14 15:41:18,559 [myid:100] - INFO [SyncThread:100:ZooKeeperServer@523] - shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:SessionTrackerImpl@232] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:LeaderRequestProcessor@77] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:PrepRequestProcessor@1035] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:ProposalRequestProcessor@88] - Shutting down
> 2015-08-14 15:41:18,561 [myid:100] - INFO [SyncThread:100:CommitProcessor@356] - Shutting down
> 2015-08-14 15:41:18,561 [myid:100] - INFO [CommitProcessor:100:CommitProcessor@191] - CommitProcessor exited loop!
> 2015-08-14 15:41:18,562 [myid:100] - INFO [SyncThread:100:Leader$ToBeAppliedRequestProcessor@915] - Shutting down
> 2015-08-14 15:41:18,562 [myid:100] - INFO [SyncThread:100:FinalRequestProcessor@646] - shutdown of request processor complete
> 2015-08-14 15:41:18,562 [myid:100] - INFO [SyncThread:100:SyncRequestProcessor@191] - Shutting down
> 2015-08-14 15:41:18,563 [myid:100] - INFO [ProcessThread(sid:100 cport:-1)::PrepRequestProcessor@159] - PrepRequestProcessor exited loop!
> {code}
> After this exception the leader server still remains leader. After such an
> unrecoverable exception, the leader should go down and let one of the
> followers become leader.