[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14976429#comment-14976429
 ] 

Rakesh R commented on ZOOKEEPER-2247:
-------------------------------------

Good work [~arshad.mohammad], thanks for taking care this. Overall patch looks 
nice, just few more comments, please take a look at it.

# Could you please try to use existing CountdownWatcher#waitForConnected() 
function rather than SimpleSysTest#waitForConnect. Reference 
[RemoveWatchesTest.java#L770|https://github.com/apache/zookeeper/blob/trunk/src/java/test/org/apache/zookeeper/RemoveWatchesTest.java#L770]
# I could see there are few improvements {{TestUtils.deleteFileRecursively}} 
done along with this jira, its really great work and I appreciate your 
initiatives. But I think this is not related to this jira. I'd suggest to push 
these changes through separate jira. IMHO, that would help the reviewers 
focusing only on the changes(minimal code changes) needed to handle the defect 
scenario and resolve the defect as early as possible.
# Generally constant sleeping in test case is problematic. One alternative 
approach I can quickly suggest is to use sliced sleeping in a while loop and 
check for new LE occurred. Each slice could be 500 milliseconds window and 
while loop can iterate/counter 20times or 30times before failing, probably 
would give overall bigger timeout.
{code}
+        
+        // give enough time, so that new leader election takes place
+        Thread.sleep(5000);
{code}
# I hope the below exception is expected. In that case, it would be good to add 
Assert.fail("Must throw exception as____") statement before catching the 
exception. That would make the test more better.
{code}
+        try {
+            // do create operation, so that injected IOException is thrown
+            zk.create(uniqueNode(), data.getBytes(), Ids.OPEN_ACL_UNSAFE,
+                    CreateMode.PERSISTENT);
//////////Include Assert.fail("Must throw exception as____") statement 
+        } catch (Exception e) {
+            /**
+             * intended IOException might have already thrown. so at this point
+             * of time connection may not be available
+             */
+            e.printStackTrace();
+        }
{code}
# minor suggestion: I'd prefer to add assert message in the unit test for 
better readability/debugging. Please add assertion messages.
{code}
assertNotNull(leader);
assertNotNull(currentleader);
assertEquals(uniqueNode, createNode);
{code}

> Zookeeper service becomes unavailable when leader fails to write transaction 
> log
> --------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2247
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2247
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.5.0
>            Reporter: Arshad Mohammad
>            Assignee: Arshad Mohammad
>            Priority: Critical
>             Fix For: 3.5.2
>
>         Attachments: ZOOKEEPER-2247-01.patch, ZOOKEEPER-2247-02.patch, 
> ZOOKEEPER-2247-03.patch, ZOOKEEPER-2247-04.patch, ZOOKEEPER-2247-05.patch
>
>
> Zookeeper service becomes unavailable when leader fails to write transaction 
> log. Bellow are the exceptions
> {code}
> 2015-08-14 15:41:18,556 [myid:100] - ERROR 
> [SyncThread:100:ZooKeeperCriticalThread@48] - Severe unrecoverable error, 
> from thread : SyncThread:100
> java.io.IOException: Input/output error
>       at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
>       at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
>       at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:376)
>       at 
> org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:331)
>       at 
> org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:380)
>       at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:563)
>       at 
> org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:178)
>       at 
> org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:113)
> 2015-08-14 15:41:18,559 [myid:100] - INFO  
> [SyncThread:100:ZooKeeperServer$ZooKeeperServerListenerImpl@500] - Thread 
> SyncThread:100 exits, error code 1
> 2015-08-14 15:41:18,559 [myid:100] - INFO  
> [SyncThread:100:ZooKeeperServer@523] - shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO  
> [SyncThread:100:SessionTrackerImpl@232] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO  
> [SyncThread:100:LeaderRequestProcessor@77] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO  
> [SyncThread:100:PrepRequestProcessor@1035] - Shutting down
> 2015-08-14 15:41:18,560 [myid:100] - INFO  
> [SyncThread:100:ProposalRequestProcessor@88] - Shutting down
> 2015-08-14 15:41:18,561 [myid:100] - INFO  
> [SyncThread:100:CommitProcessor@356] - Shutting down
> 2015-08-14 15:41:18,561 [myid:100] - INFO  
> [CommitProcessor:100:CommitProcessor@191] - CommitProcessor exited loop!
> 2015-08-14 15:41:18,562 [myid:100] - INFO  
> [SyncThread:100:Leader$ToBeAppliedRequestProcessor@915] - Shutting down
> 2015-08-14 15:41:18,562 [myid:100] - INFO  
> [SyncThread:100:FinalRequestProcessor@646] - shutdown of request processor 
> complete
> 2015-08-14 15:41:18,562 [myid:100] - INFO  
> [SyncThread:100:SyncRequestProcessor@191] - Shutting down
> 2015-08-14 15:41:18,563 [myid:100] - INFO  [ProcessThread(sid:100 
> cport:-1)::PrepRequestProcessor@159] - PrepRequestProcessor exited loop!
> {code}
> After this exception Leader server still remains leader. After this non 
> recoverable exception the leader should go down and let other followers 
> become leader.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to