[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader
[ https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745372#action_12745372 ]

Flavio Paiva Junqueira commented on ZOOKEEPER-512:

I'm not convinced this is a bug. Right now it sounds to me that the problem is with the way you're injecting faults. More concretely, it sounds like some threads are getting an IOException, but the corresponding socket is not closing. As recv and send workers come in pairs, if one dies and the other doesn't, we have a problem. At the same time, I believe the current code would eventually terminate a send/recv pair of workers if the socket closes. It is true, though, that the current code assumes that if RecvWorker catches an IOException when performing a socket operation, then the corresponding SendWorker will also catch an exception when trying to write to the socket. This is where I think your framework is broken, but please correct me if I'm missing anything.

FLE election fails to elect leader

Key: ZOOKEEPER-512
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
Project: Zookeeper
Issue Type: Bug
Components: quorum, server
Affects Versions: 3.2.0
Reporter: Patrick Hunt
Priority: Blocker
Fix For: 3.2.1, 3.3.0
Attachments: jst.txt, logs.tar.gz

I was doing some fault injection testing of 3.2.1 with the ZOOKEEPER-508 patch applied and noticed that after some time the ensemble failed to re-elect a leader. See the attached log files (5 member ensemble; typically 5 is the leader). Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes elapse with no quorum.

environment: I was doing fault injection testing using aspectj. The faults are injected into socketchannel read/write; I throw exceptions randomly at a 1/200 ratio (rand.nextFloat() <= .005 => throw IOException).

You can see when a fault is injected in the log via:
2009-08-19 16:57:09,568 - INFO [Thread-74:readrequestfailsintermitten...@38] - READPACKET FORCED FAIL
vs a read/write that didn't force fail:
2009-08-19 16:57:09,568 - INFO [Thread-74:readrequestfailsintermitten...@41] - READPACKET OK
Otherwise standard code/config (straight FLE quorum with 5 members).

Also see the attached jstack trace. This is for one of the servers. Notice in particular that the number of SendWorkers != the number of RecvWorkers.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
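For readers following the thread, the fault model described in the report (throw an IOException on roughly 1 in 200 read/write calls, without touching the socket) can be sketched in plain Java as below. This is only an illustration under stated assumptions: the reporter used an AspectJ aspect that is not shown in the thread, and the class name FaultInjector, the method maybeFail, and the message format are hypothetical.

```java
import java.io.IOException;
import java.util.Random;

/** Hypothetical stand-in for the AspectJ-based injector mentioned in the report. */
final class FaultInjector {
    private final Random rand;
    private final float ratio; // e.g. 0.005f for roughly 1 failure in 200 calls

    FaultInjector(Random rand, float ratio) {
        this.rand = rand;
        this.ratio = ratio;
    }

    /**
     * Call this just before a real read or write. It throws an IOException
     * when the random draw falls at or below the ratio and otherwise returns.
     * Note that it never closes the underlying socket.
     */
    void maybeFail(String op) throws IOException {
        if (rand.nextFloat() <= ratio) {
            throw new IOException(op + " FORCED FAIL");
        }
    }
}
```

A caller would invoke something like injector.maybeFail("READPACKET") ahead of the actual channel read. Because the socket stays open after the forced exception, the peer thread sharing that socket sees nothing wrong, which is the situation the first comment in this thread is concerned about.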
[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader
[ https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745475#action_12745475 ]

Patrick Hunt commented on ZOOKEEPER-512:

Your explanation sounds reasonable, but I don't see anything in the Java socket{channel} APIs that talks about this. Perhaps I missed it. Do you have a pointer to something that talks about this? (I did some searches and couldn't find anything.) Basically, why should we assume that any IOException results in the socket being closed?
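One way to avoid relying on that assumption is for whichever worker fails first to close the shared socket explicitly. The java.net.Socket documentation for close() states that any thread blocked in an I/O operation on the socket will then throw a SocketException, so the partner worker is forced out as well. The sketch below is not ZooKeeper's actual connection manager code; WorkerPair, isRunning, and onWorkerFailure are hypothetical names used only to illustrate the idea.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper shared by one send worker and one recv worker. */
final class WorkerPair {
    private final Closeable socket; // java.net.Socket implements Closeable
    private final AtomicBoolean running = new AtomicBoolean(true);

    WorkerPair(Closeable socket) {
        this.socket = socket;
    }

    /** Both workers would check this flag in their loops. */
    boolean isRunning() {
        return running.get();
    }

    /**
     * Called by whichever worker catches an IOException first. Only the
     * first call closes the socket; later calls are no-ops.
     */
    void onWorkerFailure() {
        if (running.compareAndSet(true, false)) {
            try {
                socket.close();
            } catch (IOException e) {
                // Best effort: the pair is already marked as stopped.
            }
        }
    }
}
```

With this shape, an exception in either worker stops both, regardless of whether the exception itself left the socket open.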
[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader
[ https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-512:

Attachment: logs2.tar.gz

Take a look at logs2. This is a similar fault injection model; however, I'm now doing:
sock.close()
throw IOException
rather than just throwing the IOException. Otherwise it is basically the same test as before. Notice that 1 drops off a lot earlier than the rest (seems due to its server id being the lowest?).
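The revised fault model (close the socket first, then throw) might look roughly like the following in plain Java. As before, this is a sketch under assumptions rather than the reporter's AspectJ code: ClosingFaultInjector and maybeCloseAndFail are hypothetical names, and the failure ratio is passed in rather than taken from the thread.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Random;

/** Hypothetical injector for the close-then-throw variant of the test. */
final class ClosingFaultInjector {
    private final Random rand;
    private final float ratio; // fraction of calls that should fail

    ClosingFaultInjector(Random rand, float ratio) {
        this.rand = rand;
        this.ratio = ratio;
    }

    /**
     * On a forced failure, closes the socket before throwing, so the peer
     * thread using the same socket also observes an error.
     */
    void maybeCloseAndFail(Closeable sock, String op) throws IOException {
        if (rand.nextFloat() <= ratio) {
            sock.close();
            throw new IOException(op + " FORCED FAIL");
        }
    }
}
```

The only difference from throwing alone is the sock.close() call, which is what makes the injected fault resemble a real connection failure from the point of view of the other worker.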
[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader
[ https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745632#action_12745632 ]

Patrick Hunt commented on ZOOKEEPER-512:

sorry, to be overly clear -- the same problem occurs in this case (close/throw) -- the quorum cannot be formed after some time.
[jira] Commented: (ZOOKEEPER-508) proposals and commits for DIFF and Truncate messages from the leader to followers is buggy.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745683#action_12745683 ]

Mahadev konar commented on ZOOKEEPER-508:

I figured out the reason why it fails on the assertion errors != 0. The whole scenario of 483 failing with truncate is this:
- the test case shuts down all the followers
- the leader does not realize that it has lost the leadership, because the interval at which we ping to see if the leader is still the leader (maybe 1 sec) is greater than the time the followers actually take to shut down, come back, and get in sync with the leader
- so the leader never shuts down any of its stuff (no NIO rejection or anything else)

So in your case, sometimes the client connects to the leader and will never see errors, so the assertion fails. On the other hand, sometimes it may pass on connection to other followers and your test case will pass. So we cannot really say that your test case is foolproof.

proposals and commits for DIFF and Truncate messages from the leader to followers is buggy.

Key: ZOOKEEPER-508
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-508
Project: Zookeeper
Issue Type: Bug
Components: quorum
Reporter: Mahadev konar
Assignee: Mahadev konar
Priority: Blocker
Fix For: 3.2.1, 3.3.0
Attachments: ZOOKEEPER-508.patch, ZOOKEEPER-508.patch

The proposals and commits sent by the leader after it asks the followers to truncate their logs or starts sending a diff have missing messages, which causes out-of-order commit messages and causes the followers to shut down because of these out-of-order commits.
[jira] Updated: (ZOOKEEPER-508) proposals and commits for DIFF and Truncate messages from the leader to followers is buggy.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-508:

Attachment: ZOOKEEPER-508.patch

This patch includes fixes for ZOOKEEPER-508, ZOOKEEPER-509, and ZOOKEEPER-483. It also includes the test cases for each of them.
[jira] Updated: (ZOOKEEPER-508) proposals and commits for DIFF and Truncate messages from the leader to followers is buggy.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mahadev konar updated ZOOKEEPER-508: Status: Patch Available (was: Open)
[jira] Updated: (ZOOKEEPER-483) ZK fataled on me, and ugly
[ https://issues.apache.org/jira/browse/ZOOKEEPER-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-483:

Status: Open (was: Patch Available)

included in ZOOKEEPER-508

ZK fataled on me, and ugly

Key: ZOOKEEPER-483
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-483
Project: Zookeeper
Issue Type: Bug
Affects Versions: 3.1.1
Reporter: ryan rawson
Assignee: Benjamin Reed
Fix For: 3.2.1, 3.3.0
Attachments: QuorumTest.log, QuorumTest.log.gz, zklogs.tar.gz, ZOOKEEPER-483.patch, ZOOKEEPER-483.patch, ZOOKEEPER-483.patch, ZOOKEEPER-483.patch

Here is the part of the log where my zookeeper instance crashed, taking 3 out of 5 down, and thus ruining the quorum for all clients:

2009-07-23 12:29:06,769 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x52276d1d5161350 due to java.io.IOException: Read error
2009-07-23 12:29:00,756 WARN org.apache.zookeeper.server.quorum.Follower: Exception when following the leader
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:65)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
    at org.apache.zookeeper.server.quorum.Follower.readPacket(Follower.java:114)
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:243)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:494)
2009-07-23 12:29:06,770 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x52276d1d5161350 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.168:39489]
2009-07-23 12:29:06,770 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x12276d15dfb0578 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.159:46797]
2009-07-23 12:29:06,771 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x42276d1d3fa013e NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.153:33998]
2009-07-23 12:29:06,771 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x52276d1d5160593 due to java.io.IOException: Read error
2009-07-23 12:29:06,808 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e02bb NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.158:53758]
2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x42276d1d3fa13e4 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.154:58681]
2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e691382 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.162:59967]
2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x12276d15dfb1354 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.163:49957]
2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x42276d1d3fa13cd NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.150:34212]
2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e691383 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.159:46813]
2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x12276d15dfb0350 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.162:59956]
2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e139b NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.156:55138]
2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e1398 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.167:41257]
2009-07-23 12:29:06,810 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x52276d1d5161355 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.153:34032]
2009-07-23 12:29:06,810 INFO org.apache.zookeeper.server.NIOServerCnxn: closing
[jira] Commented: (ZOOKEEPER-508) proposals and commits for DIFF and Truncate messages from the leader to followers is buggy.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745733#action_12745733 ]

Mahadev konar commented on ZOOKEEPER-508:

I have a patch for the 3.2 branch and will upload it as soon as Hudson is done running this patch.
[jira] Commented: (ZOOKEEPER-508) proposals and commits for DIFF and Truncate messages from the leader to followers is buggy.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745755#action_12745755 ]

Hadoop QA commented on ZOOKEEPER-508:

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12417189/ZOOKEEPER-508.patch against trunk revision 803300.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 9 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
-1 release audit. The applied patch generated 178 release audit warnings (more than the trunk's current 177 warnings).
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/188/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/188/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/188/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/188/console

This message is automatically generated.
[jira] Updated: (ZOOKEEPER-508) proposals and commits for DIFF and Truncate messages from the leader to followers is buggy.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-508:

Attachment: ZOOKEEPER-508.patch

Added the missing header to the file.