[ https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745870#action_12745870 ]
Flavio Paiva Junqueira commented on ZOOKEEPER-512: -------------------------------------------------- Two things: 1- I'm not sure what you've been searching for, so I don't have a pointer, but the behavior I expect is that if you get an IOException upon invoking a socket operation, then the operation won't be available after that. Am I not interpreting it correctly? 2- Visually expecting the logs, I was able to count about 20 successful leader elections. In the previous set of logs, I think servers got stuck around 5, so I see improvement after you modified your fault injection. Also, according to server 5, a leader was elected successfully. Here is the tail of the log of 5: {noformat} 2009-08-20 13:43:56,636 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2185:quorump...@508] - LEADING 2009-08-20 13:43:56,636 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2185:zookeeperser...@160] - Created server 2009-08-20 13:43:56,643 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2185:files...@81] - Reading snapshot ./localhost:2185/data/version-2/snapshot.1c0000001f 2009-08-20 13:43:56,699 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2185:filetxnsnap...@208] - Snapshotting: 1c0000001f 2009-08-20 13:43:56,844 - INFO [FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:43:56,845 - INFO [FollowerHandler-/127.0.0.1:55253:followerhand...@227] - Follower sid: 4 : info : org.apache.zookeeper.server.quorum.quorumpeer$quorumser...@1b1aa65 2009-08-20 13:43:56,845 - INFO [FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:43:56,845 - INFO [FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:43:56,846 - WARN [FollowerHandler-/127.0.0.1:55253:followerhand...@302] - Sending snapshot last zxid of peer is 0x1c0000001f zxid of leader is 0x1d00000000 2009-08-20 13:43:56,847 - INFO [FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:43:56,848 - INFO [FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:43:56,848 - INFO [FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:43:56,849 - INFO [FollowerHandler-/127.0.0.1:55254:followerhand...@227] - Follower sid: 2 : info : org.apache.zookeeper.server.quorum.quorumpeer$quorumser...@1ef9157 2009-08-20 13:43:56,849 - INFO [FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:43:56,850 - INFO [FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:43:56,850 - WARN [FollowerHandler-/127.0.0.1:55254:followerhand...@302] - Sending snapshot last zxid of peer is 0x1c0000001f zxid of leader is 0x1d00000000 2009-08-20 13:43:56,851 - INFO [FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:43:56,851 - WARN [FollowerHandler-/127.0.0.1:55254:lea...@452] - Commiting zxid 0x1d00000000 from /127.0.0.1:3185 not first! 2009-08-20 13:43:56,852 - WARN [FollowerHandler-/127.0.0.1:55254:lea...@454] - First is 0 2009-08-20 13:43:56,852 - INFO [FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:43:59,434 - INFO [Thread-536:requestfailsintermitten...@120] - SOCKET REQUEST OK 2009-08-20 13:43:59,434 - INFO [Thread-536:requestfailsintermitten...@120] - SOCKET REQUEST OK 2009-08-20 13:43:59,435 - INFO [WorkerReceiver Thread:fastleaderelection$messenger$workerrecei...@254] - Sending new notification. 2009-08-20 13:43:59,435 - INFO [Thread-535:requestfailsintermitten...@120] - SOCKET REQUEST OK 2009-08-20 13:43:59,846 - INFO [FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:43:59,848 - INFO [FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:00,765 - INFO [NIOServerCxn.Factory:2185:nioserverc...@698] - Processing stat command from /127.0.0.1:38350 2009-08-20 13:44:00,766 - WARN [NIOServerCxn.Factory:2185:nioserverc...@494] - Exception causing close of session 0x0 due to java.io.IOException: Responded to info probe 2009-08-20 13:44:00,766 - INFO [NIOServerCxn.Factory:2185:nioserverc...@833] - closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/127.0.0.1:2185 remote=/127.0.0.1:38350] 2009-08-20 13:44:00,846 - INFO [FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:00,848 - INFO [FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:01,847 - INFO [FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:01,848 - INFO [FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:02,847 - INFO [FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:02,848 - INFO [FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:03,847 - INFO [FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:03,849 - INFO [FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:04,847 - INFO [FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:04,849 - INFO [FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:05,847 - INFO [FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:05,849 - INFO [FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:06,849 - INFO [FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD REQUEST OK 2009-08-20 13:44:06,849 - INFO [FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD REQUEST OK {noformat} Inspecting the logs a little more, it is true, though, that 1 and 3 are stuck. It could be due to the challenge protocol as you suggest, so we should keep it in mind. > FLE election fails to elect leader > ---------------------------------- > > Key: ZOOKEEPER-512 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512 > Project: Zookeeper > Issue Type: Bug > Components: quorum, server > Affects Versions: 3.2.0 > Reporter: Patrick Hunt > Priority: Blocker > Fix For: 3.2.1, 3.3.0 > > Attachments: jst.txt, logs.tar.gz, logs2.tar.gz > > > I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch > applied and noticed that after some time the ensemble failed to re-elect a > leader. > See the attached log files - 5 member ensemble. typically 5 is the leader > Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes > elapses w/no quorum > environment: > I was doing fault injection testing using aspectj. The faults are injected > into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio > (rand.nextFloat() <= .005 => throw IOException > You can see when a fault is injected in the log via: > 2009-08-19 16:57:09,568 - INFO [Thread-74:readrequestfailsintermitten...@38] > - READPACKET FORCED FAIL > vs a read/write that didn't force fail: > 2009-08-19 16:57:09,568 - INFO [Thread-74:readrequestfailsintermitten...@41] > - READPACKET OK > otw standard code/config (straight fle quorum with 5 members) > also see the attached jstack trace. this is for one of the servers. Notice in > particular that the number of sendworkers != the number of recv workers. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.