[jira] [Updated] (ZOOKEEPER-2867) an expired ZK session can be re-established
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao updated ZOOKEEPER-2867:
---
Attachment: zk.2.08-02 zk.0.08-02 zk.1.08-02

[~hanm], I am attaching the logs from the 3 ZK servers on Aug. 2. Here is what happened.
1. All 3 ZK servers went down around 23:16 and were restarted around 23:41.
2. The Kafka controller, which never went down during that window, was able to re-establish its ZK session around 23:41:58.
{code:java}
August 2nd 2017, 23:41:58.499 INFO org.apache.zookeeper.ClientCnxn Socket connection established to zookeeper-2.cp14.svc.cluster.local/100.71.124.93:2181, initiating session
{code}
3. Kafka broker 0 (non-controller) went down at 23:17:18 and was restarted at 23:42:12. Supposedly, the old ZK session for broker 0 (25cd1e82c110001) should have expired after the ZK servers were restarted (but that didn't seem to happen).

When does the clock for session expiration start after the ZK cluster is restarted? After the ZK cluster has elected a leader?

> an expired ZK session can be re-established
> ---
>
> Key: ZOOKEEPER-2867
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2867
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.10
> Reporter: Jun Rao
> Attachments: zk.0.08-02, zk.0.formatted, zk.1.08-02, zk.1.formatted, zk.2.08-02
>
> Not sure if this is a real bug, but I found an instance where a ZK client seems to be able to renew a session already expired by the ZK server.
> From the ZK server log, session 25cd1e82c110001 was expired at 22:04:39.
> {code:java}
> June 27th 2017, 22:04:39.000 INFO org.apache.zookeeper.server.ZooKeeperServer Expiring session 0x25cd1e82c110001, timeout of 12000ms exceeded
> June 27th 2017, 22:04:39.001 DEBUG org.apache.zookeeper.server.quorum.Leader Proposing:: sessionid:0x25cd1e82c110001 type:closeSession cxid:0x0 zxid:0x20fc4 txntype:-11 reqpath:n/a
> June 27th 2017, 22:04:39.001 INFO org.apache.zookeeper.server.PrepRequestProcessor Processed session termination for sessionid: 0x25cd1e82c110001
> June 27th 2017, 22:04:39.001 DEBUG org.apache.zookeeper.server.quorum.CommitProcessor Processing request:: sessionid:0x25cd1e82c110001 type:closeSession cxid:0x0 zxid:0x20fc4 txntype:-11 reqpath:n/a
> June 27th 2017, 22:05:20.324 INFO org.apache.zookeeper.server.quorum.Learner Revalidating client: 0x25cd1e82c110001
> June 27th 2017, 22:05:20.324 INFO org.apache.zookeeper.server.ZooKeeperServer Client attempting to renew session 0x25cd1e82c110001 at /100.96.5.6:47618
> June 27th 2017, 22:05:20.325 INFO org.apache.zookeeper.server.ZooKeeperServer Established session 0x25cd1e82c110001 with negotiated timeout 12000 for client /100.96.5.6:47618
> {code}
> From the ZK client's log, it was able to renew the expired session at 22:05:20.
> {code:java}
> June 27th 2017, 22:05:18.590 INFO org.apache.zookeeper.ClientCnxn Client session timed out, have not heard from server in 4485ms for sessionid 0x25cd1e82c110001, closing socket connection and attempting reconnect 0
> June 27th 2017, 22:05:18.590 WARN org.apache.zookeeper.ClientCnxn Client session timed out, have not heard from server in 4485ms for sessionid 0x25cd1e82c110001 0
> June 27th 2017, 22:05:19.325 WARN org.apache.zookeeper.ClientCnxn SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/opt/confluent/etc/kafka/server_jaas.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it. 0
> June 27th 2017, 22:05:19.326 INFO org.apache.zookeeper.ClientCnxn Opening socket connection to server 100.65.188.168/100.65.188.168:2181 0
> June 27th 2017, 22:05:20.324 INFO org.apache.zookeeper.ClientCnxn Socket connection established to 100.65.188.168/100.65.188.168:2181, initiating session 0
> June 27th 2017, 22:05:20.327 INFO org.apache.zookeeper.ClientCnxn Session establishment complete on server 100.65.188.168/100.65.188.168:2181, sessionid = 0x25cd1e82c110001, negotiated timeout = 12000 0
> {code}

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
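The report above rests on an expectation: once the server has expired a session, a later renewal for the same session id should be refused. A minimal sketch of that expectation follows. `SessionTableSketch` and its methods are hypothetical names made up for illustration; this is not ZooKeeper's session tracking code, and it says nothing about when the expiration clock starts after a cluster restart.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified session table; NOT ZooKeeper's SessionTracker. */
final class SessionTableSketch {
    // session id -> absolute deadline in milliseconds
    private final Map<Long, Long> deadlines = new HashMap<>();

    /** Registers a session whose deadline is nowMs + timeoutMs. */
    void create(long sessionId, long nowMs, int timeoutMs) {
        deadlines.put(sessionId, nowMs + timeoutMs);
    }

    /** Drops every session whose deadline has passed (the "Expiring session" step). */
    void expireOverdue(long nowMs) {
        deadlines.values().removeIf(deadline -> deadline <= nowMs);
    }

    /** A renewal succeeds only if the session is still in the table. */
    boolean renew(long sessionId, long nowMs, int timeoutMs) {
        if (!deadlines.containsKey(sessionId)) {
            return false; // already expired: the client would need a new session
        }
        deadlines.put(sessionId, nowMs + timeoutMs);
        return true;
    }

    public static void main(String[] args) {
        SessionTableSketch table = new SessionTableSketch();
        long id = 0x25cd1e82c110001L;      // session id taken from the logs above
        table.create(id, 0L, 12_000);      // 12000 ms timeout, as in the logs
        table.expireOverdue(12_000L);      // server-side expiry
        // Roughly 41 s later the client tries to renew, as in the report.
        System.out.println("renew after expiry accepted: " + table.renew(id, 53_000L, 12_000));
    }
}
```

Under these assumptions the renewal is rejected, which is the behaviour the reporter expected and the server log appears to contradict.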
[jira] [Assigned] (ZOOKEEPER-1360) QuorumTest.testNoLogBeforeLeaderEstablishment has several problems
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Abraham Fine reassigned ZOOKEEPER-1360:
---
Assignee: Abraham Fine (was: Henry Robinson)

> QuorumTest.testNoLogBeforeLeaderEstablishment has several problems
> --
>
> Key: ZOOKEEPER-1360
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1360
> Project: ZooKeeper
> Issue Type: Bug
> Components: tests
> Affects Versions: 3.4.2
> Reporter: Henry Robinson
> Assignee: Abraham Fine
> Fix For: 3.5.4, 3.6.0
>
> After the apparently valid fix to ZOOKEEPER-1294, testNoLogBeforeLeaderEstablishment is failing for me about one time in four.
> While I'll investigate whether the patch in 1294 is ultimately to blame, reading the test brought to light a number of issues that appear to be bugs or in need of improvement:
> * As part of QuorumTest, an ensemble is already established by the fixture setup code, but it is apparently unused by the test, which uses QuorumUtil.
> * The test reads QuorumPeer.leader and QuorumPeer.follower without synchronization, which means that writes to those fields may not be published when we come to read them.
> * The return value of sem.tryAcquire is never checked.
> * The progress of the test is based on ad-hoc timings (25 * 500ms sleeps) and inscrutable numbers of iterations through the main loop (e.g. the semaphore blocking the final asserts is released only after the 2nd of 5 callbacks)
> * The test as a whole takes ~30s to run
> The first three are easy to fix (as part of fixing the second, I intend to hide all members of QuorumPeer behind getters and setters); the fourth and fifth need a slightly deeper understanding of what the test is trying to achieve.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
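Two of the points in the issue above (unsynchronized field reads and the unchecked result of `sem.tryAcquire`) can be illustrated with a small sketch. `PeerSketch` and `CheckedAcquireSketch` are hypothetical stand-ins invented for this example; they are not the real `QuorumPeer` or the actual test, and the fix eventually applied to the test may look different.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a peer whose state is read by a test thread. */
final class PeerSketch {
    // volatile: a write by the peer thread is guaranteed visible to other threads
    private volatile String role = "LOOKING";

    String getRole() { return role; }
    void setRole(String newRole) { role = newRole; }
}

final class CheckedAcquireSketch {
    /** Waits for a signal and fails loudly instead of ignoring a timeout. */
    static void awaitCallback(Semaphore sem, long timeout, TimeUnit unit) {
        try {
            if (!sem.tryAcquire(timeout, unit)) {
                throw new AssertionError("callback did not arrive within " + timeout + " " + unit);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new AssertionError("interrupted while waiting for callback", e);
        }
    }

    public static void main(String[] args) {
        PeerSketch peer = new PeerSketch();
        Semaphore sem = new Semaphore(0);
        Thread worker = new Thread(() -> {
            peer.setRole("LEADING");
            sem.release(); // signal the waiting thread
        });
        worker.start();
        awaitCallback(sem, 5, TimeUnit.SECONDS);
        System.out.println("role seen by test thread: " + peer.getRole());
    }
}
```

In this particular sketch the semaphore hand-off already orders the write before the read; the `volatile` field matters for reads that are not preceded by such a hand-off, which is the case the issue describes.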
[jira] [Commented] (ZOOKEEPER-1360) QuorumTest.testNoLogBeforeLeaderEstablishment has several problems
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126316#comment-16126316 ]

Henry Robinson commented on ZOOKEEPER-1360:
---
Not at all - haven't looked at this in years!
[jira] [Commented] (ZOOKEEPER-1360) QuorumTest.testNoLogBeforeLeaderEstablishment has several problems
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126314#comment-16126314 ]

Abraham Fine commented on ZOOKEEPER-1360:
---
[~henryr] do you mind if I take a look at this?
ZooKeeper_branch34_openjdk7 - Build # 1610 - Failure
See https://builds.apache.org/job/ZooKeeper_branch34_openjdk7/1610/

### ## LAST 60 LINES OF THE CONSOLE ###
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on qnode1 (ubuntu) in workspace /home/jenkins/jenkins-slave/workspace/ZooKeeper_branch34_openjdk7
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url git://git.apache.org/zookeeper.git # timeout=10
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
Fetching upstream changes from git://git.apache.org/zookeeper.git
 > git --version # timeout=10
 > git fetch --tags --progress git://git.apache.org/zookeeper.git +refs/heads/*:refs/remotes/origin/*
 > git rev-parse refs/remotes/origin/branch-3.4^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/branch-3.4^{commit} # timeout=10
Checking out Revision 1f811a6281090e1b24152dc51507aa6a2bdeafe3 (refs/remotes/origin/branch-3.4)
Commit message: "ZOOKEEPER-2859: Fix CMake build on OS X."
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 1f811a6281090e1b24152dc51507aa6a2bdeafe3
 > git rev-list 1f811a6281090e1b24152dc51507aa6a2bdeafe3 # timeout=10
No emails were triggered.
[ZooKeeper_branch34_openjdk7] $ /home/jenkins/tools/ant/apache-ant-1.9.9/bin/ant -Dtest.output=yes -Dtest.junit.threads=8 -Dtest.junit.output.format=xml -Djavac.target=1.7 clean test-core-java
Error: JAVA_HOME is not defined correctly.
  We cannot execute /usr/lib/jvm/java-7-openjdk-amd64//bin/java
Build step 'Invoke Ant' marked build as failure
Recording test results
ERROR: Step ‘Publish JUnit test result report’ failed: No test report files were found. Configuration error?
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any

### ## FAILED TESTS (if any) ##
No tests ran.
ZooKeeper_branch35_jdk7 - Build # 1076 - Still Failing
See https://builds.apache.org/job/ZooKeeper_branch35_jdk7/1076/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 71.76 MB...] [junit] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [junit] at java.lang.Thread.run(Thread.java:745) [junit] 2017-08-14 08:50:21,631 [myid:] - WARN [New I/O boss #3723:ClientCnxnSocketNetty$ZKClientHandler@439] - Exception caught: [id: 0xd67bb607] EXCEPTION: java.net.ConnectException: Connection refused: 127.0.0.1/127.0.0.1:19365 [junit] java.net.ConnectException: Connection refused: 127.0.0.1/127.0.0.1:19365 [junit] at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) [junit] at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744) [junit] at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152) [junit] at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105) [junit] at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79) [junit] at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337) [junit] at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42) [junit] at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [junit] at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [junit] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [junit] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [junit] at java.lang.Thread.run(Thread.java:745) [junit] 2017-08-14 08:50:21,631 [myid:] - INFO [New I/O boss #3723:ClientCnxnSocketNetty@208] - channel is told closing [junit] 2017-08-14 08:50:21,631 [myid:127.0.0.1:19365] - INFO [main-SendThread(127.0.0.1:19365):ClientCnxn$SendThread@1231] - channel for sessionid 0x105595b2828 is lost, closing socket connection and attempting reconnect [junit] 2017-08-14 08:50:21,691 
[myid:] - INFO [ProcessThread(sid:0 cport:19547)::PrepRequestProcessor@611] - Processed session termination for sessionid: 0x105596142e1 [junit] 2017-08-14 08:50:21,694 [myid:] - INFO [SyncThread:0:MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=StandaloneServer_port19547,name1=Connections,name2=127.0.0.1,name3=0x105596142e1] [junit] 2017-08-14 08:50:21,695 [myid:] - INFO [main:ClientCnxnSocketNetty@208] - channel is told closing [junit] 2017-08-14 08:50:21,695 [myid:] - INFO [New I/O worker #8334:ClientCnxnSocketNetty$ZKClientHandler@384] - channel is disconnected: [id: 0x49c543bb, /127.0.0.1:49258 :> 127.0.0.1/127.0.0.1:19547] [junit] 2017-08-14 08:50:21,695 [myid:] - INFO [New I/O worker #8334:ClientCnxnSocketNetty@208] - channel is told closing [junit] 2017-08-14 08:50:21,696 [myid:] - INFO [main:ZooKeeper@1334] - Session: 0x105596142e1 closed [junit] 2017-08-14 08:50:21,696 [myid:] - INFO [main-EventThread:ClientCnxn$EventThread@513] - EventThread shut down for session: 0x105596142e1 [junit] 2017-08-14 08:50:21,696 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@82] - Memory used 108951 [junit] 2017-08-14 08:50:21,696 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@87] - Number of threads 988 [junit] 2017-08-14 08:50:21,697 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@102] - FINISHED TEST METHOD testWatcherAutoResetWithLocal [junit] 2017-08-14 08:50:21,697 [myid:] - INFO [main:ClientBase@586] - tearDown starting [junit] 2017-08-14 08:50:21,698 [myid:] - INFO [main:ClientBase@556] - STOPPING server [junit] 2017-08-14 08:50:21,698 [myid:] - INFO [main:NettyServerCnxnFactory@464] - shutdown called 0.0.0.0/0.0.0.0:19547 [junit] 2017-08-14 08:50:21,703 [myid:] - INFO [main:ZooKeeperServer@541] - shutting down [junit] 2017-08-14 08:50:21,704 [myid:] - ERROR [main:ZooKeeperServer@505] - ZKShutdownHandler is not registered, so ZooKeeper server won't take any action on ERROR or SHUTDOWN server state changes 
[junit] 2017-08-14 08:50:21,704 [myid:] - INFO [main:SessionTrackerImpl@232] - Shutting down [junit] 2017-08-14 08:50:21,704 [myid:] - INFO [main:PrepRequestProcessor@1005] - Shutting down [junit] 2017-08-14 08:50:21,704 [myid:] - INFO [main:SyncRequestProcessor@191] - Shutting down [junit] 2017-08-14 08:50:21,704 [myid:] - INFO [ProcessThread(sid:0 cport:19547)::PrepRequestProcessor@155] - PrepRequestProcessor exited loop! [junit] 2017-08-14 08:50:21,704 [myid:] - INFO [SyncThread:0:SyncRequestProcessor@169] - SyncRequestProcessor exited! [junit] 2017-08-14
Success: ZOOKEEPER- PreCommit Build #941
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 75.59 MB...] [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +0 tests included. The patch appears to be a documentation patch that doesn't require tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] [exec] +1 core tests. The patch passed core unit tests. [exec] [exec] +1 contrib tests. The patch passed contrib unit tests. [exec] [exec] Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941//testReport/ [exec] Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html [exec] Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941//console [exec] [exec] This message is automatically generated. [exec] [exec] [exec] == [exec] == [exec] Adding comment to Jira. [exec] == [exec] == [exec] [exec] [exec] Comment added. [exec] 1b57c837719cded1d9f7a6e2a8ae9ee3d927b362 logged out [exec] [exec] [exec] == [exec] == [exec] Finished build. 
[exec] == [exec] == [exec] [exec] [exec] mv: ‘/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/patchprocess’ and ‘/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/patchprocess’ are the same file BUILD SUCCESSFUL Total time: 20 minutes 25 seconds Archiving artifacts Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7 Recording test results Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7 Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7 [description-setter] Description set: ZOOKEEPER-2836 Putting comment on the pull request Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7 Email was triggered for: Success Sending email for trigger: Success Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7 Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7 Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7 ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (ZOOKEEPER-2836) QuorumCnxManager.Listener Thread Better handling of SocketTimeoutException
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125339#comment-16125339 ]

Hadoop QA commented on ZOOKEEPER-2836:
--
+1 overall. GitHub Pull Request Build

+1 @author. The patch does not contain any @author tags.
+0 tests included. The patch appears to be a documentation patch that doesn't require tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941//console

This message is automatically generated.

> QuorumCnxManager.Listener Thread Better handling of SocketTimeoutException
> --
>
> Key: ZOOKEEPER-2836
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2836
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum
> Affects Versions: 3.4.6
> Environment: Machine: Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.78-1 x86_64 GNU/Linux
> Java Version: jdk64/jdk1.8.0_40
> zookeeper version: 3.4.6.2.3.2.0-2950
> Reporter: Amarjeet Singh
> Priority: Critical
>
> The QuorumCnxManager Listener thread blocks SocketServer on accept, but we are getting a SocketTimeoutException on our boxes after 49 days 17 hours. As per the current code there are 3 retries, and after that it says "_As I'm leaving the listener thread, I won't be able to participate in leader election any longer: $/$:3888__". Once server nodes reach this state and we restart or add a new node, it fails to join the cluster and logs 'WARN
[jira] [Commented] (ZOOKEEPER-2836) QuorumCnxManager.Listener Thread Better handling of SocketTimeoutException
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125335#comment-16125335 ]

ASF GitHub Bot commented on ZOOKEEPER-2836:
---
Github user bitgaoshu commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/334#discussion_r132887998

    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java ---
    @@ -647,11 +648,11 @@ public void run() {
                             numRetries = 0;
                         }
                     } catch (IOException e) {
    -                    if (shutdown) {
    -                        break;
    -                    }
                         LOG.error("Exception while listening", e);
    -                    numRetries++;
    +                    if (!(e instanceof SocketTimeoutException)) {
    +                        numRetries++;
    +                    }
    +                }finally {
    --- End diff --

    This is my first time committing code on GitHub. I opened a new [pr](https://github.com/apache/zookeeper/pull/336), which has been fixed according to your feedback. Sorry for the inconvenience.
[GitHub] zookeeper pull request #334: ZOOKEEPER-2836 SocketTimeoutException
Github user bitgaoshu commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/334#discussion_r132887998

    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java ---
    @@ -647,11 +648,11 @@ public void run() {
                             numRetries = 0;
                         }
                     } catch (IOException e) {
    -                    if (shutdown) {
    -                        break;
    -                    }
                         LOG.error("Exception while listening", e);
    -                    numRetries++;
    +                    if (!(e instanceof SocketTimeoutException)) {
    +                        numRetries++;
    +                    }
    +                }finally {
    --- End diff --

    This is my first time committing code on GitHub. I opened a new [pr](https://github.com/apache/zookeeper/pull/336), which has been fixed according to your feedback. Sorry for the inconvenience.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
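The idea in the diff under review, counting only non-timeout I/O failures against the listener's retry budget, can be sketched in isolation. This is a simplified illustration, not the code in `QuorumCnxManager`: `ListenerRetrySketch`, `AcceptStep`, `runListener` and `MAX_RETRIES` are hypothetical names, and the accept call is replaced by a pluggable step so the behaviour can be shown without real sockets. The limit of 3 is taken from the issue description.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

/** Hypothetical sketch of a retry budget that ignores accept timeouts. */
final class ListenerRetrySketch {
    static final int MAX_RETRIES = 3; // "3 times retry" per the issue description

    /** Stand-in for one blocking accept() call; assumption for illustration only. */
    interface AcceptStep {
        void acceptOnce() throws IOException;
    }

    /** A timeout is not treated as a listener failure; other I/O errors are. */
    static boolean countsAsRetry(IOException e) {
        return !(e instanceof SocketTimeoutException);
    }

    /** Runs up to maxIterations accepts and returns the final retry count. */
    static int runListener(AcceptStep step, int maxIterations) {
        int numRetries = 0;
        for (int i = 0; i < maxIterations && numRetries < MAX_RETRIES; i++) {
            try {
                step.acceptOnce();
                numRetries = 0; // a successful accept resets the budget
            } catch (IOException e) {
                if (countsAsRetry(e)) {
                    numRetries++;
                }
            }
        }
        return numRetries;
    }

    public static void main(String[] args) {
        int afterTimeouts = runListener(
                () -> { throw new SocketTimeoutException("simulated accept timeout"); }, 10);
        int afterErrors = runListener(
                () -> { throw new IOException("simulated hard failure"); }, 10);
        System.out.println("retries after timeouts only: " + afterTimeouts);
        System.out.println("retries after hard failures: " + afterErrors);
    }
}
```

With this policy a listener that only ever sees timeouts never exhausts its budget, while repeated hard failures still stop the loop after `MAX_RETRIES` attempts.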
[jira] [Commented] (ZOOKEEPER-2836) QuorumCnxManager.Listener Thread Better handling of SocketTimeoutException
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125320#comment-16125320 ]

ASF GitHub Bot commented on ZOOKEEPER-2836:
---
GitHub user bitgaoshu opened a pull request:

    https://github.com/apache/zookeeper/pull/336

    ZOOKEEPER-2836 fix SocketTimeoutException

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bitgaoshu/zookeeper ZOOKEEPER-2836

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/336.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #336

commit 3653e6ddc21589355fb06c98aa60665fce4a4e24
Author: bitgaoshu
Date: 2017-08-14T07:02:16Z

    ZOOKEEPER-2836 fix SocketTimeoutException
[GitHub] zookeeper pull request #336: ZOOKEEPER-2836 fix SocketTimeoutException
GitHub user bitgaoshu opened a pull request:

    https://github.com/apache/zookeeper/pull/336

    ZOOKEEPER-2836 fix SocketTimeoutException

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bitgaoshu/zookeeper ZOOKEEPER-2836

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/336.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #336

commit 3653e6ddc21589355fb06c98aa60665fce4a4e24
Author: bitgaoshu
Date: 2017-08-14T07:02:16Z

    ZOOKEEPER-2836 fix SocketTimeoutException
[jira] [Commented] (ZOOKEEPER-2846) Leader follower sync with on disk txns can possibly leads to data inconsistency
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125281#comment-16125281 ]

Fangmin Lv commented on ZOOKEEPER-2846:
---
[~hanm] The challenge here is that we can't tell whether a txn is missing or the gap is due to an epoch change. We need a way to verify that the zxids are continuous. We have an intern project to verify txn integrity, but that won't be available in the near term, so my suggestion is to turn off the on-disk txn sync for now.

> Leader follower sync with on disk txns can possibly leads to data inconsistency
> ---
>
> Key: ZOOKEEPER-2846
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2846
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum
> Affects Versions: 3.4.10, 3.5.3, 3.6.0
> Reporter: Fangmin Lv
> Priority: Critical
>
> On-disk txn sync could cause data inconsistency if the current leader had a snap sync just before it became leader; a later diff sync with its followers may then replicate the txn gap on its disk. Here is the scenario:
> Let's say S0 - S3 are followers, and S4 is the leader at the beginning:
> 1. Stop S2 and send one more request
> 2. Stop S3 and send more requests to the quorum, so that S3 has a snap sync with S4 when it starts up
> 3. Stop S4, so that S3 becomes the new leader
> 4. Start S2, which has a diff sync with S3; now there are gaps in S2
> The attached test case verifies the issue. Currently, there is no efficient way to check whether a gap in the txn files is a real gap or is due to an epoch change. We need to add that support, but before that, it would be safer to disable the on-disk txn leader-follower sync.
> Two other scenarios could cause the same issue:
> (Scenario 1) Servers A, B, C; A is the leader, the others are followers:
> 1). A synced a txn to disk, but the other 2 restarted before receiving the proposal
> 2). B and C formed a quorum, B is the leader, and committed some requests
> 3). A is looking again and syncs with B; B won't be able to trunc A but sends a snap instead, which leaves the extra txn in A's txn file
> 4). A becomes the new leader, and any server that has a diff sync with A will get the extra txn
> (Scenario 2) A diff sync with committed txns is applied only to the data tree and not to the on-disk txn file, which also leaves a hole in the file and leads to data inconsistency when syncing with learners.
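The difficulty described above, telling a real gap from an epoch change, can be made concrete with a small sketch. It relies on the zxid layout (epoch in the high 32 bits, counter in the low 32 bits); `ZxidGapSketch`, `Step` and `classify` are hypothetical names for illustration and are not part of ZooKeeper. The sketch deliberately reports an epoch change as its own category rather than declaring it safe, since the issue states there is currently no efficient way to decide that case.

```java
/** Hypothetical sketch: classify the step between two adjacent zxids in a txn log. */
final class ZxidGapSketch {
    enum Step {
        CONTIGUOUS,   // same epoch, counter advanced by exactly one
        EPOCH_CHANGE, // epoch advanced: may or may not hide missing txns
        GAP           // same epoch but counter not consecutive, or epoch went backwards
    }

    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    static Step classify(long prevZxid, long nextZxid) {
        long prevEpoch = epochOf(prevZxid);
        long nextEpoch = epochOf(nextZxid);
        if (nextEpoch == prevEpoch) {
            return counterOf(nextZxid) == counterOf(prevZxid) + 1 ? Step.CONTIGUOUS : Step.GAP;
        }
        return nextEpoch > prevEpoch ? Step.EPOCH_CHANGE : Step.GAP;
    }

    public static void main(String[] args) {
        // Illustrative zxids only; they do not come from the attached test case.
        long[] log = {0x100000005L, 0x100000006L, 0x100000009L, 0x200000001L};
        for (int i = 1; i < log.length; i++) {
            System.out.println(Long.toHexString(log[i - 1]) + " -> "
                    + Long.toHexString(log[i]) + ": " + classify(log[i - 1], log[i]));
        }
    }
}
```

A check like this can flag gaps inside one epoch, but the `EPOCH_CHANGE` case stays ambiguous without extra information, which is why the comment suggests disabling on-disk txn sync until such verification exists.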