[jira] [Updated] (ZOOKEEPER-2867) an expired ZK session can be re-established

2017-08-14 Thread Jun Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao updated ZOOKEEPER-2867:
---
Attachment: zk.2.08-02
zk.0.08-02
zk.1.08-02

[~hanm], I am attaching the log from 3 ZK servers on Aug. 2. What happened was 
the following.

1. All 3 ZK servers went down around 23:16 and were restarted around 23:41.

2. The Kafka controller, which never went down during that window, was able to 
re-establish its ZK session around 23:41:58.

{code:java}
August 2nd 2017, 23:41:58.499   INFO  org.apache.zookeeper.ClientCnxn  Socket 
connection established to 
zookeeper-2.cp14.svc.cluster.local/100.71.124.93:2181, initiating session
{code}

3. Kafka broker 0 (non-controller) went down at 23:17:18 and was restarted at 
23:42:12.

The old ZK session for broker 0 (25cd1e82c110001) should have expired 
after the ZK servers were restarted, but that did not seem to happen. When does 
the clock for session expiration start after the ZK cluster is restarted? 
After the ZK cluster has elected a leader?
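
To make the question concrete, here is a minimal sketch of one possible behavior, assuming the server rebuilds its session table from persisted (session id, timeout) pairs and counts every deadline from the moment the table is rebuilt. `SessionClockSketch` and its methods are hypothetical names for illustration and are not ZooKeeper's actual session tracker; whether 3.4.10 behaves this way is exactly what is being asked.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified session table; NOT ZooKeeper's real session tracker. */
final class SessionClockSketch {
    // Assumption: timeouts survive a restart (persisted with the data).
    private final Map<Long, Integer> timeoutsMs = new HashMap<>();
    // Assumption: deadlines live only in memory and are lost on restart.
    private final Map<Long, Long> deadlinesMs = new HashMap<>();

    /** Rebuilds the table after a restart; every deadline is counted from nowMs. */
    SessionClockSketch(Map<Long, Integer> persistedTimeoutsMs, long nowMs) {
        for (Map.Entry<Long, Integer> e : persistedTimeoutsMs.entrySet()) {
            timeoutsMs.put(e.getKey(), e.getValue());
            deadlinesMs.put(e.getKey(), nowMs + e.getValue());
        }
    }

    /** Drops the session if its deadline has passed; returns true if this call expired it. */
    boolean expireIfDue(long sessionId, long nowMs) {
        Long deadline = deadlinesMs.get(sessionId);
        if (deadline == null || nowMs < deadline) {
            return false;
        }
        deadlinesMs.remove(sessionId);
        timeoutsMs.remove(sessionId);
        return true;
    }

    /** A reconnect succeeds only while the session is tracked and not past its deadline. */
    boolean renew(long sessionId, long nowMs) {
        Long deadline = deadlinesMs.get(sessionId);
        if (deadline == null || nowMs >= deadline) {
            return false;
        }
        deadlinesMs.put(sessionId, nowMs + timeoutsMs.get(sessionId));
        return true;
    }
}
```

Under that assumption, the roughly 25 minutes during which the ensemble was down would not count against broker 0's 12000 ms timeout; the session would only become eligible for expiration 12 seconds after the table was rebuilt.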


> an expired ZK session can be re-established
> ---
>
> Key: ZOOKEEPER-2867
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2867
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.10
>Reporter: Jun Rao
> Attachments: zk.0.08-02, zk.0.formatted, zk.1.08-02, zk.1.formatted, 
> zk.2.08-02
>
>
> Not sure if this is a real bug, but I found an instance where a ZK client 
> seems to have renewed a session that the ZK server had already expired.
> According to the ZK server log, session 25cd1e82c110001 was expired at 22:04:39.
> {code:java}
> June 27th 2017, 22:04:39.000  INFO
> org.apache.zookeeper.server.ZooKeeperServer Expiring session 
> 0x25cd1e82c110001, timeout of 12000ms exceeded
> June 27th 2017, 22:04:39.001  DEBUG   
> org.apache.zookeeper.server.quorum.Leader   Proposing:: 
> sessionid:0x25cd1e82c110001 type:closeSession cxid:0x0 zxid:0x20fc4 
> txntype:-11 reqpath:n/a
> June 27th 2017, 22:04:39.001  INFO
> org.apache.zookeeper.server.PrepRequestProcessor  Processed session 
> termination for sessionid: 0x25cd1e82c110001
> June 27th 2017, 22:04:39.001  DEBUG   
> org.apache.zookeeper.server.quorum.CommitProcessor  Processing request:: 
> sessionid:0x25cd1e82c110001 type:closeSession cxid:0x0 zxid:0x20fc4 
> txntype:-11 reqpath:n/a
> June 27th 2017, 22:05:20.324  INFO
> org.apache.zookeeper.server.quorum.Learner  Revalidating client: 
> 0x25cd1e82c110001
> June 27th 2017, 22:05:20.324  INFO
> org.apache.zookeeper.server.ZooKeeperServer Client attempting to renew 
> session 0x25cd1e82c110001 at /100.96.5.6:47618
> June 27th 2017, 22:05:20.325  INFO
> org.apache.zookeeper.server.ZooKeeperServer Established session 
> 0x25cd1e82c110001 with negotiated timeout 12000 for client /100.96.5.6:47618
> {code}
> According to the ZK client's log, it was able to renew the expired session at 22:05:20.
> {code:java}
> June 27th 2017, 22:05:18.590  INFO  org.apache.zookeeper.ClientCnxn  Client 
> session timed out, have not heard from server in 4485ms for sessionid 
> 0x25cd1e82c110001, closing socket connection and attempting reconnect  0
> June 27th 2017, 22:05:18.590  WARN  org.apache.zookeeper.ClientCnxn  Client 
> session timed out, have not heard from server in 4485ms for sessionid 
> 0x25cd1e82c110001  0
> June 27th 2017, 22:05:19.325  WARN  org.apache.zookeeper.ClientCnxn  SASL 
> configuration failed: javax.security.auth.login.LoginException: No JAAS 
> configuration section named 'Client' was found in specified JAAS 
> configuration file: '/opt/confluent/etc/kafka/server_jaas.conf'. Will 
> continue connection to Zookeeper server without SASL authentication, if 
> Zookeeper server allows it. 0
> June 27th 2017, 22:05:19.326  INFO  org.apache.zookeeper.ClientCnxn  Opening 
> socket connection to server 100.65.188.168/100.65.188.168:2181  0
> June 27th 2017, 22:05:20.324  INFO  org.apache.zookeeper.ClientCnxn  Socket 
> connection established to 100.65.188.168/100.65.188.168:2181, initiating 
> session 0
> June 27th 2017, 22:05:20.327  INFO  org.apache.zookeeper.ClientCnxn  Session 
> establishment complete on server 100.65.188.168/100.65.188.168:2181, 
> sessionid = 0x25cd1e82c110001, negotiated timeout = 12000  0
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (ZOOKEEPER-1360) QuorumTest.testNoLogBeforeLeaderEstablishment has several problems

2017-08-14 Thread Abraham Fine (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abraham Fine reassigned ZOOKEEPER-1360:
---

Assignee: Abraham Fine  (was: Henry Robinson)

> QuorumTest.testNoLogBeforeLeaderEstablishment has several problems
> --
>
> Key: ZOOKEEPER-1360
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1360
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: tests
>Affects Versions: 3.4.2
>Reporter: Henry Robinson
>Assignee: Abraham Fine
> Fix For: 3.5.4, 3.6.0
>
>
> After the apparently valid fix to ZOOKEEPER-1294, 
> testNoLogBeforeLeaderEstablishment is failing for me about one time in four. 
> While I'll investigate whether the patch in 1294 is ultimately to blame, 
> reading the test brought to light a number of issues that appear to be bugs 
> or in need of improvement:
> * As part of QuorumTest, an ensemble is already established by the fixture 
> setup code, but it is apparently unused by the test, which uses QuorumUtil. 
> * The test reads QuorumPeer.leader and QuorumPeer.follower without 
> synchronization, which means that writes to those fields may not be published 
> when we come to read them. 
> * The return value of sem.tryAcquire is never checked.
> * The progress of the test is based on ad-hoc timings (25 * 500ms sleeps) and 
> inscrutable numbers of iterations through the main loop (e.g. the semaphore 
> blocking the final asserts is released only after the 2nd of 5 
> callbacks).
> * The test as a whole takes ~30s to run.
> The first three are easy to fix (as part of fixing the second, I intend to 
> hide all members of QuorumPeer behind getters and setters); the fourth and 
> fifth need a slightly deeper understanding of what the test is trying to 
> achieve.
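
A sketch of how the second, third, and fourth points might be addressed together, assuming the test only needs to wait for a known number of callbacks: a `volatile` field makes the callback's write visible to the reading thread, and a `CountDownLatch` with a checked timeout stands in for the fixed sleeps and the unchecked `sem.tryAcquire`. `CallbackWaitSketch`, `onCallback`, and `awaitRole` are hypothetical names and are not part of QuorumTest or QuorumPeer.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: waits for N callbacks instead of sleeping a fixed time. */
final class CallbackWaitSketch {
    // volatile so that a write made on the callback thread is visible to the reader
    private volatile String role = "LOOKING";
    private final CountDownLatch callbacksSeen;

    CallbackWaitSketch(int expectedCallbacks) {
        this.callbacksSeen = new CountDownLatch(expectedCallbacks);
    }

    /** Invoked by whatever thread delivers the callback. */
    void onCallback(String newRole) {
        role = newRole;
        callbacksSeen.countDown();
    }

    /** Blocks until all expected callbacks arrived; fails loudly on timeout. */
    String awaitRole(long timeout, TimeUnit unit) {
        try {
            // The boolean result is checked, unlike the ignored tryAcquire result.
            if (!callbacksSeen.await(timeout, unit)) {
                throw new IllegalStateException(
                        "timed out with " + callbacksSeen.getCount() + " callbacks outstanding");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return role;
    }
}
```

With this shape, the test returns as soon as the last callback fires, so its running time is bounded by the work itself and not by a fixed number of sleeps.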





[jira] [Commented] (ZOOKEEPER-1360) QuorumTest.testNoLogBeforeLeaderEstablishment has several problems

2017-08-14 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126316#comment-16126316
 ] 

Henry Robinson commented on ZOOKEEPER-1360:
---

Not at all - haven't looked at this in years!



[jira] [Commented] (ZOOKEEPER-1360) QuorumTest.testNoLogBeforeLeaderEstablishment has several problems

2017-08-14 Thread Abraham Fine (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126314#comment-16126314
 ] 

Abraham Fine commented on ZOOKEEPER-1360:
-

[~henryr] do you mind if I take a look at this?



ZooKeeper_branch34_openjdk7 - Build # 1610 - Failure

2017-08-14 Thread Apache Jenkins Server
See https://builds.apache.org/job/ZooKeeper_branch34_openjdk7/1610/

###
## LAST 60 LINES OF THE CONSOLE 
###
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on qnode1 (ubuntu) in workspace 
/home/jenkins/jenkins-slave/workspace/ZooKeeper_branch34_openjdk7
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url git://git.apache.org/zookeeper.git # timeout=10
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
Fetching upstream changes from git://git.apache.org/zookeeper.git
 > git --version # timeout=10
 > git fetch --tags --progress git://git.apache.org/zookeeper.git +refs/heads/*:refs/remotes/origin/*
 > git rev-parse refs/remotes/origin/branch-3.4^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/branch-3.4^{commit} # timeout=10
Checking out Revision 1f811a6281090e1b24152dc51507aa6a2bdeafe3 
(refs/remotes/origin/branch-3.4)
Commit message: "ZOOKEEPER-2859: Fix CMake build on OS X."
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 1f811a6281090e1b24152dc51507aa6a2bdeafe3
 > git rev-list 1f811a6281090e1b24152dc51507aa6a2bdeafe3 # timeout=10
No emails were triggered.
[ZooKeeper_branch34_openjdk7] $ 
/home/jenkins/tools/ant/apache-ant-1.9.9/bin/ant -Dtest.output=yes 
-Dtest.junit.threads=8 -Dtest.junit.output.format=xml -Djavac.target=1.7 clean 
test-core-java
Error: JAVA_HOME is not defined correctly.
  We cannot execute /usr/lib/jvm/java-7-openjdk-amd64//bin/java
Build step 'Invoke Ant' marked build as failure
Recording test results
ERROR: Step ‘Publish JUnit test result report’ failed: No test report files 
were found. Configuration error?
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any



###
## FAILED TESTS (if any) 
##
No tests ran.

ZooKeeper_branch35_jdk7 - Build # 1076 - Still Failing

2017-08-14 Thread Apache Jenkins Server
See https://builds.apache.org/job/ZooKeeper_branch35_jdk7/1076/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 71.76 MB...]
[junit] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[junit] at java.lang.Thread.run(Thread.java:745)
[junit] 2017-08-14 08:50:21,631 [myid:] - WARN  [New I/O boss 
#3723:ClientCnxnSocketNetty$ZKClientHandler@439] - Exception caught: [id: 
0xd67bb607] EXCEPTION: java.net.ConnectException: Connection refused: 
127.0.0.1/127.0.0.1:19365
[junit] java.net.ConnectException: Connection refused: 
127.0.0.1/127.0.0.1:19365
[junit] at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
[junit] at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
[junit] at 
org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
[junit] at 
org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
[junit] at 
org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
[junit] at 
org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
[junit] at 
org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
[junit] at 
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
[junit] at 
org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
[junit] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[junit] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[junit] at java.lang.Thread.run(Thread.java:745)
[junit] 2017-08-14 08:50:21,631 [myid:] - INFO  [New I/O boss 
#3723:ClientCnxnSocketNetty@208] - channel is told closing
[junit] 2017-08-14 08:50:21,631 [myid:127.0.0.1:19365] - INFO  
[main-SendThread(127.0.0.1:19365):ClientCnxn$SendThread@1231] - channel for 
sessionid 0x105595b2828 is lost, closing socket connection and attempting 
reconnect
[junit] 2017-08-14 08:50:21,691 [myid:] - INFO  [ProcessThread(sid:0 
cport:19547)::PrepRequestProcessor@611] - Processed session termination for 
sessionid: 0x105596142e1
[junit] 2017-08-14 08:50:21,694 [myid:] - INFO  
[SyncThread:0:MBeanRegistry@128] - Unregister MBean 
[org.apache.ZooKeeperService:name0=StandaloneServer_port19547,name1=Connections,name2=127.0.0.1,name3=0x105596142e1]
[junit] 2017-08-14 08:50:21,695 [myid:] - INFO  
[main:ClientCnxnSocketNetty@208] - channel is told closing
[junit] 2017-08-14 08:50:21,695 [myid:] - INFO  [New I/O worker 
#8334:ClientCnxnSocketNetty$ZKClientHandler@384] - channel is disconnected: 
[id: 0x49c543bb, /127.0.0.1:49258 :> 127.0.0.1/127.0.0.1:19547]
[junit] 2017-08-14 08:50:21,695 [myid:] - INFO  [New I/O worker 
#8334:ClientCnxnSocketNetty@208] - channel is told closing
[junit] 2017-08-14 08:50:21,696 [myid:] - INFO  [main:ZooKeeper@1334] - 
Session: 0x105596142e1 closed
[junit] 2017-08-14 08:50:21,696 [myid:] - INFO  
[main-EventThread:ClientCnxn$EventThread@513] - EventThread shut down for 
session: 0x105596142e1
[junit] 2017-08-14 08:50:21,696 [myid:] - INFO  
[main:JUnit4ZKTestRunner$LoggedInvokeMethod@82] - Memory used 108951
[junit] 2017-08-14 08:50:21,696 [myid:] - INFO  
[main:JUnit4ZKTestRunner$LoggedInvokeMethod@87] - Number of threads 988
[junit] 2017-08-14 08:50:21,697 [myid:] - INFO  
[main:JUnit4ZKTestRunner$LoggedInvokeMethod@102] - FINISHED TEST METHOD 
testWatcherAutoResetWithLocal
[junit] 2017-08-14 08:50:21,697 [myid:] - INFO  [main:ClientBase@586] - 
tearDown starting
[junit] 2017-08-14 08:50:21,698 [myid:] - INFO  [main:ClientBase@556] - 
STOPPING server
[junit] 2017-08-14 08:50:21,698 [myid:] - INFO  
[main:NettyServerCnxnFactory@464] - shutdown called 0.0.0.0/0.0.0.0:19547
[junit] 2017-08-14 08:50:21,703 [myid:] - INFO  [main:ZooKeeperServer@541] 
- shutting down
[junit] 2017-08-14 08:50:21,704 [myid:] - ERROR [main:ZooKeeperServer@505] 
- ZKShutdownHandler is not registered, so ZooKeeper server won't take any 
action on ERROR or SHUTDOWN server state changes
[junit] 2017-08-14 08:50:21,704 [myid:] - INFO  
[main:SessionTrackerImpl@232] - Shutting down
[junit] 2017-08-14 08:50:21,704 [myid:] - INFO  
[main:PrepRequestProcessor@1005] - Shutting down
[junit] 2017-08-14 08:50:21,704 [myid:] - INFO  
[main:SyncRequestProcessor@191] - Shutting down
[junit] 2017-08-14 08:50:21,704 [myid:] - INFO  [ProcessThread(sid:0 
cport:19547)::PrepRequestProcessor@155] - PrepRequestProcessor exited loop!
[junit] 2017-08-14 08:50:21,704 [myid:] - INFO  
[SyncThread:0:SyncRequestProcessor@169] - SyncRequestProcessor exited!
[junit] 2017-08-14 

Success: ZOOKEEPER- PreCommit Build #941

2017-08-14 Thread Apache Jenkins Server
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 75.59 MB...]
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +0 tests included.  The patch appears to be a documentation 
patch that doesn't require tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
(version 3.0.1) warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec] 
 [exec] +1 core tests.  The patch passed core unit tests.
 [exec] 
 [exec] +1 contrib tests.  The patch passed contrib unit tests.
 [exec] 
 [exec] Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941//testReport/
 [exec] Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
 [exec] Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941//console
 [exec] 
 [exec] This message is automatically generated.
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Adding comment to Jira.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] Comment added.
 [exec] 1b57c837719cded1d9f7a6e2a8ae9ee3d927b362 logged out
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Finished build.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] mv: 
‘/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/patchprocess’
 and 
‘/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-github-pr-build/patchprocess’
 are the same file

BUILD SUCCESSFUL
Total time: 20 minutes 25 seconds
Archiving artifacts
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Recording test results
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
[description-setter] Description set: ZOOKEEPER-2836
Putting comment on the pull request
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Email was triggered for: Success
Sending email for trigger: Success
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (ZOOKEEPER-2836) QuorumCnxManager.Listener Thread Better handling of SocketTimeoutException

2017-08-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125339#comment-16125339
 ] 

Hadoop QA commented on ZOOKEEPER-2836:
--

+1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/941//console

This message is automatically generated.

> QuorumCnxManager.Listener Thread Better handling of SocketTimeoutException
> --
>
> Key: ZOOKEEPER-2836
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2836
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum
>Affects Versions: 3.4.6
> Environment: Machine: Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.78-1 
> x86_64 GNU/Linux
> Java Version: jdk64/jdk1.8.0_40
> zookeeper version:  3.4.6.2.3.2.0-2950 
>Reporter: Amarjeet Singh
>Priority: Critical
>
> The QuorumCnxManager Listener thread blocks in accept on its server socket, but we are 
> getting a SocketTimeoutException on our boxes after 49 days 17 hours. In the 
> current code there are 3 retries, after which it logs "_As I'm leaving 
> the listener thread, I won't be able to participate in leader election any 
> longer: $/$:3888__". Once server nodes reach this state and 
> we restart or add a new node, it fails to join the cluster and logs 'WARN  
> 

[jira] [Commented] (ZOOKEEPER-2836) QuorumCnxManager.Listener Thread Better handling of SocketTimeoutException

2017-08-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125335#comment-16125335
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2836:
---

Github user bitgaoshu commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/334#discussion_r132887998
  
--- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java ---
@@ -647,11 +648,11 @@ public void run() {
                         numRetries = 0;
                     }
                 } catch (IOException e) {
-                    if (shutdown) {
-                        break;
-                    }
                     LOG.error("Exception while listening", e);
-                    numRetries++;
+                    if (!(e instanceof SocketTimeoutException)) {
+                        numRetries++;
+                    }
+                }finally {
--- End diff --

This is my first time committing code on GitHub. I have opened a new 
[PR](https://github.com/apache/zookeeper/pull/336) that addresses 
your comments. I apologize for the inconvenience. 
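
The intent of the diff above can be sketched in isolation, assuming the listener loop keeps running while a retry counter stays below a limit (3, per the issue description). `ListenerRetryPolicy` is a hypothetical helper for illustration only; the pull request itself edits the loop in `QuorumCnxManager.java` directly.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

/** Hypothetical retry accounting for an accept loop; not the actual ZooKeeper code. */
final class ListenerRetryPolicy {
    private final int maxRetries;
    private int numRetries;

    ListenerRetryPolicy(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** A successful accept resets the counter, as in the context lines of the diff. */
    void onAccepted() {
        numRetries = 0;
    }

    /**
     * Records a failed accept and reports whether the loop should keep going.
     * A SocketTimeoutException is treated as benign and does not use up a retry.
     */
    boolean onFailure(IOException e) {
        if (!(e instanceof SocketTimeoutException)) {
            numRetries++;
        }
        return numRetries < maxRetries;
    }
}
```

The design choice is that a timeout on accept says nothing about the health of the listening socket, so only other I/O failures move the listener toward giving up on leader election.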



[jira] [Commented] (ZOOKEEPER-2836) QuorumCnxManager.Listener Thread Better handling of SocketTimeoutException

2017-08-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125320#comment-16125320
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2836:
---

GitHub user bitgaoshu opened a pull request:

https://github.com/apache/zookeeper/pull/336

ZOOKEEPER-2836 fix SocketTimeoutException



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/bitgaoshu/zookeeper ZOOKEEPER-2836

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/336.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #336


commit 3653e6ddc21589355fb06c98aa60665fce4a4e24
Author: bitgaoshu 
Date:   2017-08-14T07:02:16Z

ZOOKEEPER-2836 fix SocketTimeoutException





[jira] [Commented] (ZOOKEEPER-2846) Leader follower sync with on disk txns can possibly leads to data inconsistency

2017-08-14 Thread Fangmin Lv (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125281#comment-16125281
 ] 

Fangmin Lv commented on ZOOKEEPER-2846:
---

[~hanm] The challenge here is that we cannot tell whether a txn is missing or 
whether the gap is due to an epoch change. We need a way to verify that zxids 
are contiguous. We have an intern project to verify txn integrity, but it won't 
be available in the near term, so my suggestion is to turn off the on-disk txn 
sync for now. 
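
One way such a continuity check could look, as a rough sketch: a zxid is a 64-bit value whose high 32 bits hold the epoch and whose low 32 bits hold a per-epoch counter, so two consecutive log entries can be classified by comparing both parts. `ZxidGapSketch` and its `Step` enum are hypothetical names for illustration. The sketch also shows why an epoch change is the ambiguous case: the counter restarts, and nothing in the pair itself says whether the tail of the old epoch is complete.

```java
/** Hypothetical classifier for consecutive zxids in a txn log; illustration only. */
final class ZxidGapSketch {
    enum Step { CONTIGUOUS, GAP, EPOCH_CHANGE, OUT_OF_ORDER }

    /** High 32 bits of a zxid: the leader epoch. */
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    /** Low 32 bits of a zxid: the counter within the epoch. */
    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Classifies the transition from one logged zxid to the next one in the file. */
    static Step classify(long prev, long next) {
        if (next <= prev) {
            return Step.OUT_OF_ORDER;
        }
        if (epochOf(prev) != epochOf(next)) {
            // Ambiguous: txns at the end of the old epoch may or may not be missing.
            return Step.EPOCH_CHANGE;
        }
        return counterOf(next) == counterOf(prev) + 1 ? Step.CONTIGUOUS : Step.GAP;
    }
}
```

Only the `GAP` result is a definite hole; resolving `EPOCH_CHANGE` would need extra information that the txn file alone does not carry in this sketch.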

> Leader follower sync with on disk txns can possibly leads to data 
> inconsistency
> ---
>
> Key: ZOOKEEPER-2846
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2846
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: quorum
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Fangmin Lv
>Priority: Critical
>
> On-disk txn sync can cause data inconsistency if the current leader had a 
> snap sync just before it became leader and then does a diff sync with its 
> followers, which may sync the txn gap on disk. Here is the scenario: 
> Let's say S0 - S3 are followers and S4 is the leader at the beginning:
> 1. Stop S2 and send one more request
> 2. Stop S3 and send more requests to the quorum so that S3 has a snap sync 
> with S4 when it starts up
> 3. Stop S4; S3 becomes the new leader
> 4. Start S2, which has a diff sync with S3; now there are gaps in S2
> The attached test case verifies the issue. Currently, there is no efficient 
> way to check whether a gap in the txn files is a real gap or is due to an 
> epoch change. We need to add that support, but until then it would be safer 
> to disable the on-disk txn leader-follower sync.
> Two more scenarios could cause the same issue:
> (Scenario 1) Servers A, B, and C; A is the leader, the others are followers:
>   1). A synced a txn to disk, but the other 2 restarted before receiving the 
> proposal
>   2). B and C formed a quorum with B as leader and committed some requests
>   3). A goes back to looking and syncs with B; B is not able to trunc A and 
> sends a snap instead, which leaves the extra txn in A's txn file
>   4). A becomes the new leader, and any server that has a diff sync with A 
> will receive the extra txn 
> (Scenario 2) A diff sync with committed txns applies them only to the data 
> tree, not to the on-disk txn file, which also leaves a hole in the file and 
> leads to data inconsistency when syncing with learners.


