[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-15 Thread Alexander Shraer (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108866#comment-17108866
 ] 

Alexander Shraer commented on ZOOKEEPER-3814:
-

[~symat] the way this works is that, usually, config changes happen in two 
rounds: a proposal sets {{lastSeenQuorumVerifier}}, which writes the .next file, 
but then a commit calls processReconfig, which calls setQuorumVerifier. The same 
happens when a learner syncs with the leader: the leader's "proposal" is now 
NEW_LEADER and the leader's "commit" is UPTODATE. The commit / UPTODATE is the 
thing that actually changes the config, not {{lastSeenQuorumVerifier}} (though 
writing out .next files should also be prevented in this case, I think). Another 
place where the config could change is during the gossip that happens in leader 
election: servers send their configs around peer-to-peer and update their config 
to a later one if they see one (FastLeaderElection.java, look for 
processReconfig). There too, you could require that the reconfigEnabled flag is 
on before calling processReconfig.
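
A minimal sketch of that last suggestion. The guard method below is hypothetical 
and the {{isReconfigEnabled()}} accessor on {{QuorumPeer}} is an assumption; only 
{{processReconfig}} and {{QuorumVerifier}} are taken from the codebase, so this is 
an illustration of the idea rather than the actual patch:

{code:java}
import org.apache.zookeeper.server.quorum.QuorumPeer;
import org.apache.zookeeper.server.quorum.flexible.QuorumVerifier;

// Illustrative only: gate config updates gossiped during leader election on the
// reconfig flag, so an ensemble with reconfig disabled never adopts a newer
// config through this path. acceptGossipedConfig is a hypothetical helper.
final class ElectionConfigGuard {

    static void acceptGossipedConfig(QuorumPeer self, QuorumVerifier newConfig,
                                     Long suggestedLeaderId, Long zxid, boolean restartLE) {
        if (!self.isReconfigEnabled()) {
            // reconfig disabled: keep the locally configured quorum view
            return;
        }
        self.processReconfig(newConfig, suggestedLeaderId, zxid, restartLE);
    }
}
{code}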

> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5-node ZooKeeper ensemble from v3.4.8 to v3.5.6 and 
> encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post-upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to Kerberos referral issues. We also used a 
> different server identifier, *23*, when we migrated. Here is what the new 
> config looked like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the updated config above. The 
> migrated node joined the quorum successfully and served all clients directly 
> connected to it, without any issues.
> Recently, when a leader election happened, 
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as leader (as it 
> has the highest ID). But then ZooKeeper was unable to serve any clients, and 
> *all* the servers were _somehow still_ trying to establish a channel to 22 
> (old DNS name: node5.bar.com), throwing the error below in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at 
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at 
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at 
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManage

[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-15 Thread Rajkiran Sura (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108856#comment-17108856
 ] 

Rajkiran Sura commented on ZOOKEEPER-3814:
--

{quote}but in the meanwhile I recommend using dynamic reconfig to change the 
quorum.
{quote}
Yes, we have started to rely on dynamic reconfig. But I would like to note that 
dynamic reconfig isn't really dynamic when you have quorum auth enabled with 
GSSAPI via SASL: the config is changed, but the new member doesn't join the 
ensemble until all the members are restarted. So it's no longer really dynamic, 
and the procedure looks even scarier.

FTR: I have raised https://issues.apache.org/jira/browse/ZOOKEEPER-3824 for 
this issue.

Thanks Mate.

Regards,

Rajkiran

> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5-node ZooKeeper ensemble from v3.4.8 to v3.5.6 and 
> encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post-upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to Kerberos referral issues. We also used a 
> different server identifier, *23*, when we migrated. Here is what the new 
> config looked like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the updated config above. The 
> migrated node joined the quorum successfully and served all clients directly 
> connected to it, without any issues.
> Recently, when a leader election happened, 
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as leader (as it 
> has the highest ID). But then ZooKeeper was unable to serve any clients, and 
> *all* the servers were _somehow still_ trying to establish a channel to 22 
> (old DNS name: node5.bar.com), throwing the error below in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at 
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at 
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at 
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,0

[jira] [Commented] (ZOOKEEPER-3831) Add a test that does a minimal validation of Apache Curator

2020-05-15 Thread Jordan Zimmerman (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108540#comment-17108540
 ] 

Jordan Zimmerman commented on ZOOKEEPER-3831:
-

I'm excluding zookeeper in Maven and this will only be in the test path so it 
shouldn't pollute ZooKeeper's classpath. 

> Add a test that does a minimal validation of Apache Curator
> ---
>
> Key: ZOOKEEPER-3831
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3831
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: tests
>Affects Versions: 3.6.1
>Reporter: Jordan Zimmerman
>Assignee: Jordan Zimmerman
>Priority: Minor
>
> Given that Apache Curator is one of the most widely used ZooKeeper clients it 
> would be beneficial for ZooKeeper to have a minimal test to ensure that the 
> codebase doesn't cause incompatibilities with Curator in the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ZOOKEEPER-3831) Add a test that does a minimal validation of Apache Curator

2020-05-15 Thread Jordan Zimmerman (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108540#comment-17108540
 ] 

Jordan Zimmerman edited comment on ZOOKEEPER-3831 at 5/15/20, 6:35 PM:
---

I'm excluding zookeeper in Maven and this will only be in the test path so it 
shouldn't pollute ZooKeeper's classpath. But maybe a "compatibility" module is 
in order?


was (Author: randgalt):
I'm excluding zookeeper in Maven and this will only be in the test path so it 
shouldn't pollute ZooKeeper's classpath. 

> Add a test that does a minimal validation of Apache Curator
> ---
>
> Key: ZOOKEEPER-3831
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3831
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: tests
>Affects Versions: 3.6.1
>Reporter: Jordan Zimmerman
>Assignee: Jordan Zimmerman
>Priority: Minor
>
> Given that Apache Curator is one of the most widely used ZooKeeper clients it 
> would be beneficial for ZooKeeper to have a minimal test to ensure that the 
> codebase doesn't cause incompatibilities with Curator in the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3831) Add a test that does a minimal validation of Apache Curator

2020-05-15 Thread Enrico Olivelli (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108539#comment-17108539
 ] 

Enrico Olivelli commented on ZOOKEEPER-3831:


Very interesting.
I think this should live in a separate module under the ZooKeeper project, so as 
not to pollute the classpath.

> Add a test that does a minimal validation of Apache Curator
> ---
>
> Key: ZOOKEEPER-3831
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3831
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: tests
>Affects Versions: 3.6.1
>Reporter: Jordan Zimmerman
>Assignee: Jordan Zimmerman
>Priority: Minor
>
> Given that Apache Curator is one of the most widely used ZooKeeper clients it 
> would be beneficial for ZooKeeper to have a minimal test to ensure that the 
> codebase doesn't cause incompatibilities with Curator in the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3831) Add a test that does a minimal validation of Apache Curator

2020-05-15 Thread Jordan Zimmerman (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108525#comment-17108525
 ] 

Jordan Zimmerman commented on ZOOKEEPER-3831:
-

I have a PR nearly ready. We just need to release a new version of Curator.

> Add a test that does a minimal validation of Apache Curator
> ---
>
> Key: ZOOKEEPER-3831
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3831
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: tests
>Affects Versions: 3.6.1
>Reporter: Jordan Zimmerman
>Assignee: Jordan Zimmerman
>Priority: Minor
>
> Given that Apache Curator is one of the most widely used ZooKeeper clients it 
> would be beneficial for ZooKeeper to have a minimal test to ensure that the 
> codebase doesn't cause incompatibilities with Curator in the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ZOOKEEPER-3831) Add a test that does a minimal validation of Apache Curator

2020-05-15 Thread Jordan Zimmerman (Jira)
Jordan Zimmerman created ZOOKEEPER-3831:
---

 Summary: Add a test that does a minimal validation of Apache 
Curator
 Key: ZOOKEEPER-3831
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3831
 Project: ZooKeeper
  Issue Type: Improvement
  Components: tests
Affects Versions: 3.6.1
Reporter: Jordan Zimmerman
Assignee: Jordan Zimmerman


Given that Apache Curator is one of the most widely used ZooKeeper clients it 
would be beneficial for ZooKeeper to have a minimal test to ensure that the 
codebase doesn't cause incompatibilities with Curator in the future.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ZOOKEEPER-3830) After add a new node, zookeeper cluster won't commit any proposal if this new node is leader

2020-05-15 Thread Keli Wang (Jira)
Keli Wang created ZOOKEEPER-3830:


 Summary: After add a new node, zookeeper cluster won't commit any 
proposal if this new node is leader
 Key: ZOOKEEPER-3830
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3830
 Project: ZooKeeper
  Issue Type: Bug
 Environment: Zookeeper 3.5.8

JDK 1.8
Reporter: Keli Wang
 Attachments: reproduce-zkclusters.tar.gz

I have a ZooKeeper cluster with 3 nodes; node3 is the leader of the cluster.

 
{code:java}
server.1=node1
server.2=node2
server.3=node3 # current leader
{code}
With dynamic reconfiguration disabled, I scaled this cluster to 4 nodes in 2 
steps:
 # Start node4 with the new config; node4 is now a follower.
 # Modify the config and restart node1, node2, and node3 one by one.

The new cluster config is:
{code:java}
server.1=node1
server.2=node2
server.3=node3 
server.4=node4 # current leader
{code}
After the restart, node4 is the leader of this cluster, but I cannot connect to 
the cluster using zkCli now.

If I restart node4, node3 will become the new leader, and I can connect to the 
cluster using zkCli again.

After some digging, I found that node4's Leader#allowedToCommit field is false, 
so this cluster won't commit any new proposals.

 

I have attached a ZooKeeper cluster setup to reproduce this problem. The 
cluster in the attachment can run on a single machine.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-15 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108373#comment-17108373
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3814:
-

Unfortunately, I haven't found any trivial fix yet. I will try more approaches 
next week; in the meantime, I recommend using dynamic reconfig to change the 
quorum.

> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5-node ZooKeeper ensemble from v3.4.8 to v3.5.6 and 
> encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post-upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to Kerberos referral issues. We also used a 
> different server identifier, *23*, when we migrated. Here is what the new 
> config looked like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the updated config above. The 
> migrated node joined the quorum successfully and served all clients directly 
> connected to it, without any issues.
> Recently, when a leader election happened, 
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as leader (as it 
> has the highest ID). But then ZooKeeper was unable to serve any clients, and 
> *all* the servers were _somehow still_ trying to establish a channel to 22 
> (old DNS name: node5.bar.com), throwing the error below in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at 
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at 
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at 
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at 
> election address node5.bar.com:3888}}
> {{java.net.UnknownHostException: node5.bar.com}}
> {{ at 
> java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}}
> {{ at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)}}
> {{

[jira] [Comment Edited] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread benwang li (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108046#comment-17108046
 ] 

benwang li edited comment on ZOOKEEPER-3829 at 5/15/20, 2:56 PM:
-

We start the `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L455].

We shut down the `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L637].

But when we call the `start` method again, the `workerPool` no longer works. I 
have attached the node D logs (`d.log`), and we can see this happen there.
{code:java}
307:  2020-05-14 18:04:12,022 [myid:4] - INFO 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@362]
 - Shutting down 
308 2020-05-14 18:04:12,022 [myid:4] - INFO  
[FollowerRequestProcessor:4:FollowerRequestProcessor@110] - 
FollowerRequestProcessor exited loop!
309 2020-05-14 18:04:12,022 [myid:4] - INFO  
[CommitProcessor:4:CommitProcessor@195] - CommitProcessor exited loop!
310 2020-05-14 18:04:12,023 [myid:4] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FinalRequestProcessor@514]
 - shutdown of request processor complete
311 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@655]
 - Created new input stream /data1/zookeeper/logs/version-2/log.2a000b
312 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@658]
 - Created new input archive /data1/zookeeper/logs/version-2/log.2a000b
313 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@696]
 - EOF exception java.io.EOFException: Failed to read 
/data1/zookeeper/logs/version-2/log.2a000b
314 --
315 2020-05-14 18:04:29,000 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274]
 - Adding session 0x3082f5048fc
316 2020-05-14 18:04:29,000 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274]
 - Adding session 0x40a33f8f3f40002
317 2020-05-14 18:04:29,000 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274]
 - Adding session 0x40a33f8f3f4
318 2020-05-14 18:04:29,000 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274]
 - Adding session 0x40a33f8f3f40001
319 2020-05-14 18:04:29,000 [myid:4] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@256]
 - Configuring CommitProcessor with 24 worker threads.
320 2020-05-14 18:04:29,002 [myid:4] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):ContainerManager@64]
 - Using checkIntervalMs=6 maxPerMinute=1
321 2020-05-14 18:04:29,003 [myid:4] - DEBUG 
[LearnerHandler-/146.196.79.232:38708:LearnerHandler@534] - Sending UPTODATE 
message to 3
{code}
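
For reference, here is a minimal, self-contained illustration of the 
start/shutdown pattern described above, using a plain {{ExecutorService}} instead 
of ZooKeeper's {{WorkerService}}. The class is hypothetical, not the actual 
{{CommitProcessor}} code: if {{shutdown()}} stops the pool but keeps the 
reference, a later {{start()}} skips re-creating it; clearing the field, as 
proposed in this issue, makes {{start()}} build a fresh pool.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration of the pattern discussed above, not ZooKeeper code.
// start() only creates the pool when the field is null, so a shutdown() that
// merely stops the pool would leave later restarts with a pool that no longer
// accepts work. Clearing the field (the fix proposed here for CommitProcessor's
// workerPool) forces start() to rebuild it.
final class RestartableProcessor {

    private ExecutorService workerPool;

    synchronized void start() {
        if (workerPool == null) {
            workerPool = Executors.newFixedThreadPool(4);
        }
    }

    synchronized void shutdown() {
        if (workerPool != null) {
            workerPool.shutdown();
            workerPool = null; // proposed change: drop the stopped pool
        }
    }

    synchronized void process(Runnable task) {
        if (workerPool != null) {
            workerPool.execute(task);
        }
    }
}
{code}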


was (Author: sundyli):
We start `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L455]
 .

We shutdown `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L637].

But when we call `start` method again, the `workerPool` will not work anymore.  
I  submit the node D logs attachment `d.log`, and we can see that happens.

 
{code:java}
 308 2020-05-14 18:04:12,022 [myid:4] - INFO  
[FollowerRequestProcessor:4:FollowerRequestProcessor@110] - 
FollowerRequestProcessor exited loop!
309 2020-05-14 18:04:12,022 [myid:4] - INFO  
[CommitProcessor:4:CommitProcessor@195] - CommitProcessor exited loop!
310 2020-05-14 18:04:12,023 [myid:4] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FinalRequestProcessor@514]
 - shutdown of request processor complete
311 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@655]
 - Created new input stream /data1/zookeeper/logs/version-2/log.2a000b
312 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@658]
 - Created new input archive /data1/zookeeper/logs/version-2/log.2a000b
313 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0

[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108370#comment-17108370
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3829:
-

{quote}Hi, I reproduced it with your docker-compose scripts
{quote}
Great, thanks for the detailed steps!
I will try them locally on Monday and verify your findings. (I used slightly 
different docker-compose commands; maybe those made the difference.)

The config looks OK, except for {{initLimit}}, which should be much smaller: it 
is given as a number of ticks, not milliseconds. But I don't think it matters 
much in this case.

Thanks for taking so much time chasing a ZooKeeper error! :)

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
> Attachments: d.log, screenshot-1.png
>
>
> It's easy to reproduce this bug:
> {code:java}
> // code placeholder
> Step 1. Deploy 3 nodes A, B, C with configuration A,B,C.
> Step 2. Deploy node D with configuration A,B,C,D; the cluster state is OK now.
> Step 3. Restart nodes A, B, C with configuration A,B,C,D. The leader will then 
> be D and the cluster hangs; it still accepts the `mntr` command, but other 
> commands like `ls /` are blocked.
> Step 4. Restart node D; the cluster state is back to normal now.
> {code}
> We have looked into the code of the 3.5.6 version, and we found it may be an 
> issue with the `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the 
> `workerPool` reference still exists. It will never work again, yet the cluster 
> still thinks it's OK.
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> that's OK, please assign this issue to me, and then I'll create a PR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread benwang li (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108368#comment-17108368
 ] 

benwang li commented on ZOOKEEPER-3829:
---

{quote}Did you try your proposed fix already and saw that it solves your 
original issue?
{quote}
Sorry, I forgot to answer this.
Yes, I fixed and tested it; it works normally after the fix.

 

Have you seen my reply message? (I reproduced it with your docker repo.) It 
must be some configuration that makes this issue happen. I will try to find 
which config is wrong.

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
> Attachments: d.log, screenshot-1.png
>
>
> It's easy to reproduce this bug:
> {code:java}
> // code placeholder
> Step 1. Deploy 3 nodes A, B, C with configuration A,B,C.
> Step 2. Deploy node D with configuration A,B,C,D; the cluster state is OK now.
> Step 3. Restart nodes A, B, C with configuration A,B,C,D. The leader will then 
> be D and the cluster hangs; it still accepts the `mntr` command, but other 
> commands like `ls /` are blocked.
> Step 4. Restart node D; the cluster state is back to normal now.
> {code}
> We have looked into the code of the 3.5.6 version, and we found it may be an 
> issue with the `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the 
> `workerPool` reference still exists. It will never work again, yet the cluster 
> still thinks it's OK.
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> that's OK, please assign this issue to me, and then I'll create a PR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108353#comment-17108353
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3829:
-

I see now :)
(You pasted the logs starting from line 308 in your previous comment; that is 
why I missed it.)

But I still don't know whether the two log lines are produced by the same 
CommitProcessor object instance. As I said, my understanding is that a new 
ZooKeeperServer with a fresh CommitProcessor gets created after each leader 
election.

What I see in line 225 is that this server was (re)started; then at least a 
whole leader election is missing from the log. Then I see that the server 
became a follower. 
In line 299 it can no longer follow the current leader. I guess a new leader 
election happens then, also missing from the logs. But we see that the 
LearnerZooKeeperServer is shutting down (also closing the CommitProcessor).

And then the next thing I see is what you are mentioning: "Configuring 
CommitProcessor with 24 worker threads". But this time the server is already a 
leader, as it is sending the UPTODATE messages (lines 321, 322). So my 
assumption would be that this time this CommitProcessor is inside a 
LeaderZooKeeperServer, not inside a LearnerZooKeeperServer. So these are 
actually different CommitProcessors and different workerPools.

Anyway, I am not saying you are wrong (this is quite a complicated piece of 
code). All I am saying is that I am not convinced yet, and it is very hard for 
me to tell what is happening, as I don't see the full logs and I was also not 
able to reproduce the problem locally (maybe my mistake, I don't know). I don't 
think it is related to docker-compose vs. plain docker. 

Based on your description, something must have gotten stuck; I am just not sure 
whether it is the workerPool in the CommitProcessor. Can I ask again: did you 
already try your proposed fix and see that it solves your original issue?
(You can download the ZooKeeper code, apply your fix, run `mvn clean 
install -DskipTests`, and swap the zookeeper jar files in the docker image for 
testing.)

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
> Attachments: d.log, screenshot-1.png
>
>
> It's easy to reproduce this bug:
> {code:java}
> // code placeholder
> Step 1. Deploy 3 nodes A, B, C with configuration A,B,C.
> Step 2. Deploy node D with configuration A,B,C,D; the cluster state is OK now.
> Step 3. Restart nodes A, B, C with configuration A,B,C,D. The leader will then 
> be D and the cluster hangs; it still accepts the `mntr` command, but other 
> commands like `ls /` are blocked.
> Step 4. Restart node D; the cluster state is back to normal now.
> {code}
> We have looked into the code of the 3.5.6 version, and we found it may be an 
> issue with the `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the 
> `workerPool` reference still exists. It will never work again, yet the cluster 
> still thinks it's OK.
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> that's OK, please assign this issue to me, and then I'll create a PR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread benwang li (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108338#comment-17108338
 ] 

benwang li edited comment on ZOOKEEPER-3829 at 5/15/20, 2:30 PM:
-

[~symat] 

Hi, I reproduced it with your docker-compose scripts.

My zoo.cfg follows the ClickHouse documentation 
[tips|https://clickhouse.tech/docs/en/operations/tips/]:

 
{code:java}
libenwang@ck015:~/git/zookeeper-docker-test$ cat conf/zoo.cfg
dataDir=/data
dataLogDir=/datalog

tickTime=2000
initLimit=3
syncLimit=10
maxClientCnxns=2000
maxSessionTimeout=6000
autopurge.snapRetainCount=10
autopurge.purgeInterval=1
preAllocSize=131072
snapCount=300
leaderServes=yes
standaloneEnabled=false
clientPort=2181
admin.serverPort=8084
 
{code}
Scripts

 
{code:java}
export ZOOKEEPER_GIT_REPO=~/git/zookeeper
export ZOOKEEPER_DOCKER_TEST_GIT_REPO=~/git/zookeeper-docker-test
# you always need to do a maven install to have the assembly tar.gz file 
updated!
cd $ZOOKEEPER_GIT_REPO
mvn clean install -DskipTests
cd $ZOOKEEPER_DOCKER_TEST_GIT_REPO
sudo rm -rf data
docker-compose --project-name zookeeper --file 
3_nodes_zk_mounted_data_folder.yml up -d
docker exec -it zookeeper_zoo1_1 /bin/bash /zookeeper/bin/zkCli.sh create 
/clickhouse aaa
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml create zoo4
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml start zoo4

docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml stop zoo1
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml create zoo1
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml start zoo1
 
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml stop zoo2
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml create zoo2
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml start zoo2
 
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml stop zoo3
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml create zoo3
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml start zoo3
# This hangs
docker exec -it zookeeper_zoo4_1 /bin/bash /zookeeper/bin/zkCli.sh ls /

docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml down
 
{code}
 


was (Author: sundyli):
[~symat] 

Hi, I reproduced it with your docker-compose scripts.

my zoo.cfg

 
{code:java}
libenwang@ck015:~/git/zookeeper-docker-test$ cat conf/zoo.cfg
dataDir=/data
dataLogDir=/datalog

tickTime=2000
initLimit=3
syncLimit=10
maxClientCnxns=2000
maxSessionTimeout=6000
autopurge.snapRetainCount=10
autopurge.purgeInterval=1
preAllocSize=131072
snapCount=300
leaderServes=yes
standaloneEnabled=false
clientPort=2181
admin.serverPort=8084
 
{code}
Scripts

 
{code:java}
export ZOOKEEPER_GIT_REPO=~/git/zookeeper
export ZOOKEEPER_DOCKER_TEST_GIT_REPO=~/git/zookeeper-docker-test
# you always need to do a maven install to have the assembly tar.gz file 
updated!
cd $ZOOKEEPER_GIT_REPO
mvn clean install -DskipTests
cd $ZOOKEEPER_DOCKER_TEST_GIT_REPO
sudo rm -rf data
docker-compose --project-name zookeeper --file 
3_nodes_zk_mounted_data_folder.yml up -d
docker exec -it zookeeper_zoo1_1 /bin/bash /zookeeper/bin/zkCli.sh create 
/clickhouse aaa
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml create zoo4
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml start zoo4

docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml stop zoo1
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml create zoo1
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml start zoo1
 
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml stop zoo2
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml create zoo2
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml start zoo2
 
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml stop zoo3
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml create zoo3
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml start zoo3
# This hangs
docker exec -it zookeeper_zoo4_1 /bin/bash /zookeeper/bin/zkCli.sh ls /

docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml down
 
{code}
 

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3

[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread benwang li (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108338#comment-17108338
 ] 

benwang li commented on ZOOKEEPER-3829:
---

[~symat] 

Hi, I reproduced it with your docker-compose scripts.

My zoo.cfg:

 
{code:java}
libenwang@ck015:~/git/zookeeper-docker-test$ cat conf/zoo.cfg
dataDir=/data
dataLogDir=/datalog

tickTime=2000
initLimit=3
syncLimit=10
maxClientCnxns=2000
maxSessionTimeout=6000
autopurge.snapRetainCount=10
autopurge.purgeInterval=1
preAllocSize=131072
snapCount=300
leaderServes=yes
standaloneEnabled=false
clientPort=2181
admin.serverPort=8084
 
{code}
Scripts

 
{code:java}
export ZOOKEEPER_GIT_REPO=~/git/zookeeper
export ZOOKEEPER_DOCKER_TEST_GIT_REPO=~/git/zookeeper-docker-test
# you always need to do a maven install to have the assembly tar.gz file 
updated!
cd $ZOOKEEPER_GIT_REPO
mvn clean install -DskipTests
cd $ZOOKEEPER_DOCKER_TEST_GIT_REPO
sudo rm -rf data
docker-compose --project-name zookeeper --file 
3_nodes_zk_mounted_data_folder.yml up -d
docker exec -it zookeeper_zoo1_1 /bin/bash /zookeeper/bin/zkCli.sh create 
/clickhouse aaa
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml create zoo4
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml start zoo4

docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml stop zoo1
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml create zoo1
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml start zoo1
 
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml stop zoo2
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml create zoo2
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml start zoo2
 
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml stop zoo3
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml create zoo3
docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml start zoo3
# This hangs
docker exec -it zookeeper_zoo4_1 /bin/bash /zookeeper/bin/zkCli.sh ls /

docker-compose --project-name zookeeper --file 
4_nodes_zk_mounted_data_folder.yml down
 
{code}
 

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
> Attachments: d.log, screenshot-1.png
>
>
> It's easy to reproduce this bug:
> {code:java}
> // code placeholder
> Step 1. Deploy 3 nodes A, B, C with configuration A,B,C.
> Step 2. Deploy node D with configuration A,B,C,D; the cluster state is OK now.
> Step 3. Restart nodes A, B, C with configuration A,B,C,D. The leader will then 
> be D and the cluster hangs; it still accepts the `mntr` command, but other 
> commands like `ls /` are blocked.
> Step 4. Restart node D; the cluster state is back to normal now.
> {code}
> We have looked into the code of the 3.5.6 version, and we found it may be an 
> issue with the `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the 
> `workerPool` reference still exists. It will never work again, yet the cluster 
> still thinks it's OK.
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> that's OK, please assign this issue to me, and then I'll create a PR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3824) ZooKeeper dynamic reconfig doesn't work with GSSAPI/SASL enabled Quorum authn/z

2020-05-15 Thread Rajkiran Sura (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108266#comment-17108266
 ] 

Rajkiran Sura commented on ZOOKEEPER-3824:
--

Tagging [~symat] [~shralex] [~hanm] [~eolivelli] in case they have any thoughts 
regarding this issue.

 

Thanks,

Rajkiran

> ZooKeeper dynamic reconfig doesn't work with GSSAPI/SASL enabled Quorum 
> authn/z
> ---
>
> Key: ZOOKEEPER-3824
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3824
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: kerberos, leaderElection, quorum, server
>Affects Versions: 3.5.6
> Environment: O.S. :- RHEL7
>Reporter: Rajkiran Sura
>Priority: Major
>
> With the 'DynamicReconfig' feature in v3.5.6, servers can ideally be added and 
> removed without restarting the ZooKeeper service on any of the nodes.
> But with Kerberos (GSSAPI via SASL) enabled quorum 
> authentication/authorization, this is not possible: when you try to add a new 
> server, it won't be able to connect to any of the members in the ensemble, and 
> the data won't be synced, because all the members reject it based on 
> authorization. To make it work, we need to do 'reconfig' and then restart the 
> leader, the new member, and the rest of the members.
> Is this the expected behavior with quorum auth + DynamicReconfig? Or am I 
> missing something here?
> This is our basic quorum-auth config:
> {quote}quorum.auth.serverRequireSasl=true
>  quorum.auth.kerberos.servicePrincipal=zookeeper/_HOST
>  quorum.auth.enableSasl=true
>  quorum.auth.learner.saslLoginContext=QuorumLearner
>  quorum.auth.learnerRequireSasl=true
>  quorum.cnxn.threads.size=20
>  quorum.auth.server.saslLoginContext=QuorumServer
> {quote}
> FTR: I raised this question in [ZooKeeper-user 
> forum|http://zookeeper-user.578899.n2.nabble.com/ZooKeeper-dynamic-reconfig-issue-when-Quorum-authn-authz-is-enabled-td7584927.html]
>  and both Mate and Enrico suspect this to be a bug.
> This is also easily reproducible in a Kerberos (GSSAPI via SASL) enabled 
> quorum-based ensemble.
>  
> Regards,
> Rajkiran
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3814) ZooKeeper caching of config

2020-05-15 Thread Rajkiran Sura (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108259#comment-17108259
 ] 

Rajkiran Sura commented on ZOOKEEPER-3814:
--

Many thanks Mate, for looking into this. Glad that you could pin-point the 
problem.

 

Regards,

Rajkiran

> ZooKeeper caching of config
> ---
>
> Key: ZOOKEEPER-3814
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3814
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.6
>Reporter: Rajkiran Sura
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
> We recently upgraded our 5-node ZooKeeper ensemble from v3.4.8 to v3.5.6 and 
> encountered no issues as such.
> This is what the ZooKeeper config looks like:
> {quote}tickTime=2000
> dataDir=/zookeeper-data/
> initLimit=5
> syncLimit=2
> maxClientCnxns=2048
> autopurge.snapRetainCount=3
> autopurge.purgeInterval=1
> 4lw.commands.whitelist=stat, ruok, conf, isro, mntr
> authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
> requireClientAuthScheme=sasl
> quorum.cnxn.threads.size=20
> quorum.auth.enableSasl=true
> quorum.auth.kerberos.servicePrincipal= zookeeper/_HOST
> quorum.auth.learnerRequireSasl=true
> quorum.auth.learner.saslLoginContext=QuorumLearner
> quorum.auth.serverRequireSasl=true
> quorum.auth.server.saslLoginContext=QuorumServer
> server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> server.22=node5.bar.com:2888:3888;2181
> {quote}
> Post-upgrade, we had to migrate server.22 on the same node, but with the 
> *FOO*.bar.com domain name due to Kerberos referral issues. We also used a 
> different server identifier, *23*, when we migrated. Here is what the new 
> config looked like:
> {quote}server.17=node1.foo.bar.com:2888:3888;2181
> server.19=node2.foo.bar.com:2888:3888;2181
> server.20=node3.foo.bar.com:2888:3888;2181
> server.21=node4.foo.bar.com:2888:3888;2181
> *server.23=node5.{color:#00875a}foo{color}.bar.com:2888:3888;2181*
> {quote}
> We restarted all the nodes in the ensemble with the updated config above. The 
> migrated node joined the quorum successfully and served all clients directly 
> connected to it, without any issues.
> Recently, when a leader election happened, 
> server.*23*=node5.foo.bar.com (the migrated node) was chosen as leader (as it 
> has the highest ID). But then ZooKeeper was unable to serve any clients, and 
> *all* the servers were _somehow still_ trying to establish a channel to 22 
> (old DNS name: node5.bar.com), throwing the error below in a loop:
> {quote}{{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumPeer$QuorumServer@196] - Failed to resolve 
> address: node4.bar.com}}
> {{java.net.UnknownHostException: node5.bar.com: Name or service not known}}
> {{ at java.base/java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)}}
> {{ at 
> java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)}}
> {{ at 
> java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1515)}}
> {{ at 
> java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)}}
> {{ at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1364)}}
> {{ at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1298)}}
> {{ at java.base/java.net.InetAddress.getByName(InetAddress.java:1248)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:194)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumPeer.recreateSocketAddresses(QuorumPeer.java:774)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:701)}}
> {{ at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)}}
> {{ at 
> org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)}}
> {{ at java.base/java.lang.Thread.run(Thread.java:834)}}
> {{2020-05-02 01:43:03,026 [myid:23] - WARN 
> [WorkerSender[myid=23]:QuorumCnxManager@679] - Cannot open channel to 22 at 
> election address node5.bar.com:3888}}
> {{java.net.UnknownHostException: node5.bar.com}}
> {{ at 
> java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)}}
> {{ at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:403)}}
> {{ at java.base/java.net.Socket.connect(Socket.java:591)}}
> {{ at 
> o

[jira] [Comment Edited] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread benwang li (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108258#comment-17108258
 ] 

benwang li edited comment on ZOOKEEPER-3829 at 5/15/20, 1:04 PM:
-

 

[~symat]

The logs are in
{code:java}
line 307:  2020-05-14 18:04:12,022 [myid:4] - INFO 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@362]
 - Shutting down

line 319: 2020-05-14 18:04:29,000 [myid:4] - INFO 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@256]
 - Configuring CommitProcessor with 24 worker threads.
{code}
Thanks for your feedback. I reproduced it without docker; I will try to 
reproduce it with docker-compose.

 


was (Author: sundyli):
 

[~symat]

The logs are in
{code:java}
line 307:  2020-05-14 18:04:12,022 [myid:4] - INFO 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@362]
 - Shutting down

line 309: 2020-05-14 18:04:29,000 [myid:4] - INFO 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@256]
 - Configuring CommitProcessor with 24 worker threads.
{code}

Thanks for your feedback, I reproduced it without docker,  I will try to 
reproduce it with docker-compose.

 

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
> Attachments: d.log, screenshot-1.png
>
>
> It's easy to reproduce this bug:
> {code:java}
> // code placeholder
> Step 1. Deploy 3 nodes A, B, C with configuration A,B,C.
> Step 2. Deploy node D with configuration A,B,C,D; the cluster state is OK now.
> Step 3. Restart nodes A, B, C with configuration A,B,C,D. The leader will then 
> be D and the cluster hangs; it still accepts the `mntr` command, but other 
> commands like `ls /` are blocked.
> Step 4. Restart node D; the cluster state is back to normal now.
> {code}
> We have looked into the code of the 3.5.6 version, and we found it may be an 
> issue with the `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the 
> `workerPool` reference still exists. It will never work again, yet the cluster 
> still thinks it's OK.
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> that's OK, please assign this issue to me, and then I'll create a PR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread benwang li (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108258#comment-17108258
 ] 

benwang li commented on ZOOKEEPER-3829:
---

 

[~symat]

The logs are in
{code:java}
line 307:  2020-05-14 18:04:12,022 [myid:4] - INFO 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@362]
 - Shutting down

line 309: 2020-05-14 18:04:29,000 [myid:4] - INFO 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@256]
 - Configuring CommitProcessor with 24 worker threads.
{code}

Thanks for your feedback. I reproduced it without docker; I will try to 
reproduce it with docker-compose.

 

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
> Attachments: d.log, screenshot-1.png
>
>
> It's easy to reproduce this bug:
> {code:java}
> // code placeholder
> Step 1. Deploy 3 nodes A, B, C with configuration A,B,C.
> Step 2. Deploy node D with configuration A,B,C,D; the cluster state is OK now.
> Step 3. Restart nodes A, B, C with configuration A,B,C,D. The leader will then 
> be D and the cluster hangs; it still accepts the `mntr` command, but other 
> commands like `ls /` are blocked.
> Step 4. Restart node D; the cluster state is back to normal now.
> {code}
> We have looked into the code of the 3.5.6 version, and we found it may be an 
> issue with the `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the 
> `workerPool` reference still exists. It will never work again, yet the cluster 
> still thinks it's OK.
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> that's OK, please assign this issue to me, and then I'll create a PR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread benwang li (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

benwang li updated ZOOKEEPER-3829:
--
Attachment: screenshot-1.png

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
> Attachments: d.log, screenshot-1.png
>
>
> It's easy to reproduce this bug:
> {code:java}
> // code placeholder
> Step 1. Deploy 3 nodes A, B, C with configuration A,B,C.
> Step 2. Deploy node D with configuration A,B,C,D; the cluster state is OK now.
> Step 3. Restart nodes A, B, C with configuration A,B,C,D. The leader will then 
> be D and the cluster hangs; it still accepts the `mntr` command, but other 
> commands like `ls /` are blocked.
> Step 4. Restart node D; the cluster state is back to normal now.
> {code}
> We have looked into the code of the 3.5.6 version, and we found it may be an 
> issue with the `workerPool`.
> The `CommitProcessor` shutdown also shuts down the `workerPool`, but the 
> `workerPool` reference still exists. It will never work again, yet the cluster 
> still thinks it's OK.
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> that's OK, please assign this issue to me, and then I'll create a PR.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108113#comment-17108113
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3829:
-

Did you actually see this ["Shutting 
down"|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L632]
 printout in the logs before the {{start}} method was called on the same CommitProcessor?

I see this one in your logs:
{code:java}
310 2020-05-14 18:04:12,023 [myid:4] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FinalRequestProcessor@514]
 - shutdown of request processor complete
{code}
But this is about shutting down the {{FinalRequestProcessor}} and not the 
{{CommitProcessor}}.
  

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
> Attachments: d.log
>
>
> It's easy to reproduce this bug.
> {code:java}
> //代码占位符
>  
> Step 1. Deploy 3 nodes  A,B,C with configuration A,B,C .
> Step 2. Deploy node ` D` with configuration  `A,B,C,D` , cluster state is ok 
> now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will 
> be D, cluster hangs, but it can accept `mntr` command, other command like `ls 
> /` will be blocked.
> Step 4. Restart nodes D, cluster state is back to normal now.
>  
> {code}
>  
> We have looked into the code of 3.5.6 version, and we found it may be the 
> issue of  `workerPool` .
> The `CommitProcessor` shutdown and make `workerPool` shutdown, but 
> `workerPool` still exists. It will never work anymore, yet the cluster still 
> thinks it's ok.
>  
> I think the bug may still exist in master branch.
> We have tested it in our machines by reset the `workerPool` to null. If it's 
> ok, please assign this issue to me, and then I'll create a PR. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108106#comment-17108106
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3829:
-

I failed to reproduce your case. I created docker-compose files 
([https://github.com/symat/zookeeper-docker-test]) and, using 3.5.6, executed 
these steps:
 * start A,B,C with config (A,B,C)
 * start D with config (A,B,C,D)
 * stop A
 * start A with config (A,B,C,D)
 * stop B
 * start B with config (A,B,C,D)
 * stop C
 * start C with config (A,B,C,D)

In the end, everything worked just fine for me: the leader was D, all nodes were 
up and forming a quorum (A,B,C,D), and zkCli worked ({{ls /}}).
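
If you want to double-check the nodes after such a rolling restart, a small 
helper along these lines can query each server. This is only a sketch (plain 
Java, not part of ZooKeeper); the zoo1-zoo4 hostnames and the whitelisted 
{{srvr}} four-letter command are taken from my config below:
{code:java}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;

public class FourLetterCheck {

    // Sends a four-letter command (e.g. "srvr", "mntr") and returns the reply.
    static String fourLetter(String host, int port, String cmd) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(cmd.getBytes());
            out.flush();
            StringBuilder reply = new StringBuilder();
            BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                reply.append(line).append('\n');
            }
            return reply.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // zoo1..zoo4 are the hostnames used in the docker-compose setup below.
        for (String host : new String[] {"zoo1", "zoo2", "zoo3", "zoo4"}) {
            System.out.println("=== " + host + " ===");
            // The "srvr" reply includes the node's Mode (leader/follower).
            System.out.println(fourLetter(host, 2181, "srvr"));
        }
    }
}
{code}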

 
 There must be some differences between your reproduction and mine. Can you 
please share your zoo.cfg?

Mine looks like:
{code:java}
dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=5
syncLimit=2
autopurge.snapRetainCount=3
autopurge.purgeInterval=0
maxClientCnxns=60

standaloneEnabled=true
admin.enableServer=true
localSessionsEnabled=true
localSessionsUpgradingEnabled=true

4lw.commands.whitelist=stat, ruok, conf, isro, wchc, wchp, srvr, mntr, cons

clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory

# server host/port config in my case (when I have 4 nodes)
server.1=zoo1:2888:3888;2181
server.2=zoo2:2888:3888;2181
server.3=zoo3:2888:3888;2181
server.4=zoo4:2888:3888;2181
{code}
I checked the log file you uploaded, but I don't really see why you think the 
problem is with the CommitProcessor. Maybe I'm missing something. Is this the 
full log file from your D node?

I also checked the code. I think the {{CommitProcessor}} class should never be 
reused after {{shutdown()}} is called. After a new leader election, a new 
{{LeaderZooKeeperServer}} / {{FollowerZooKeeperServer}} / 
{{ObserverZooKeeperServer}} object is created (depending on the role of 
the given server), with a fresh {{CommitProcessor}} and a new {{workerPool}}. So 
AFAICT (based only on a high-level look at the code) it shouldn't really matter 
whether {{workerPool}} is set to null in the shutdown method. But maybe I just 
don't follow your reasoning, or missed something in the code. Feel free to 
create a PR so we can see what you suggest.
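
To make the lifecycle I described concrete, here is a tiny sketch (illustrative 
class and method names only, not ZooKeeper's real ones) of the pattern: every 
election builds a new server object with its own fresh processor and pool, so 
the shut-down pool is simply discarded rather than reused.
{code:java}
// Illustrative classes only (not ZooKeeper's real LeaderZooKeeperServer /
// CommitProcessor / WorkerService): the point is that every role change builds
// a fresh processor with its own pool, so the shut-down pool is discarded.
class WorkerPoolSketch {
    void stop() { /* would shut down the underlying executors */ }
}

class CommitProcessorSketch {
    final WorkerPoolSketch workerPool = new WorkerPoolSketch(); // fresh pool per processor
    void start() { /* would spawn the commit thread */ }
    void shutdown() { workerPool.stop(); }
}

public class QuorumPeerSketch {
    private CommitProcessorSketch commitProcessor;

    // Called after every leader election, whatever role this peer ends up with.
    void onNewRole() {
        if (commitProcessor != null) {
            commitProcessor.shutdown();                // old processor and pool are dropped
        }
        commitProcessor = new CommitProcessorSketch(); // never reuse the shut-down one
        commitProcessor.start();
    }

    public static void main(String[] args) {
        QuorumPeerSketch peer = new QuorumPeerSketch();
        peer.onNewRole(); // e.g. became a follower
        peer.onNewRole(); // e.g. re-elected as leader: a brand-new processor again
    }
}
{code}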

Did you already try your proposed fix, and did it solve your original issue?

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
> Attachments: d.log
>
>
> It's easy to reproduce this bug.
> {code:java}
> //代码占位符
>  
> Step 1. Deploy 3 nodes  A,B,C with configuration A,B,C .
> Step 2. Deploy node ` D` with configuration  `A,B,C,D` , cluster state is ok 
> now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will 
> be D, cluster hangs, but it can accept `mntr` command, other command like `ls 
> /` will be blocked.
> Step 4. Restart nodes D, cluster state is back to normal now.
>  
> {code}
>  
> We have looked into the code of 3.5.6 version, and we found it may be the 
> issue of  `workerPool` .
> The `CommitProcessor` shutdown and make `workerPool` shutdown, but 
> `workerPool` still exists. It will never work anymore, yet the cluster still 
> thinks it's ok.
>  
> I think the bug may still exist in master branch.
> We have tested it in our machines by reset the `workerPool` to null. If it's 
> ok, please assign this issue to me, and then I'll create a PR. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread benwang li (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108046#comment-17108046
 ] 

benwang li edited comment on ZOOKEEPER-3829 at 5/15/20, 8:04 AM:
-

We start the `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L455].

We shut down the `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L637].

But when we call the `start` method again, the `workerPool` will not work 
anymore. I have attached the node D logs as `d.log`, and we can see this happen 
there.

 
{code:java}
 308 2020-05-14 18:04:12,022 [myid:4] - INFO  
[FollowerRequestProcessor:4:FollowerRequestProcessor@110] - 
FollowerRequestProcessor exited loop!
309 2020-05-14 18:04:12,022 [myid:4] - INFO  
[CommitProcessor:4:CommitProcessor@195] - CommitProcessor exited loop!
310 2020-05-14 18:04:12,023 [myid:4] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FinalRequestProcessor@514]
 - shutdown of request processor complete
311 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@655]
 - Created new input stream /data1/zookeeper/logs/version-2/log.2a000b
312 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@658]
 - Created new input archive /data1/zookeeper/logs/version-2/log.2a000b
313 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@696]
 - EOF exception java.io.EOFException: Failed to read 
/data1/zookeeper/logs/version-2/log.2a000b
314 --
315 2020-05-14 18:04:29,000 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274]
 - Adding session 0x3082f5048fc
316 2020-05-14 18:04:29,000 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274]
 - Adding session 0x40a33f8f3f40002
317 2020-05-14 18:04:29,000 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274]
 - Adding session 0x40a33f8f3f4
318 2020-05-14 18:04:29,000 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274]
 - Adding session 0x40a33f8f3f40001
319 2020-05-14 18:04:29,000 [myid:4] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@256]
 - Configuring CommitProcessor with 24 worker threads.
320 2020-05-14 18:04:29,002 [myid:4] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):ContainerManager@64]
 - Using checkIntervalMs=6 maxPerMinute=1
321 2020-05-14 18:04:29,003 [myid:4] - DEBUG 
[LearnerHandler-/146.196.79.232:38708:LearnerHandler@534] - Sending UPTODATE 
message to 3
{code}
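
To illustrate the failure mode we suspect, here is a minimal Java sketch (not 
ZooKeeper's real `CommitProcessor` / `WorkerService` classes, just plain 
java.util.concurrent) of why a pool that has been shut down must be recreated 
before the processor is started again, which is what resetting `workerPool` to 
null would force:
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Simplified sketch only: these are NOT ZooKeeper's real classes, just an
// illustration of why a shut-down pool must be recreated before reuse.
public class PoolReuseSketch {

    private ExecutorService workerPool;

    public void start() {
        // Guard mirrors the proposed fix: only (re)create the pool when it is
        // missing, so a processor restarted after shutdown() gets a live pool.
        if (workerPool == null) {
            workerPool = Executors.newFixedThreadPool(4);
        }
    }

    public void shutdown() {
        if (workerPool != null) {
            workerPool.shutdown();
            // Without this reset, a later start() keeps the terminated pool and
            // every submitted task fails with RejectedExecutionException.
            workerPool = null;
        }
    }

    public void submit(Runnable task) {
        workerPool.submit(task);
    }

    public static void main(String[] args) {
        PoolReuseSketch p = new PoolReuseSketch();
        p.start();
        p.submit(() -> System.out.println("first incarnation works"));
        p.shutdown();

        p.start(); // succeeds only because shutdown() reset the pool to null
        p.submit(() -> System.out.println("second incarnation works"));
        p.shutdown();
    }
}
{code}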




 







was (Author: sundyli):
We start `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L455]
 .

We shutdown `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L637].

But when we call `start` method again, the `workerPool` will work anymore.  I  
submit the node D logs attachment `d.log`, and we can see that happens.

 
{code:java}
 308 2020-05-14 18:04:12,022 [myid:4] - INFO  
[FollowerRequestProcessor:4:FollowerRequestProcessor@110] - 
FollowerRequestProcessor exited loop!
309 2020-05-14 18:04:12,022 [myid:4] - INFO  
[CommitProcessor:4:CommitProcessor@195] - CommitProcessor exited loop!
310 2020-05-14 18:04:12,023 [myid:4] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FinalRequestProcessor@514]
 - shutdown of request processor complete
311 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@655]
 - Created new input stream /data1/zookeeper/logs/version-2/log.2a000b
312 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@658]
 - Created new input archive /data1/zookeeper/logs/version-2/log.2a000b
313 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@696]
 - EOF exception java.io.EOFException: Failed to read 
/data1/zookeep

[jira] [Comment Edited] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread benwang li (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108046#comment-17108046
 ] 

benwang li edited comment on ZOOKEEPER-3829 at 5/15/20, 7:55 AM:
-

We start the `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L455].

We shut down the `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L637].

But when we call the `start` method again, the `workerPool` will not work 
anymore. I have attached the node D logs as `d.log`, and we can see this happen 
there.

 
{code:java}
 308 2020-05-14 18:04:12,022 [myid:4] - INFO  
[FollowerRequestProcessor:4:FollowerRequestProcessor@110] - 
FollowerRequestProcessor exited loop!
309 2020-05-14 18:04:12,022 [myid:4] - INFO  
[CommitProcessor:4:CommitProcessor@195] - CommitProcessor exited loop!
310 2020-05-14 18:04:12,023 [myid:4] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FinalRequestProcessor@514]
 - shutdown of request processor complete
311 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@655]
 - Created new input stream /data1/zookeeper/logs/version-2/log.2a000b
312 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@658]
 - Created new input archive /data1/zookeeper/logs/version-2/log.2a000b
313 2020-05-14 18:04:12,024 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):FileTxnLog$FileTxnIterator@696]
 - EOF exception java.io.EOFException: Failed to read 
/data1/zookeeper/logs/version-2/log.2a000b
314 --
315 2020-05-14 18:04:29,000 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274]
 - Adding session 0x3082f5048fc
316 2020-05-14 18:04:29,000 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274]
 - Adding session 0x40a33f8f3f40002
317 2020-05-14 18:04:29,000 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274]
 - Adding session 0x40a33f8f3f4
318 2020-05-14 18:04:29,000 [myid:4] - DEBUG 
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):SessionTrackerImpl@274]
 - Adding session 0x40a33f8f3f40001
319 2020-05-14 18:04:29,000 [myid:4] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):CommitProcessor@256]
 - Configuring CommitProcessor with 24 worker threads.
320 2020-05-14 18:04:29,002 [myid:4] - INFO  
[QuorumPeer[myid=4](plain=/0:0:0:0:0:0:0:0:2183)(secure=disabled):ContainerManager@64]
 - Using checkIntervalMs=6 maxPerMinute=1
321 2020-05-14 18:04:29,003 [myid:4] - DEBUG 
[LearnerHandler-/146.196.79.232:38708:LearnerHandler@534] - Sending UPTODATE 
message to 3
{code}




 







was (Author: sundyli):
We start `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L455]
 .

We shutdown `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L637].

But when we call `start` method again, the `workerPool` will work anymore.  I  
will submit the node D logs, and we can see that happens.
 





> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
> Attachments: d.log
>
>
> It's easy to reproduce this bug.
> {code:java}
> //代码占位符
>  
> Step 1. Deploy 3 nodes  A,B,C with configuration A,B,C .
> Step 2. Deploy node ` D` with configuration  `A,B,C,D` , cluster state is ok 
> now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will 
> be D, cluster hangs, but it can accept `mntr` command, other command like `ls 
> /` will be blocked.
> Step 4. Restart nodes D, cluster state is back to normal now.
>  
> {code}
>  
> We have looked into the code of 3.5.6 version, and we found it may be the 
> issue of  `workerPool` .
> The `CommitProcessor` shutdown and make `workerPool` shutdown, but 
> `workerPool` still exists. It will never work anymore, y

[jira] [Updated] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread benwang li (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

benwang li updated ZOOKEEPER-3829:
--
Attachment: d.log

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
> Attachments: d.log
>
>
> It's easy to reproduce this bug.
> {code:java}
> //代码占位符
>  
> Step 1. Deploy 3 nodes  A,B,C with configuration A,B,C .
> Step 2. Deploy node ` D` with configuration  `A,B,C,D` , cluster state is ok 
> now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will 
> be D, cluster hangs, but it can accept `mntr` command, other command like `ls 
> /` will be blocked.
> Step 4. Restart nodes D, cluster state is back to normal now.
>  
> {code}
>  
> We have looked into the code of 3.5.6 version, and we found it may be the 
> issue of  `workerPool` .
> The `CommitProcessor` shutdown and make `workerPool` shutdown, but 
> `workerPool` still exists. It will never work anymore, yet the cluster still 
> thinks it's ok.
>  
> I think the bug may still exist in master branch.
> We have tested it in our machines by reset the `workerPool` to null. If it's 
> ok, please assign this issue to me, and then I'll create a PR. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread benwang li (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108046#comment-17108046
 ] 

benwang li commented on ZOOKEEPER-3829:
---

We start the `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L455].

We shut down the `CommitProcessor` 
[here|https://github.com/apache/zookeeper/blob/e87bad6774e7269ef21a156aff9dad089ef54794/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/CommitProcessor.java#L637].

But when we call the `start` method again, the `workerPool` will not work 
anymore. I will submit the node D logs, and we can see this happen there.
 





> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
>
> It's easy to reproduce this bug.
> {code:java}
> //代码占位符
>  
> Step 1. Deploy 3 nodes  A,B,C with configuration A,B,C .
> Step 2. Deploy node ` D` with configuration  `A,B,C,D` , cluster state is ok 
> now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will 
> be D, cluster hangs, but it can accept `mntr` command, other command like `ls 
> /` will be blocked.
> Step 4. Restart nodes D, cluster state is back to normal now.
>  
> {code}
>  
> We have looked into the code of 3.5.6 version, and we found it may be the 
> issue of  `workerPool` .
> The `CommitProcessor` shutdown and make `workerPool` shutdown, but 
> `workerPool` still exists. It will never work anymore, yet the cluster still 
> thinks it's ok.
>  
> I think the bug may still exist in master branch.
> We have tested it in our machines by reset the `workerPool` to null. If it's 
> ok, please assign this issue to me, and then I'll create a PR. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread benwang li (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

benwang li updated ZOOKEEPER-3829:
--
Description: 
It's easy to reproduce this bug.
{code:java}
// code placeholder

Step 1. Deploy 3 nodes A, B, C with configuration A,B,C.
Step 2. Deploy node D with configuration A,B,C,D; the cluster state is ok now.
Step 3. Restart nodes A, B, C with configuration A,B,C,D. The leader will then 
be D and the cluster hangs: it still accepts the `mntr` command, but other 
commands like `ls /` are blocked.

Step 4. Restart node D; the cluster state is back to normal now.
{code}

We have looked into the code of version 3.5.6, and we found it may be an issue 
with the `workerPool`.

Shutting down the `CommitProcessor` also shuts down the `workerPool`, but the 
`workerPool` reference still exists. It will never work again, yet the cluster 
still thinks it's ok.

I think the bug may still exist in the master branch.

We have tested it on our machines by resetting the `workerPool` to null. If 
that's ok, please assign this issue to me, and then I'll create a PR.

 

 

 

  was:
It's easy to reproduce this bug.
{code:java}
//代码占位符
 
Step 1. Deploy 3 nodes  A,B,C with configuration A,B,C .
Step 2. Deploy node ` D` with configuration  `A,B,C,D` , cluster state is ok 
now.
Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will be 
D, cluster hangs.

Step 4. Restart nodes D, cluster state is back to normal now.
 
{code}
 

We have looked into the code of 3.5.6 version, and we found it may be the issue 
of  `workerPool` .

The `CommitProcessor` shutdown and make `workerPool` shutdown, but `workerPool` 
still exists. It will never work anymore, yet the cluster still thinks it's ok.

 

I think the bug may still exist in master branch.

We have tested it in our machines by reset the `workerPool` to null. If it's 
ok, please assign this issue to me, and then I'll create a PR. 

 

 

 


> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
>
> It's easy to reproduce this bug.
> {code:java}
> //代码占位符
>  
> Step 1. Deploy 3 nodes  A,B,C with configuration A,B,C .
> Step 2. Deploy node ` D` with configuration  `A,B,C,D` , cluster state is ok 
> now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will 
> be D, cluster hangs, but it can accept `mntr` command, other command like `ls 
> /` will be blocked.
> Step 4. Restart nodes D, cluster state is back to normal now.
>  
> {code}
>  
> We have looked into the code of 3.5.6 version, and we found it may be the 
> issue of  `workerPool` .
> The `CommitProcessor` shutdown and make `workerPool` shutdown, but 
> `workerPool` still exists. It will never work anymore, yet the cluster still 
> thinks it's ok.
>  
> I think the bug may still exist in master branch.
> We have tested it in our machines by reset the `workerPool` to null. If it's 
> ok, please assign this issue to me, and then I'll create a PR. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ZOOKEEPER-3828) zookeeper CLI client gets connection timeout when the leader is restarted

2020-05-15 Thread Mate Szalay-Beko (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mate Szalay-Beko reassigned ZOOKEEPER-3828:
---

Assignee: (was: Mate Szalay-Beko)

> zookeeper CLI client gets connection timeout when the leader is restarted
> --
>
> Key: ZOOKEEPER-3828
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3828
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: java client
>Affects Versions: 3.6.1
>Reporter: Aishwarya Soni
>Priority: Minor
>
> I have configured 5 nodes zookeeper cluster using 3.6.1 version in a docker 
> containerized environment. As a part of some destructive testing, I restarted 
> zookeeper leader. Now, re-election happened and all 5 nodes (containers) are 
> back in good state with new leader. But when I login to one of the container 
> and go inside zk Cli (./zkCli.sh) and run the cmd *ls /* I see below error,
> [zk: localhost:2181(CONNECTING) 1]
> [zk: localhost:2181(CONNECTING) 1] ls /
> 2020-05-14 23:48:26,556 [myid:localhost:2181] - WARN  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1229] - Client session 
> timed out, have not heard from server in 30001ms for session id 0x0
> 2020-05-14 23:48:26,556 [myid:localhost:2181] - WARN  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 
> for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting 
> reconnect except it is a SessionExpiredException.
> org.apache.zookeeper.ClientCnxn$SessionTimeoutException: 
> Client session timed out, have not heard from server in 30001ms for session 
> id 0x0
>  at 
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1230)
> KeeperErrorCode = ConnectionLoss for /
> [zk: localhost:2181(CONNECTING) 2] 2020-05-14 23:48:28,089 
> [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1154] - Opening socket 
> connection to server localhost/127.0.0.1:2181.
> 2020-05-14 23:48:28,089 [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1156] - SASL config 
> status: Will not attempt to authenticate using SASL (unknown error)
> 2020-05-14 23:48:28,090 [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@986] - Socket 
> connection established, initiating session, client: /127.0.0.1:60384, server: 
> localhost/127.0.0.1:2181
> 2020-05-14 23:48:58,119 [myid:localhost:2181] - WARN  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1229] - Client session 
> timed out, have not heard from server in 30030ms for session id 0x0
> 2020-05-14 23:48:58,120 [myid:localhost:2181] - WARN  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 
> for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting 
> reconnect except it is a SessionExpiredException.
> org.apache.zookeeper.ClientCnxn$SessionTimeoutException: 
> Client session timed out, have not heard from server in 30030ms for session 
> id 0x0
>  at 
> org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1230)
> 2020-05-14 23:49:00,003 [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1154] - Opening socket 
> connection to server localhost/127.0.0.1:2181.
> 2020-05-14 23:49:00,004 [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1156] - SASL config 
> status: Will not attempt to authenticate using SASL (unknown error)
> 2020-05-14 23:49:00,004 [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@986] - Socket 
> connection established, initiating session, client: /127.0.0.1:32936, server: 
> localhost/127.0.0.1:2181
> 2020-05-14 23:49:30,032 [myid:localhost:2181] - WARN  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1229] - Client session 
> timed out, have not heard from server in 30029ms for session id 0x0
> 2020-05-14 23:49:30,033 [myid:localhost:2181] - WARN  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 
> for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting 
> reconnect except it is a SessionExpiredException.
> org.apache.zookeeper.ClientCnxn$SessionTimeoutException: 
> Client session timed out, 

[jira] [Commented] (ZOOKEEPER-3829) Zookeeper refuses request after node expansion

2020-05-15 Thread benwang li (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108023#comment-17108023
 ] 

benwang li commented on ZOOKEEPER-3829:
---

[~eolivelli] I think it's the same even on the latest release, but I haven't 
tested it there.

My workmates and I can reproduce it on version 3.5.6 every time. How should we 
create a reproducer test?

> Zookeeper refuses request after node expansion
> --
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.6
>Reporter: benwang li
>Priority: Major
>
> It's easy to reproduce this bug.
> {code:java}
> //代码占位符
>  
> Step 1. Deploy 3 nodes  A,B,C with configuration A,B,C .
> Step 2. Deploy node ` D` with configuration  `A,B,C,D` , cluster state is ok 
> now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will 
> be D, cluster hangs.
> Step 4. Restart nodes D, cluster state is back to normal now.
>  
> {code}
>  
> We have looked into the code of 3.5.6 version, and we found it may be the 
> issue of  `workerPool` .
> The `CommitProcessor` shutdown and make `workerPool` shutdown, but 
> `workerPool` still exists. It will never work anymore, yet the cluster still 
> thinks it's ok.
>  
> I think the bug may still exist in master branch.
> We have tested it in our machines by reset the `workerPool` to null. If it's 
> ok, please assign this issue to me, and then I'll create a PR. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ZOOKEEPER-3822) Zookeeper 3.6.1 EndOfStreamException

2020-05-15 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108025#comment-17108025
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3822:
-

{quote}And from there all servers just report

...

and don't recover.
{quote}
What do you mean by "don't recover"? Were the servers unreachable at this 
point? The exception you pasted only shows that some clients closed their 
connections to the ZooKeeper server.

> Zookeeper 3.6.1 EndOfStreamException
> 
>
> Key: ZOOKEEPER-3822
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3822
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.6.1
>Reporter: Sebastian Schmitz
>Priority: Critical
> Attachments: kafka.log, kafka_test.log, zookeeper.log, 
> zookeeper_test.log
>
>
> Hello,
> after Zookeeper 3.6.1 solved the issue with leader-election containing the IP 
> and so causing it to fail in separate networks, like in our docker-setup I 
> updated from 3.4.14 to 3.6.1 in Dev- and Test-Environments. It all went 
> smoothly and ran for one day. This night I had a new Update of the 
> environment as we deploy as a whole package of all containers (Kafka, 
> Zookeeper, Mirrormaker etc.) we also replace the Zookeeper-Containers with 
> latest ones. In this case, there was no change, the containers were just 
> removed and deployed again. As the config and data of zookeeper is not stored 
> inside the containers that's not a problem but this night it broke the whole 
> clusters of Zookeeper and so also Kafka was down.
>  * zookeeper_node_1 was stopped and the container removed and created again
>  * zookeeper_node_1 starts up and the election takes place
>  * zookeeper_node_2 is elected as leader again
>  * zookeeper_node_2 is stopped and the container removed and created again
>  * zookeeper_node_3 is elected as the leader while zookeeper_node_2 is down
>  * zookeeper_node_2 starts up and zookeeper_node_3 remains leader
> And from there all servers just report
> 2020-05-07 14:07:57,187 [myid:3] - WARN  
> [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception2020-05-07 
> 14:07:57,187 [myid:3] - WARN  [NIOWorkerThread-2:NIOServerCnxn@364] - 
> Unexpected exceptionEndOfStreamException: Unable to read additional data from 
> client, it probably closed the socket: address = /z.z.z.z:46060, session = 
> 0x2014386bbde at 
> org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
>  at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326) at 
> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>  at 
> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
> Source)  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)  
> at java.base/java.lang.Thread.run(Unknown Source)
> and don't recover.
> I was able to recover the cluster in Test-Environment by stopping and 
> starting all the zookeeper-nodes. The cluster in dev is still in that state 
> and I'm checking the logs to find out more...
> The full logs of the deployment of Zookeeper and Kafka that started at 02:00 
> are attached. The first time in local NZ-time and the second one is UTC. the 
> IPs I replaced are x.x.x.x for node_1, y.y.y.y for node_2 and z.z.z.z for 
> node_3
> The Kafka-Servers are running on the same machine. Which means that the 
> EndOfStreamEceptions could also be connections from Kafka as I don't think 
> that zookeeper_node_3 establish a session with itself?
>  
> Edit:
>  I just found some interesting log from Test-Environment:
>  zookeeper_node_1: 2020-05-07 14:10:29,418 [myid:1] INFO  
> [NIOWorkerThread-6:ZooKeeperServer@1375] Refusing session request for client 
> /f.f.f.f:42012 as it has seen zxid 0xc6 our last zxid is 0xc528f8 
> client must try another server
>  zookeeper_node_2: 2020-05-07 14:10:29,680 [myid:2] INFO  
> [NIOWorkerThread-4:ZooKeeperServer@1375] Refusing session request for client 
> /f.f.f.f:51506 as it has seen zxid 0xc6 our last zxid is 0xc528f8 
> client must try another server
>  These entried are repeated there before the EndOfStreamException shows up...
>  I found that was set by zookeeper_node_3:
>  zookeeper_node_3: 2020-05-07 14:09:44,495 [myid:3] INFO  
> [QuorumPeer[myid=3](plain=0.0.0.0:2181)(secure=disabled):Leader@1501] Have 
> quorum of supporters, sids: [[1, 3],[1, 3]]; starting up and setting last 
> processed zxid: 0xc6
>  zookeeper_node_3: 2020-05-07 14:10:12,587 [myid:3] INFO  
> [LearnerHandler-/z.z.z.z:60156:LearnerHandler@800] Synchronizing with Learner 
> sid: 2 maxCommittedLog=0xc528f8 minCommittedLog=0xc52704 
> 

[jira] [Commented] (ZOOKEEPER-3822) Zookeeper 3.6.1 EndOfStreamException

2020-05-15 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108017#comment-17108017
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3822:
-

I haven't gone deep into the logs, but I see many errors in the zookeeper_test 
server logs like:

{code:java}
May 08 02:11:15 zookeeper_node_2: 2020-05-07 14:11:15,265 [myid:2] - INFO  
[NIOWorkerThread-2:ZooKeeperServer@1375] - Refusing session request for client 
/z2.z2.z2.z2:51826 as it has seen zxid 0xc6 our last zxid is 
0xc528f8 client must try another server
{code}

It indicates that the server (myid=2) hasn't caught up with the leader (myid=3) 
yet. I am not sure if it is a bug, or whether it is simply caused by traffic on 
the cluster while the restarts happened too quickly one after another.
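
For reference, the refusal itself is essentially a zxid comparison; here is a 
rough sketch of the idea (simplified names and hypothetical values, not the 
exact ZooKeeperServer code):
{code:java}
public class ZxidCheckSketch {

    // Simplified version of the idea behind the log line above (not the exact
    // ZooKeeperServer code): refuse the session if the client has already seen
    // a newer zxid than this server has processed, i.e. this server is behind.
    static boolean refuseSession(long clientLastZxidSeen, long serverLastProcessedZxid) {
        return clientLastZxidSeen > serverLastProcessedZxid;
    }

    public static void main(String[] args) {
        long clientLastZxidSeen = 0xc60000000L;   // hypothetical: what the client saw via the leader
        long serverLastProcessedZxid = 0xc528f8L; // hypothetical: this follower is still catching up
        if (refuseSession(clientLastZxidSeen, serverLastProcessedZxid)) {
            System.out.println("Refusing session request: client must try another server");
        }
    }
}
{code}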

Is this case reproducible?

> Zookeeper 3.6.1 EndOfStreamException
> 
>
> Key: ZOOKEEPER-3822
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3822
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.6.1
>Reporter: Sebastian Schmitz
>Priority: Critical
> Attachments: kafka.log, kafka_test.log, zookeeper.log, 
> zookeeper_test.log
>
>
> Hello,
> after Zookeeper 3.6.1 solved the issue with leader-election containing the IP 
> and so causing it to fail in separate networks, like in our docker-setup I 
> updated from 3.4.14 to 3.6.1 in Dev- and Test-Environments. It all went 
> smoothly and ran for one day. This night I had a new Update of the 
> environment as we deploy as a whole package of all containers (Kafka, 
> Zookeeper, Mirrormaker etc.) we also replace the Zookeeper-Containers with 
> latest ones. In this case, there was no change, the containers were just 
> removed and deployed again. As the config and data of zookeeper is not stored 
> inside the containers that's not a problem but this night it broke the whole 
> clusters of Zookeeper and so also Kafka was down.
>  * zookeeper_node_1 was stopped and the container removed and created again
>  * zookeeper_node_1 starts up and the election takes place
>  * zookeeper_node_2 is elected as leader again
>  * zookeeper_node_2 is stopped and the container removed and created again
>  * zookeeper_node_3 is elected as the leader while zookeeper_node_2 is down
>  * zookeeper_node_2 starts up and zookeeper_node_3 remains leader
> And from there all servers just report
> 2020-05-07 14:07:57,187 [myid:3] - WARN  
> [NIOWorkerThread-2:NIOServerCnxn@364] - Unexpected exception2020-05-07 
> 14:07:57,187 [myid:3] - WARN  [NIOWorkerThread-2:NIOServerCnxn@364] - 
> Unexpected exceptionEndOfStreamException: Unable to read additional data from 
> client, it probably closed the socket: address = /z.z.z.z:46060, session = 
> 0x2014386bbde at 
> org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
>  at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326) at 
> org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>  at 
> org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
> Source)  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)  
> at java.base/java.lang.Thread.run(Unknown Source)
> and don't recover.
> I was able to recover the cluster in Test-Environment by stopping and 
> starting all the zookeeper-nodes. The cluster in dev is still in that state 
> and I'm checking the logs to find out more...
> The full logs of the deployment of Zookeeper and Kafka that started at 02:00 
> are attached. The first time in local NZ-time and the second one is UTC. the 
> IPs I replaced are x.x.x.x for node_1, y.y.y.y for node_2 and z.z.z.z for 
> node_3
> The Kafka-Servers are running on the same machine. Which means that the 
> EndOfStreamEceptions could also be connections from Kafka as I don't think 
> that zookeeper_node_3 establish a session with itself?
>  
> Edit:
>  I just found some interesting log from Test-Environment:
>  zookeeper_node_1: 2020-05-07 14:10:29,418 [myid:1] INFO  
> [NIOWorkerThread-6:ZooKeeperServer@1375] Refusing session request for client 
> /f.f.f.f:42012 as it has seen zxid 0xc6 our last zxid is 0xc528f8 
> client must try another server
>  zookeeper_node_2: 2020-05-07 14:10:29,680 [myid:2] INFO  
> [NIOWorkerThread-4:ZooKeeperServer@1375] Refusing session request for client 
> /f.f.f.f:51506 as it has seen zxid 0xc6 our last zxid is 0xc528f8 
> client must try another server
>  These entried are repeated there before the EndOfStreamException shows up...
>  I found that was set by zookeeper_node_3:
>  zookeeper_node_3: 2020-05-07 14:09:44,495 [myid:3] INFO  

[jira] [Resolved] (ZOOKEEPER-3690) Improving leader efficiency via not processing learner's requests in commit processor

2020-05-15 Thread Enrico Olivelli (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Olivelli resolved ZOOKEEPER-3690.

Fix Version/s: 3.7.0
   Resolution: Fixed

Committed to the master branch.
Thank you [~lvfangmin]

> Improving leader efficiency via not processing learner's requests in commit 
> processor
> -
>
> Key: ZOOKEEPER-3690
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3690
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Fangmin Lv
>Assignee: Fangmin Lv
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.7.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Currently, all the requests forwarded from learners will be processed like 
> the locally received requests from leader's clients, which is non-trivial 
> effort and not necessary to process those in CommitProcessor with session 
> queue create/destroy
> To improve the efficiency, we could skip processing those requests in 
> leader's commit processor. Based on the benchmark, this optimization improved 
> around 30% maximum write throughput for read/write mixed workload.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)