[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108106#comment-17108106 ]
Mate Szalay-Beko commented on ZOOKEEPER-3829:
---------------------------------------------

I failed to reproduce your case. I created docker compose files ([https://github.com/symat/zookeeper-docker-test]) and, using 3.5.6, I executed these steps:
 * start A,B,C with config (A,B,C)
 * start D with config (A,B,C,D)
 * stop A
 * start A with config (A,B,C,D)
 * stop B
 * start B with config (A,B,C,D)
 * stop C
 * start C with config (A,B,C,D)

At the end, everything worked fine for me: the leader was D, all nodes were up and formed a quorum (A,B,C,D), and zkCli worked ({{ls /}}).

There must be some difference between your reproduction and mine. Can you please share your zoo.cfg? Mine looks like this:
{code:java}
dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=5
syncLimit=2
autopurge.snapRetainCount=3
autopurge.purgeInterval=0
maxClientCnxns=60
standaloneEnabled=true
admin.enableServer=true
localSessionsEnabled=true
localSessionsUpgradingEnabled=true
4lw.commands.whitelist=stat, ruok, conf, isro, wchc, wchp, srvr, mntr, cons
clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory

# server host/port config in my case (when I have 4 nodes)
server.1=zoo1:2888:3888;2181
server.2=zoo2:2888:3888;2181
server.3=zoo3:2888:3888;2181
server.4=zoo4:2888:3888;2181
{code}

I checked the log file you uploaded, but I don't really see why you think the problem is with the CommitProcessor. Maybe I'm missing something. Is this the full log file from your D node?

I also checked the code. I think the {{CommitProcessor}} class should never be reused after {{shutdown()}} is called. After a new leader election, a new {{LeaderZooKeeperServer}} / {{FollowerZooKeeperServer}} / {{ObserverZooKeeperServer}} object is created (depending on the role of the given server), with a fresh {{CommitProcessor}} and a new {{workerPool}}. So AFAICT (based only on a high-level look at the code) it shouldn't really matter whether {{workerPool}} is set to null in the shutdown method. But maybe I just don't follow your reasoning, or I missed something in the code. Feel free to create a PR, then we can see what you suggest. Have you already tried your proposed fix and seen that it solves your original issue? (For illustration, a small standalone sketch of the pool-reuse pitfall follows the quoted issue below.)

> Zookeeper refuses request after node expansion
> ----------------------------------------------
>
>                 Key: ZOOKEEPER-3829
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.6
>            Reporter: benwang li
>            Priority: Major
>         Attachments: d.log
>
>
> It's easy to reproduce this bug.
> {code:java}
> // code placeholder
>
> Step 1. Deploy 3 nodes A,B,C with configuration A,B,C.
> Step 2. Deploy node D with configuration A,B,C,D; the cluster state is OK now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D. The leader will then be D and the cluster hangs: it still accepts the `mntr` command, but other commands like `ls /` are blocked.
> Step 4. Restart node D; the cluster state is back to normal.
>
> {code}
>
> We have looked into the code of the 3.5.6 version, and we found it may be an issue with the `workerPool`.
> The `CommitProcessor` shuts down and shuts down the `workerPool`, but the `workerPool` reference still exists. It will never work anymore, yet the cluster still thinks it's OK.
>
> I think the bug may still exist on the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If that's OK, please assign this issue to me, and then I'll create a PR.
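For illustration only (not ZooKeeper code): a minimal, self-contained Java sketch of the pool-reuse pitfall discussed above. The class and method names here are made up; the sketch only shows that an {{ExecutorService}} that has been shut down rejects all new work, so a component that can be restarted after {{shutdown()}} must either rebuild its pool (e.g. by nulling it in {{shutdown()}} and recreating it in {{start()}}, roughly what the reporter proposes for {{workerPool}}) or must never be reused.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Standalone sketch, not ZooKeeper's CommitProcessor: a processor-like class
// whose worker pool must be recreated if the class is restarted after shutdown.
public class RestartablePool {

    private ExecutorService workerPool;

    public synchronized void start(int numThreads) {
        // Rebuild the pool if shutdown() nulled it out; reusing the old,
        // already-terminated pool would reject every submitted task.
        if (workerPool == null) {
            workerPool = Executors.newFixedThreadPool(numThreads);
        }
    }

    public synchronized void submit(Runnable task) {
        workerPool.submit(task);
    }

    public synchronized void shutdown() {
        if (workerPool != null) {
            workerPool.shutdown();
            workerPool = null; // the proposed reset, so a later start() builds a fresh pool
        }
    }

    public static void main(String[] args) {
        RestartablePool p = new RestartablePool();
        p.start(2);
        p.submit(() -> System.out.println("first run"));
        p.shutdown();
        p.start(2); // without the reset above, the next submit() would throw RejectedExecutionException
        p.submit(() -> System.out.println("second run"));
        p.shutdown();
    }
}
{code}
Whether this scenario can actually occur in the server (i.e. whether a {{CommitProcessor}} is ever restarted after being shut down) is exactly the open question in the comment above; the sketch only illustrates why the reporter's reset would matter if it can.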