[ https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110357#comment-17110357 ]

Mate Szalay-Beko commented on ZOOKEEPER-3829:
---------------------------------------------

[~keliwang] this is a very nice catch! I was also validating your finding from 
the other direction: if I add {{localSessionsEnabled=true}} to the config just 
sent by [~sundyli], zkCli does not hang (while with his config as-is I 
reproduced the original issue). So having this setting was why I was unable to 
reproduce the issue in the first place.

{{localSessionsEnabled=true}} matters only because when local sessions are 
enabled, the client is able to connect without having its global session ID 
committed. The basic problem is indeed with this log line, as you wrote in 
ZOOKEEPER-3830:

{code:java}
2020-05-18 14:08:07,051 [myid:4] - INFO  
[QuorumPeer[myid=4](plain=/0.0.0.0:2181)(secure=disabled):Leader@1296] - Have 
quorum of supporters, sids: [ [4, 1, 3],[1, 3] ]; starting up and setting last 
processed zxid: 0x400000000
{code}

As a result, the designated leader will not be the new leader ([see 
here|https://github.com/apache/zookeeper/blob/c11b7e26bc554b8523dc929761dd28808913f091/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L1310]),
 and {{allowedToCommit = false}} will be set a few lines later. But no new 
leader election will be started, because dynamic reconfig is not enabled.

So I think the solution would be to skip the whole {{designatedLeader}} check 
when dynamic reconfig is disabled.
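To illustrate the idea (a hypothetical sketch, not the actual {{Leader.java}} code; the class and method names here are made up for illustration), the decision could be modeled so that the designated-leader comparison only applies when dynamic reconfig is enabled:

{code:java}
public class DesignatedLeaderCheck {
    // Models the proposed behavior: with a static configuration, the peer
    // that just gathered a quorum keeps leadership and may commit; the
    // designated-leader comparison from lastSeenQuorumVerifier only
    // applies under dynamic reconfig.
    public static boolean allowedToCommit(boolean reconfigEnabled,
                                          long designatedLeaderId,
                                          long myId) {
        if (!reconfigEnabled) {
            return true; // skip the designatedLeader check entirely
        }
        return designatedLeaderId == myId;
    }

    public static void main(String[] args) {
        // myid=4 gathered the quorum, but a stale verifier designates
        // sid 1; with reconfig disabled the check is skipped:
        System.out.println(allowedToCommit(false, 1L, 4L)); // true
        // with reconfig enabled, the mismatch still blocks commits:
        System.out.println(allowedToCommit(true, 1L, 4L)); // false
    }
}
{code}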


[~sundyli], I added some debug logs to the commit processor and found that 
whenever the "{{Configuring CommitProcessor with XX worker threads}}" log is 
printed, we always create a new {{workerPool}}. I am not sure how resetting 
the {{workerPool}} to null would solve this issue. Are you sure that it helps 
in your case? Maybe we are chasing a different error here :) - The one I just 
reproduced with your zoo.cfg in docker compose seems to be unrelated to 
{{workerPool}} but related to {{lastSeenQuorumVerifier}} and dynamic reconfig.
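To make the workaround concrete (a minimal hypothetical sketch, using a plain {{ExecutorService}} in place of ZooKeeper's {{WorkerService}}; the class name is made up): resetting the pool field to null would only matter if a shut-down but non-null pool is otherwise reused on restart, like in this null-guard pattern:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerPoolSketch {
    private ExecutorService workerPool; // stands in for WorkerService

    public void start(int numWorkerThreads) {
        System.out.println("Configuring CommitProcessor with "
                + numWorkerThreads + " worker threads.");
        if (workerPool == null) { // a shut-down but non-null pool is reused
            workerPool = Executors.newFixedThreadPool(numWorkerThreads);
        }
    }

    public void shutdown(boolean resetToNull) {
        if (workerPool != null) {
            workerPool.shutdownNow();
            if (resetToNull) {
                workerPool = null; // the workaround: force re-creation
            }
        }
    }

    public boolean poolUsable() {
        return workerPool != null && !workerPool.isShutdown();
    }
}
{code}

If, as observed above, a fresh pool is always created anyway whenever the log line is printed, then the reset would be a no-op, which supports the suspicion that the real problem lies elsewhere.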

I will create a PR now with the proposed fix (skipping the 
{{designatedLeader}} check when dynamic reconfig is disabled), but first I 
need some time to check whether the same error affects the master branch and 
to see if I can add some unit tests for this.

> Zookeeper refuses request after node expansion
> ----------------------------------------------
>
>                 Key: ZOOKEEPER-3829
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.6
>            Reporter: benwang li
>            Priority: Major
>         Attachments: d.log, screenshot-1.png
>
>
> It's easy to reproduce this bug.
> {code:java}
> // code placeholder
>  
> Step 1. Deploy 3 nodes A, B, C with configuration A,B,C.
> Step 2. Deploy node `D` with configuration `A,B,C,D`; the cluster state is 
> OK now.
> Step 3. Restart nodes A, B, C with configuration A,B,C,D; the leader will 
> then be D and the cluster hangs. It still accepts the `mntr` command, but 
> other commands like `ls /` will block.
> Step 4. Restart node D; the cluster state is back to normal now.
>  
> {code}
>  
> We have looked into the code of the 3.5.6 version, and we found it may be 
> an issue with `workerPool`.
> The `CommitProcessor` shuts down and makes `workerPool` shut down, but 
> `workerPool` still exists. It will never work anymore, yet the cluster 
> still thinks it's OK.
>  
> I think the bug may still exist in master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> that's OK, please assign this issue to me, and then I'll create a PR.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
