Alex Mirgorodskiy created ZOOKEEPER-3591:
--------------------------------------------
             Summary: Inconsistent resync with dynamic reconfig
                 Key: ZOOKEEPER-3591
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3591
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.5.5
            Reporter: Alex Mirgorodskiy
         Attachments: instance1.log.gz, instance6.log.gz

We've run into a problem where one of the zookeeper instances lost most of its data after its zk process was restarted. We suspect an interaction between dynamic reconfiguration and snapshot-based resync of that instance. Details and some amateurish analysis are below. We can also upload transaction logs, if need be.

We have a 6-instance ensemble running version 3.5.5 with 3 quorum members and 3 observers. One of the observers (Instance 6) saw its db shrink from 3162 znodes down to 10 after that instance restarted:

> 2019-10-13T16:44:19.060+0000 [.zk-monitor-0] Monitor command mntr: zk_version
> 3.5.5-afd10a8846b22a34c5a818034bb22e99dd44587b, built on 09/16/2019 18:31 GMT
> zk_znode_count 3162
> --
> 2019-10-13T16:48:32.713+0000 [.zk-monitor-0] Monitor command mntr: zk_version
> 3.5.5-afd10a8846b22a34c5a818034bb22e99dd44587b, built on 09/16/2019 18:31 GMT
> zk_znode_count 10

Contrast it with Instance 1, which was the leader at the time and whose znode_count remained stable around 3000:

> 2019-10-13T16:44:48.625+0000 [.zk-monitor-0] Monitor command mntr: zk_version
> 3.5.5-afd10a8846b22a34c5a818034bb22e99dd44587b, built on 09/16/2019 18:31 GMT
> zk_znode_count 3178
> --
> ...
> --
> 2019-10-13T16:48:48.731+0000 [.zk-monitor-0] Monitor command mntr: zk_version
> 3.5.5-afd10a8846b22a34c5a818034bb22e99dd44587b, built on 09/16/2019 18:31 GMT
> zk_znode_count 3223

It appears that the problem had happened 30 minutes earlier, when Instance 6 got resynced from the leader via the Snap method, yet skipped creating an on-disk snapshot.
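As an aside, the znode counts above come from the "mntr" four-letter-word command's tab-separated output. A small self-contained sketch of how such a count can be extracted (the sample string is a made-up, truncated stand-in for real mntr output; on a live server the text would come from something like `echo mntr | nc <host> 2181`):

```java
// Hypothetical helper, not part of ZooKeeper: pulls zk_znode_count out of
// captured "mntr" output, which is tab-separated key/value lines.
public class MntrCheck {
    static long znodeCount(String mntrOutput) {
        for (String line : mntrOutput.split("\n")) {
            String[] kv = line.split("\t");
            if (kv.length == 2 && kv[0].equals("zk_znode_count")) {
                return Long.parseLong(kv[1].trim());
            }
        }
        throw new IllegalArgumentException("zk_znode_count not found");
    }

    public static void main(String[] args) {
        // Truncated sample resembling Instance 6's post-restart output.
        String sample = "zk_version\t3.5.5-afd, built on 09/16/2019 18:31 GMT\n"
                      + "zk_znode_count\t10\n";
        System.out.println(znodeCount(sample)); // prints 10
    }
}
```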
The end result was that the in-memory state was fine, but there was only the primordial snapshot.0 on disk, and transaction logs only started after the missing snapshot:

$ ls -l version-2
> total 1766
> -rw-r--r-- 1 daautomation daautomation       1 Oct 13 09:14 acceptedEpoch
> -rw-r--r-- 1 daautomation daautomation       1 Oct 13 10:12 currentEpoch
> -rw-r--r-- 1 daautomation daautomation 2097168 Oct 13 09:44 log.6000002e0
> -rw-r--r-- 1 daautomation daautomation 1048592 Oct 13 10:09 log.600001f1b
> -rw-r--r-- 1 daautomation daautomation 4194320 Oct 13 12:16 log.600003310
> -rw-r--r-- 1 daautomation daautomation     770 Oct 13 09:14 snapshot.0

So the zk reboot wiped out most of the state.

Dynamic reconfig might be relevant here. Instance 6 started as an observer, got removed, and immediately re-added as a participant. Instance 2 went the other way, from participant to observer:

> 2019-10-13T16:14:19.323+0000 ZK reconfig: removing node 6
> 2019-10-13T16:14:19.359+0000 ZK reconfig: adding
> server.6=10.80.209.138:2888:3888:participant;0.0.0.0:2181
> 2019-10-13T16:14:19.399+0000 ZK reconfig: adding
> server.2=10.80.209.131:2888:3888:observer;0.0.0.0:2181

Looking at the logs, Instance 6 started and received a resync snapshot from the leader right before the dynamic reconfig:

> 2019-10-13T16:14:19.284+0000
> [.QuorumPeer[myid=6](plain=/0.0.0.0:2181)(secure=disabled)] Getting a
> snapshot from leader 0x6000002dd
> ...
> 2019-10-13T16:14:19.401+0000
> [.QuorumPeer[myid=6](plain=/0.0.0.0:2181)(secure=disabled)] Got zxid
> 0x6000002de expected 0x1

Had it processed the NEWLEADER packet afterwards, it would've persisted the snapshot locally. But there's no NEWLEADER message in the Instance 6 log.
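To make the suspected failure mode concrete, here is a toy simulation (NOT the real Learner/Observer code; packet names and ordering are assumptions based on the logs above): the learner loads the leader's snapshot into memory on SNAP, but only persists it to disk when it later processes NEWLEADER, so an exception thrown while handling an earlier packet leaves the disk empty:

```java
import java.util.List;

// Toy model of a learner's sync loop. If a reconfig-related exception fires
// before NEWLEADER is processed, the loop exits with the in-memory db loaded
// but no snapshot ever written to disk.
public class SyncSketch {
    enum Packet { SNAP, INFORMANDACTIVATE, NEWLEADER }

    static boolean inMemoryLoaded = false;
    static boolean snapshotOnDisk = false;

    static void syncWithLeader(List<Packet> packets, boolean reconfigChanges) {
        for (Packet p : packets) {
            switch (p) {
                case SNAP:
                    inMemoryLoaded = true;   // db restored in memory only
                    break;
                case INFORMANDACTIVATE:
                    if (reconfigChanges) {
                        // mirrors the "changes proposed in reconfig" exception
                        throw new RuntimeException("changes proposed in reconfig");
                    }
                    break;
                case NEWLEADER:
                    snapshotOnDisk = true;   // snapshot persisted here
                    break;
            }
        }
    }

    public static void main(String[] args) {
        List<Packet> stream =
            List.of(Packet.SNAP, Packet.INFORMANDACTIVATE, Packet.NEWLEADER);
        try {
            syncWithLeader(stream, true);
        } catch (RuntimeException e) {
            // learner drops out of the sync loop, as Instance 6 appears to have
        }
        System.out.println("inMemoryLoaded=" + inMemoryLoaded
            + " snapshotOnDisk=" + snapshotOnDisk);
    }
}
```

Under this model a subsequent restart replays from the last on-disk snapshot, which is exactly the primordial snapshot.0 seen in the directory listing.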
Instead, there's a "changes proposed in reconfig" exception, likely a result of the instance getting dynamically removed and re-added as a participant:

> 2019-10-13T16:14:19.467+0000
> [.QuorumPeer[myid=6](plain=/0.0.0.0:2181)(secure=disabled)] Becoming a
> non-voting participant
> 2019-10-13T16:14:19.467+0000
> [.QuorumPeer[myid=6](plain=/0.0.0.0:2181)(secure=disabled)] Exception when
> observing the leader
> java.lang.Exception: changes proposed in reconfig
> 	at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:506)
> 	at org.apache.zookeeper.server.quorum.Observer.observeLeader(Observer.java:74)
> 	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1258)

Perhaps the NEWLEADER packet was still in the socket, but sitting behind INFORMANDACTIVATE, whose exception prevented us from processing NEWLEADER?

Also, it may or may not be related, but this area got changed recently as part of https://issues.apache.org/jira/browse/ZOOKEEPER-3104.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)