Andreas Weber created ZOOKEEPER-4444: ----------------------------------------
Summary: Follower doesn't get synchronized after process restart Key: ZOOKEEPER-4444 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4444 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.6.3 Reporter: Andreas Weber Hi folks, I've got an issue with 3.6.3. I can't provide a simple test, because it seems to depend on timing in a cluster environment, but I tried to reduce the scenario as far as possible: * zookeeper cluster with 5 nodes, all of them Followers (no Observers) * start some parallel threads which do some writes in the cluster (e.g. create/delete znodes) * kill one of the zookeeper processes, let's say on node X (where node X is not the Leader) * restart zookeeper process on node X * wait a few seconds * kill zookeeper process on node X again * restart zookeeper process on node X again In some cases (every 3-4 runs) I see the following in the log of node X: After first restart of node X: {noformat} WARN persistence.FileTxnLog - Current zxid 4294968525 is <= 4294969524 for 15 WARN persistence.FileTxnLog - Current zxid 4294968526 is <= 4294969524 for 15 WARN persistence.FileTxnLog - Current zxid 4294968527 is <= 4294969524 for 15 ... (this kind of WARN is repeated some hundred times) WARN quorum.SendAckRequestProcessor - Closing connection to leader, exception during packet send java.net.SocketException: Socket closed ... ... (this kind of WARN is repeated some hundred times) {noformat} After second restart of node X: {noformat} ERROR persistence.FileTxnSnapLog - 4294970146(highestZxid) > 4294969147(next log) for type 2 WARN server.DataTree - Message:Digests are not matching. Value is Zxid. Value:4294969147 ERROR server.DataTree - First digest mismatch on txn: 360466402305310720,3870,4294969147,1639258399998,2 , ... , expected digest is 2,1365261838770 , actual digest is 1098406565142, ERROR persistence.FileTxnSnapLog - 4294970146(highestZxid) > 4294969148(next log) for type 2 ERROR persistence.FileTxnSnapLog - 4294970146(highestZxid) > 4294969149(next log) for type 5 ERROR persistence.FileTxnSnapLog - 4294970146(highestZxid) > 4294969150(next log) for type 2 ... (this kind of ERROR is repeated some hundred times) {noformat} And afterwards (in the actual application), zookepeer on node X seems to have a different view of the cluster state and doesn't get synchronized, at least for a few hours. This e.g. leads to phantom reads of znodes that were deleted a long time ago. (The resulting behaviour looks a little bit similar as described in ZOOKEEPER-3911.) This does not happen with zookeeper 3.6.2 ! (at least I can't reproduce it with this version) -- This message was sent by Atlassian Jira (v8.20.1#820001)