Andreas Weber created ZOOKEEPER-4444:
----------------------------------------

             Summary: Follower doesn't get synchronized after process restart
                 Key: ZOOKEEPER-4444
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4444
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.6.3
            Reporter: Andreas Weber


Hi folks, I've got an issue with 3.6.3.
I can't provide a simple test, because it seems to depend on timing in a 
cluster environment, but I tried to reduce the scenario as far as possible:
 * ZooKeeper cluster with 5 nodes, all of them voting participants (no Observers)
 * start some parallel threads which do some writes in the cluster (e.g. 
create/delete znodes)
 * kill one of the zookeeper processes, let's say on node X (where node X is 
not the Leader)
 * restart zookeeper process on node X
 * wait a few seconds
 * kill zookeeper process on node X again
 * restart zookeeper process on node X again

In some cases (roughly one in every 3-4 runs) I see the following in the log of node X:

After first restart of node X:
{noformat}
 WARN  persistence.FileTxnLog           - Current zxid 4294968525 is <= 4294969524 for 15
 WARN  persistence.FileTxnLog           - Current zxid 4294968526 is <= 4294969524 for 15
 WARN  persistence.FileTxnLog           - Current zxid 4294968527 is <= 4294969524 for 15
 ... (this kind of WARN is repeated some hundred times)
 WARN  quorum.SendAckRequestProcessor   - Closing connection to leader, exception during packet send java.net.SocketException: Socket closed ...
 ... (this kind of WARN is repeated some hundred times)
{noformat}
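For anyone reading the zxids in these logs: a zxid is a 64-bit number whose high 32 bits are the leader epoch and whose low 32 bits are a per-epoch counter. The following is a minimal sketch for decoding the values quoted in this report; the class and method names are made up for illustration and are not part of ZooKeeper.

```java
public class ZxidDecode {
    // High 32 bits of a zxid: the leader epoch.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // Low 32 bits of a zxid: the transaction counter within that epoch.
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Human-readable form, e.g. for comparing two zxids from a log.
    static String describe(long zxid) {
        return String.format("0x%x (epoch %d, counter %d)",
                zxid, epoch(zxid), counter(zxid));
    }

    public static void main(String[] args) {
        // zxids copied from the WARN and ERROR lines in this report
        long[] zxids = {4294968525L, 4294969524L, 4294970146L, 4294969147L};
        for (long z : zxids) {
            System.out.println(z + " = " + describe(z));
        }
    }
}
```

Decoded this way, all zxids in the logs belong to epoch 1. The first WARN compares counter 1229 against 2228, so node X is being asked to append transactions that are older than what its log already contains.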
After second restart of node X:
{noformat}
 ERROR persistence.FileTxnSnapLog       - 4294970146(highestZxid) > 4294969147(next log) for type 2
 WARN  server.DataTree                  - Message:Digests are not matching. Value is Zxid. Value:4294969147
 ERROR server.DataTree                  - First digest mismatch on txn: 360466402305310720,3870,4294969147,1639258399998,2
, ...
, expected digest is 2,1365261838770
, actual digest is 1098406565142, 
 ERROR persistence.FileTxnSnapLog       - 4294970146(highestZxid) > 4294969148(next log) for type 2
 ERROR persistence.FileTxnSnapLog       - 4294970146(highestZxid) > 4294969149(next log) for type 5
 ERROR persistence.FileTxnSnapLog       - 4294970146(highestZxid) > 4294969150(next log) for type 2
 ... (this kind of ERROR is repeated some hundred times)
{noformat}
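The comma-separated values after "First digest mismatch on txn:" look like a printed transaction header. Below is a rough sketch for making such a line readable, assuming the field order is clientId, cxid, zxid, time, type and that the type codes follow ZooKeeper's OpCode numbering (1 = create, 2 = delete, 5 = setData, 15 = create2). The class and its methods are hypothetical helpers, not ZooKeeper API.

```java
import java.time.Instant;

public class TxnHeaderLine {
    // Maps the few type codes seen in this report to names (assumed OpCode values).
    static String opName(int type) {
        switch (type) {
            case 1:  return "create";
            case 2:  return "delete";
            case 5:  return "setData";
            case 15: return "create2";
            default: return "type " + type;
        }
    }

    // Parses "clientId,cxid,zxid,time,type" (assumed order) into a readable string.
    static String describe(String csv) {
        String[] f = csv.trim().split(",");
        long clientId = Long.parseLong(f[0]);
        int cxid = Integer.parseInt(f[1]);
        long zxid = Long.parseLong(f[2]);
        Instant time = Instant.ofEpochMilli(Long.parseLong(f[3]));
        int type = Integer.parseInt(f[4]);
        return String.format("session=0x%x cxid=%d zxid=0x%x time=%s op=%s",
                clientId, cxid, zxid, time, opName(type));
    }

    public static void main(String[] args) {
        // Header copied from the digest mismatch ERROR in this report
        System.out.println(
                describe("360466402305310720,3870,4294969147,1639258399998,2"));
    }
}
```

Under these assumptions, the first mismatching transaction is a delete, and the type 2 / type 5 codes in the FileTxnSnapLog errors are deletes and setData calls, which matches the create/delete workload described above.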
Afterwards (in the actual application), ZooKeeper on node X seems to have a 
different view of the cluster state and doesn't get synchronized, at least for 
a few hours.
This leads, for example, to phantom reads of znodes that were deleted a long 
time ago.
(The resulting behaviour looks somewhat similar to what is described in 
ZOOKEEPER-3911.)

This does not happen with ZooKeeper 3.6.2
(at least I can't reproduce it with that version).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
