[ https://issues.apache.org/jira/browse/ZOOKEEPER-4541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608752#comment-17608752 ]

Jon Marius Venstad commented on ZOOKEEPER-4541:
-----------------------------------------------

After some thorough investigation, we have a good idea of what has happened. 
There are two faults in the ZK codebase, both of which we attempt to fix in 
[https://github.com/apache/zookeeper/pull/1925]. The details of our findings 
are as follows:

 

First, some background: we do rolling restarts of embedded ZK clusters of size 
3, roughly 50 times a day, across many data centres. 

A few times a week, we find a ZK server with a different view of the world 
than the other servers in the cluster. 

As far as we could ascertain, this always happens as a result of the leader 
restarting, that is, the enclosing JVM shuts down (cleanly, and we call 
{{{}QuorumPeer.shutdown(){}}}) and starts up again a few minutes later. 

Looking at the data directories of the servers after a recent episode, we found 
two interesting things, of which the second is the critical one:

Transaction logs had segments of transactions that were repeated, both 2 and 3 
times. This was found on all servers, but it seems to be harmless, despite the 
ERROR logging it produces when the data directory is loaded on ZK server start. 
It causes a digest mismatch, which isn't acted upon, and it becomes invisible 
once a new snapshot is taken. It did mask the real error, though: we were 
looking for digest mismatches, and the digest had already failed earlier in the 
log than the point of the second finding: 

_The transaction log_ on the server that did _not_ agree with the others (in 
particular, it typically insists there are ephemeral nodes present, but for 
sessions which died a long time ago!) was _missing entries at the end of one of 
the log files!_ The transaction log in question was the active one at the end 
of the epoch of the leader which restarted.

We've also noticed this error is more prevalent on machines with slower disks, 
where we sometimes see complaints from ZK about slow fsyncing. 
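
For reference, the slow-fsync complaints mentioned above are the warnings the 
transaction log prints when an fsync exceeds its warning threshold. A minimal 
sketch of lowering that threshold to make slow disks show up sooner, assuming 
the {{zookeeper.fsync.warningthresholdms}} system property (default 1000 ms) of 
recent ZK versions:

{code:java}
// Sketch only: lowers the fsync warning threshold so that slow disks show up
// in the server log sooner. The property name is our assumption about recent
// ZooKeeper versions (zookeeper.fsync.warningthresholdms, default 1000 ms);
// it must be set before the transaction log is initialised.
public class LowerFsyncWarningThreshold {
    public static void main(String[] args) {
        System.setProperty("zookeeper.fsync.warningthresholdms", "100");
        System.out.println("fsync warning threshold: "
                + System.getProperty("zookeeper.fsync.warningthresholdms") + " ms");
        // ... start the (embedded) ZooKeeper server after this point ...
    }
}
{code}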

 

Putting this all together, we finally discovered that the following happens:

 

ZK3 is leader, ZK1 and ZK2 are followers. X < Z are transaction IDs. 

ZK1 is busy fsyncing transactions <= X.

ZK3 proposes Z, and ZK1 and ZK2 receive the proposal.

ZK1 enqueues the proposal, but does not ACK it. ZK2 ACKs the proposal. 

ZK3 observes quorum for Z, and sends a COMMIT to ZK1 and ZK2. 

ZK1 applies Z to its data tree, but it is still not processed by the 
SyncRequestProcessor, which is busy fsyncing. 

ZK3 shuts down, closing the socket to ZK1 and ZK2.

ZK1 stops following the dead leader, and shuts down its {{{}Follower{}}}, to 
prepare for a new leader election and epoch.

ZK1 joins a new leader election, using Z as its {{{}lastSeenZxid{}}}, because 
that was COMMITTed, and applied to its data tree. 

ZK1 completes fsync up to X, and then proceeds to ACK those transactions, but 
the socket is gone, and the sync thread dies from an unexpected NPE. The 
remaining transaction Z is not written to disk. 

ZK1 and ZK2 agree that Z is the newest transaction, ZK2 becomes the new leader, 
and ZK3 eventually rejoins as a follower. 

ZK1's data tree is correct, and new transactions come in and are written to 
the transaction log of the new epoch. 

ZK1 restarts. When it comes back up, Z is missing from its transaction log, and 
its data is inconsistent with that of the other servers. It seems we never hit 
SNAP syncs, unless we intervene, so it stays this way indefinitely. 
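
To make the consequence of that interleaving concrete, here is a toy model 
(hypothetical names, not ZooKeeper code) of the divergence: the zxid ZK1 
advertises in the election comes from its in-memory data tree, while durability 
only covers what the sync thread managed to fsync before it died:

{code:java}
// Toy model of the race above, with hypothetical names; not ZooKeeper code.
class FollowerState {
    long lastLoggedZxid;     // highest zxid fsynced to the transaction log
    long lastProcessedZxid;  // highest zxid applied to the in-memory data tree

    void fsyncUpTo(long zxid)   { lastLoggedZxid = Math.max(lastLoggedZxid, zxid); }
    void applyCommit(long zxid) { lastProcessedZxid = Math.max(lastProcessedZxid, zxid); }

    // What the follower advertises when a new leader election starts.
    long electionZxid() { return lastProcessedZxid; }
}

public class ZxidDivergenceDemo {
    public static void main(String[] args) {
        long x = 10, z = 11;          // X < Z, as in the scenario above
        FollowerState zk1 = new FollowerState();

        zk1.applyCommit(z);           // COMMIT for Z is applied to the data tree...
        zk1.fsyncUpTo(x);             // ...but the sync thread only fsynced up to X
                                      // before dying on the closed leader socket.

        System.out.println("advertised in election: " + zk1.electionZxid()); // 11 (Z)
        System.out.println("durable on disk:        " + zk1.lastLoggedZxid); // 10 (X)
        // After a restart, ZK1 replays only up to X: Z is silently missing.
    }
}
{code}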

 

So, expecting the socket of the learner to possibly be {{{}null{}}} in 
{{{}SendAckRequestProcessor{}}} is enough to fix this problem in most cases, 
because that seems to be the common reason for the sync thread to die; had it 
lived on, it would eventually have synced Z as well. Nevertheless, there would 
still be a thread leak, and there could be other failure scenarios that I'm not 
aware of, both now and certainly later. 
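
For illustration, here is a minimal sketch of that guard, with hypothetical 
names rather than the actual {{SendAckRequestProcessor}} internals (the real 
change is in the PR above): treat a missing or closed leader socket as "leader 
already gone", instead of letting the ACK path dereference it and kill the sync 
thread:

{code:java}
import java.io.IOException;
import java.net.Socket;

// Sketch only, with hypothetical names; the actual fix lives in
// SendAckRequestProcessor (see the PR above).
class AckFlusher {
    // Stand-in for the follower's connection to the leader; null once the Follower has shut down.
    private volatile Socket leaderSocket;

    AckFlusher(Socket leaderSocket) {
        this.leaderSocket = leaderSocket;
    }

    // Called from the follower's shutdown path when the leader connection is torn down.
    void onFollowerShutdown() {
        leaderSocket = null;
    }

    // Flush pending ACKs; a missing or closed socket means the leader is gone,
    // which should be logged and ignored rather than blowing up with an NPE.
    void flushAcks() throws IOException {
        Socket sock = leaderSocket;
        if (sock == null || sock.isClosed()) {
            System.err.println("Not sending ACKs, connection to the leader is already gone");
            return;
        }
        sock.getOutputStream().flush(); // the real processor serialises ACK packets here
    }
}
{code}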

The attempted fix therefore also refactors the shutdown for child classes of 
{{{}ZooKeeperServer{}}}, such that they all override the {{shutdown(boolean)}} 
method, which is ultimately the one that is called. I also moved the shutdown 
(and sync) of the {{SyncRequestProcessor}} to before the shutdown of the 
parent {{{}ZooKeeperServer{}}}, because of the {{fastForwardFromEdits()}} 
that's called in the latter in an attempt to be up to date when the leader 
election starts; obviously, this makes more sense if the pending transactions 
are actually written first. I believe the missing synchronisation with the 
persistence thread is also what caused the duplicate series of transactions in 
the transaction logs: in-flight transactions (not yet committed) could be 
written to the log by the {{SyncRequestProcessor}} in parallel with a leader 
election using an older Zxid than the tail of the log, and the resulting DIFF 
from a more up-to-date leader would _also_ contain those same entries that were 
just written. This wouldn't even be noticeable, because the data tree of the 
new follower wouldn't have the duplicate entries in it, and so wouldn't 
complain when they were applied; the transaction log itself does not detect 
this either. It would only be detectable upon database recreation. 
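
Purely to illustrate the ordering argument (hypothetical interfaces, not the 
actual {{ZooKeeperServer}} hierarchy): pending transactions must be flushed 
before anything reads the log tail to decide what to advertise in the election:

{code:java}
// Sketch of the intended shutdown ordering, with hypothetical interfaces;
// the real change refactors shutdown(boolean) in the ZooKeeperServer subclasses.
class OrderedShutdown {

    interface SyncProcessor {
        // Drains and fsyncs every queued transaction, then stops the sync thread.
        void shutdown();
    }

    interface TxnLogView {
        // Replays the on-disk log and returns the highest zxid found (cf. fastForwardFromEdits()).
        long highestLoggedZxid();
    }

    private final SyncProcessor syncProcessor;
    private final TxnLogView txnLog;

    OrderedShutdown(SyncProcessor syncProcessor, TxnLogView txnLog) {
        this.syncProcessor = syncProcessor;
        this.txnLog = txnLog;
    }

    // Flush first, then look at the log tail: only then does the zxid used for
    // the upcoming leader election reflect everything the server has acknowledged.
    long shutdownAndGetElectionZxid() {
        syncProcessor.shutdown();
        return txnLog.highestLoggedZxid();
    }
}
{code}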

 

 

Anyway, long story. I hope you will strongly consider this as a candidate for 
the next release. We'll be running a patched ZK in the meantime. But I'd be 
surprised if we're the only ones seeing this. All it takes is a few servers, 
some automation, and some slow disks :) 

 

> Ephemeral znode owned by closed session visible in 1 of 3 servers
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4541
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4541
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>            Reporter: Håkon Hallingstad
>            Priority: Major
>
> We have a ZooKeeper 3.7.0 ensemble with servers 1-3. Using zkCli.sh we saw 
> the following znode on server 1:
> {code:java}
> stat 
> /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> cZxid = 0xa240000381f
> ctime = Tue May 10 17:17:27 UTC 2022
> mZxid = 0xa240000381f
> mtime = Tue May 10 17:17:27 UTC 2022
> pZxid = 0xa240000381f
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x20006d5d6a10000
> dataLength = 14
> numChildren = 0
> {code}
> This znode was absent on server 2 and 3. A delete on the node failed 
> everywhere.
> {code:java}
> delete 
> /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> Node does not exist: 
> /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> {code}
> This makes sense, as a mutating operation goes to the leader, which was server 
> 3, where the node is absent. Restarting server 1 did not fix the issue. 
> Stopping server 1, removing the zookeeper database and version-2 directory, 
> and starting the server fixed the issue.
> Session 0x20006d5d6a10000 was created by a ZooKeeper client and initially 
> connected to server 2. The client later closed the session at around the same 
> time as the then-leader server 2 was stopped:
> {code:java}
> 2022-05-10 17:17:28.141 I - cfg2 Submitting global closeSession request for 
> session 0x20006d5d6a10000 (ZooKeeperServer)
> 2022-05-10 17:17:28.145 I - cfg2 Session: 0x20006d5d6a10000 closed (ZooKeeper)
> {code}
> That both were closed/shutdown at the same time is something we do on our 
> side. Perhaps there is a race in the handling of closing of sessions and the 
> shutdown of the leader?
> We did a {{dump}} of servers 1-3; server 1 had two sessions that did not 
> exist on servers 2 and 3, and one ephemeral node was reported on server 1 that 
> was not reported on servers 2 and 3. The ephemeral owner matched one of the 
> two extraneous sessions. The dump from server 1 with the extra entries:
> {code:java}
> Global Sessions(8):
> ...
> 0x20006d5d6a10000 120000ms
> 0x2004360ecca0000 120000ms
> ...
> Sessions with Ephemerals (4):
> 0x20006d5d6a10000:
> /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
