Håkon Hallingstad created ZOOKEEPER-4541:
--------------------------------------------
Summary: Ephemeral znode owned by closed session visible in 1 of 3 servers
Key: ZOOKEEPER-4541
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4541
Project: ZooKeeper
Issue Type: Bug
Components: quorum, server
Reporter: Håkon Hallingstad
We have a ZooKeeper 3.7.0 ensemble with servers 1-3. Using zkCli.sh we saw the
following znode on server 1:
{code:java}
stat /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
cZxid = 0xa240000381f
ctime = Tue May 10 17:17:27 UTC 2022
mZxid = 0xa240000381f
mtime = Tue May 10 17:17:27 UTC 2022
pZxid = 0xa240000381f
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x20006d5d6a10000
dataLength = 14
numChildren = 0
{code}
This znode was absent on servers 2 and 3. Deleting the node failed everywhere:
{code:java}
delete /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
Node does not exist: /vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
{code}
This makes sense, as write operations are forwarded to the leader, which was
server 3 and where the node is absent. Restarting server 1 did not fix the
issue. Stopping server 1, removing its ZooKeeper database (the version-2
directory), and starting the server again did fix the issue, presumably because
the server then synced a fresh copy of the data tree from the leader.
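For reference, the divergence can be confirmed programmatically with a sketch
like the one below (the cfg1-cfg3 hostnames and the 30s session timeout are
placeholders for our setup). Reads are served from the local data tree of the
server the client is connected to, so pinning a client to one server at a time
exposes each server's individual view:
{code:java}
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class CompareZnode {
    // Placeholder hostnames for servers 1-3.
    static final String[] SERVERS = {"cfg1:2181", "cfg2:2181", "cfg3:2181"};
    static final String PATH = "/vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982";

    public static void main(String[] args) throws Exception {
        for (String server : SERVERS) {
            CountDownLatch connected = new CountDownLatch(1);
            // A single-host connect string pins the session to that server,
            // so the read below reflects that server's local data tree.
            ZooKeeper zk = new ZooKeeper(server, 30_000, event -> {
                if (event.getState() == KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();
            Stat stat = zk.exists(PATH, false); // null if the znode is absent
            System.out.println(server + ": " + (stat == null
                    ? "absent"
                    : "present, ephemeralOwner=0x"
                      + Long.toHexString(stat.getEphemeralOwner())));
            zk.close();
        }
    }
}
{code}
Against our ensemble this would report the znode as present on server 1 only.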
Session 0x20006d5d6a10000 was created by a ZooKeeper client that initially
connected to server 2 (consistent with the 0x2 prefix of the session ID, since
session IDs embed the id of the server that created them). The client later
closed the session at around the same time as the then-leader, server 2, was
stopped:
{code:java}
2022-05-10 17:17:28.141 I - cfg2 Submitting global closeSession request for session 0x20006d5d6a10000 (ZooKeeperServer)
2022-05-10 17:17:28.145 I - cfg2 Session: 0x20006d5d6a10000 closed (ZooKeeper)
{code}
Closing the session and shutting down the server at the same time is something
we do on our side. Perhaps there is a race between the handling of the
closeSession request and the shutdown of the leader?
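The client-side half of a reproduction attempt might look like the sketch below
(the path /close-race-test and the connect string are made up). The leader
would have to be stopped externally at the same moment, e.g. with zkServer.sh
stop on the leader host:
{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CloseSessionRace {
    public static void main(String[] args) throws Exception {
        // Made-up connect string; the ephemeral lives only as long as the session.
        ZooKeeper zk = new ZooKeeper("cfg1:2181,cfg2:2181,cfg3:2181", 30_000, event -> {});
        Thread.sleep(1000); // crude wait for the session to be established

        zk.create("/close-race-test", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Closing the session makes the server submit a closeSession txn that
        // should delete the ephemeral on every server. If the leader is
        // shutting down concurrently, the question is whether all servers
        // apply (or later catch up on) that txn.
        zk.close();

        // Afterwards, check exists("/close-race-test") against each server
        // individually, as in the earlier sketch; a node lingering on one
        // server would reproduce what we see.
    }
}
{code}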
We did a {{dump}} of servers 1-3. Server 1 had two sessions that did not exist
on servers 2 and 3, and it reported one ephemeral node that servers 2 and 3 did
not. The ephemeral owner matched one of the two extraneous sessions. The dump
from server 1, showing the extra entries:
{code:java}
Global Sessions(8):
...
0x20006d5d6a10000 120000ms
0x2004360ecca0000 120000ms
...
Sessions with Ephemerals (4):
0x20006d5d6a10000:
/vespa/host-status-service/hosted-vespa:zone-config-servers/lock2/_c_be5a2484-da16-45a6-9a98-4657924040e3-lock-0000024982
{code}
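To repeat the comparison, the dump output can be collected from all servers and
diffed with a sketch like this (hostnames are placeholders; on 3.5+ the
{{dump}} four-letter word must be allowed via 4lw.commands.whitelist, or the
AdminServer endpoint /commands/dump can be used instead):
{code:java}
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class DumpAll {
    public static void main(String[] args) throws Exception {
        for (String host : new String[] {"cfg1", "cfg2", "cfg3"}) { // placeholders
            try (Socket socket = new Socket(host, 2181)) {
                OutputStream out = socket.getOutputStream();
                out.write("dump".getBytes());
                out.flush();
                socket.shutdownOutput(); // signal that the 4lw request is complete
                System.out.println("=== " + host + " ===");
                InputStream in = socket.getInputStream();
                byte[] buf = new byte[4096];
                for (int n; (n = in.read(buf)) > 0; ) {
                    System.out.write(buf, 0, n);
                }
                System.out.flush();
            }
        }
    }
}
{code}
Diffing the "Global Sessions" and "Sessions with Ephemerals" sections of the
three outputs is how we found the extraneous sessions above.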