Jize Ning created HBASE-29652:
---------------------------------
Summary: Chaos testing in ZK mode does not work on hosts with
ZNode persistence issue
Key: HBASE-29652
URL: https://issues.apache.org/jira/browse/HBASE-29652
Project: HBase
Issue Type: Bug
Components: integration tests
Affects Versions: 2.5.12, 2.6.3
Reporter: Jize Ning
Chaos testing in ZK mode involves a client (ChaosZKClient) passing commands to
agents (ChaosAgent). The agents execute the commands on their hosts to
kill/restart HBase processes. If the ChaosAgent process on a worker node is
restarted within a short time, it may fail to register itself. The ChaosAgent
will then no longer receive any commands from the client.
During chaos testing setup on a worker node, the ChaosAgent tries to register
an ephemeral ZNode, and it does so only during initialization:
{code:java}
/hbase/chaosAgents/<hostname>{code}
The ChaosZKClient checks that this ZNode exists before passing commands to an
agent. If the ZNode is deleted, the ChaosZKClient loses track of the agent and
the agent no longer receives any commands. This can also happen when the
ChaosAgent process is restarted on the same host: the ephemeral ZNode from the
first session has not yet timed out, so the agent does not recreate it during
the second initialization. When that ZNode eventually times out, the agent
becomes an orphan without throwing any exception.
There is a simple fix: set a Watcher on the ephemeral ZNode when it is created
and recreate the ZNode whenever it gets deleted. This ensures that the chaos
agent stays reachable from the ChaosZKClient throughout its lifecycle.
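The restart scenario and the proposed fix can be illustrated with a minimal, self-contained Java sketch. It does not use the real ZooKeeper client or the actual ChaosAgent code: FakeZk, AgentSketch, and all of their methods are hypothetical stand-ins that only model ephemeral nodes, session expiry, and one-shot deletion watches. In a real agent the corresponding steps would presumably be a ZooKeeper.exists(path, watcher) call to set the watch, a NodeDeleted event delivered to the Watcher, and ZooKeeper.create(...) with CreateMode.EPHEMERAL to recreate the node.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for ZooKeeper; NOT the real client API. */
class FakeZk {
    // path -> id of the session that owns the ephemeral node
    private final Map<String, Long> owners = new HashMap<>();
    // path -> one-shot deletion watcher (real ZooKeeper watches also fire once)
    private final Map<String, Runnable> deleteWatchers = new HashMap<>();

    boolean exists(String path) {
        return owners.containsKey(path);
    }

    /** Creates an ephemeral node; returns false if the path is already taken. */
    boolean createEphemeral(String path, long sessionId) {
        return owners.putIfAbsent(path, sessionId) == null;
    }

    void watchDelete(String path, Runnable watcher) {
        deleteWatchers.put(path, watcher);
    }

    /** Expiring a session deletes its ephemeral nodes, then fires the watchers. */
    void expireSession(long sessionId) {
        List<String> removed = new ArrayList<>();
        owners.entrySet().removeIf(e -> {
            if (e.getValue() == sessionId) {
                removed.add(e.getKey());
                return true;
            }
            return false;
        });
        for (String path : removed) {
            Runnable watcher = deleteWatchers.remove(path);
            if (watcher != null) {
                watcher.run();
            }
        }
    }
}

/** Hypothetical agent model; names are illustrative, not taken from HBase. */
class AgentSketch {
    private final FakeZk zk;
    private final String path;
    private final long sessionId;

    AgentSketch(FakeZk zk, String hostname, long sessionId) {
        this.zk = zk;
        this.path = "/hbase/chaosAgents/" + hostname;
        this.sessionId = sessionId;
    }

    /** Behaviour described in the issue: register once, skip if the node exists. */
    void registerOnce() {
        if (!zk.exists(path)) {
            zk.createEphemeral(path, sessionId);
        }
    }

    /** Proposed fix: watch the node and re-register whenever it is deleted. */
    void registerWithWatcher() {
        zk.createEphemeral(path, sessionId); // no-op while a stale node exists
        zk.watchDelete(path, this::registerWithWatcher); // re-arm the one-shot watch
    }

    /** Simulates an agent restart on one host, then expiry of the old session. */
    static boolean reachableAfterRestart(boolean useWatcher) {
        FakeZk zk = new FakeZk();
        new AgentSketch(zk, "host1", 1L).registerOnce(); // first agent process
        AgentSketch restarted = new AgentSketch(zk, "host1", 2L);
        if (useWatcher) {
            restarted.registerWithWatcher();
        } else {
            restarted.registerOnce();
        }
        zk.expireSession(1L); // stale ephemeral node finally times out
        return zk.exists("/hbase/chaosAgents/host1");
    }

    public static void main(String[] args) {
        System.out.println("register once: " + reachableAfterRestart(false)); // false
        System.out.println("with watcher:  " + reachableAfterRestart(true));  // true
    }
}
```

In this model the register-once agent ends up without a ZNode after the old session expires, while the watcher-based agent recreates it under its own session. The sketch ignores details a real change would have to consider, such as the agent's own session expiring or watch events arriving on a separate thread.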