Jize Ning created HBASE-29652:
---------------------------------

             Summary: Chaos testing in ZK mode does not work on hosts with 
ZNode persistence issue
                 Key: HBASE-29652
                 URL: https://issues.apache.org/jira/browse/HBASE-29652
             Project: HBase
          Issue Type: Bug
          Components: integration tests
    Affects Versions: 2.5.12, 2.6.3
            Reporter: Jize Ning


Chaos testing in ZK mode involves a client (ChaosZKClient) passing commands to 
agents (ChaosAgent). The agents execute the commands on their hosts to 
kill or restart HBase processes. If the ChaosAgent process on a worker node is 
restarted within a short time, it may fail to register itself, and it will then 
no longer receive any commands from the client. 

 

During chaos testing setup on a worker node, the ChaosAgent tries to 
register an ephemeral ZNode, and it does so only during initialization:
{code:java}
/hbase/chaosAgents/<hostname>{code}
The ChaosZKClient checks that this ZNode exists before passing commands to the 
agent. If the ZNode is deleted, the ChaosZKClient loses track of the agent 
and the agent no longer receives any commands. The same problem can occur 
when the ChaosAgent process is restarted on the same host: the ephemeral 
ZNode from the first session has not yet timed out, so the agent does not 
recreate it during the second initialization. When that ZNode eventually times 
out, the agent is orphaned without any exception being thrown. 

 

There is a simple fix: add a Watcher when creating the ephemeral ZNode so that 
the agent recreates the ZNode whenever it is deleted. This ensures that the 
chaos agent stays reachable from the ChaosZKClient throughout its lifecycle.
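
The proposed behaviour can be illustrated with a small, self-contained model. 
This is only a sketch: FakeZk and AgentRegistration are hypothetical names, and 
FakeZk is an in-memory stand-in rather than the ZooKeeper client API. In a real 
agent the same idea would presumably be a Watcher that reacts to NodeDeleted 
events by creating the node again with CreateMode.EPHEMERAL.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for ZooKeeper; NOT the real client API. */
class FakeZk {
    // path -> id of the session that owns the ephemeral node
    private final Map<String, Long> ephemerals = new HashMap<>();
    // path -> one-shot deletion watch (fires once, then must be set again)
    private final Map<String, Runnable> deleteWatches = new HashMap<>();

    /** Creates the ephemeral node if absent; returns false if it already exists. */
    boolean createEphemeral(String path, long sessionId) {
        return ephemerals.putIfAbsent(path, sessionId) == null;
    }

    /** Sets (or replaces) the one-shot deletion watch for a path. */
    void watchDeletion(String path, Runnable onDelete) {
        deleteWatches.put(path, onDelete);
    }

    boolean exists(String path) {
        return ephemerals.containsKey(path);
    }

    /** Simulates session expiry: drops that session's nodes, then fires their watches. */
    void expireSession(long sessionId) {
        List<String> removed = new ArrayList<>();
        for (Map.Entry<String, Long> e : ephemerals.entrySet()) {
            if (e.getValue() == sessionId) {
                removed.add(e.getKey());
            }
        }
        for (String path : removed) {
            ephemerals.remove(path);
            Runnable watch = deleteWatches.remove(path);
            if (watch != null) {
                watch.run();
            }
        }
    }
}

/** Hypothetical agent-side registration logic with re-registration on delete. */
class AgentRegistration {
    private final FakeZk zk;
    private final String path;
    private final long sessionId;
    int timesCreated = 0;

    AgentRegistration(FakeZk zk, String path, long sessionId) {
        this.zk = zk;
        this.path = path;
        this.sessionId = sessionId;
    }

    void register() {
        // If a stale node from an earlier session still exists, creation is skipped.
        if (zk.createEphemeral(path, sessionId)) {
            timesCreated++;
        }
        // Watch in both cases, so the node is recreated once the stale one expires.
        // The watch is one-shot, so each call to register() sets it again.
        zk.watchDeletion(path, this::register);
    }
}
```

In this model, an agent that starts while the ZNode of a previous session still 
exists does not create anything at first. Once expireSession removes the stale 
node, the watch fires, register() runs again, and the agent becomes visible 
under its own session.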



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
