chendihao created HBASE-9802:
--------------------------------

             Summary: A new failover test framework for HBase
                 Key: HBASE-9802
                 URL: https://issues.apache.org/jira/browse/HBASE-9802
             Project: HBase
          Issue Type: Improvement
          Components: test
    Affects Versions: 0.94.3
            Reporter: chendihao
            Priority: Minor


Currently HBase uses ChaosMonkey for integration tests and fault injection. It 
restarts regionservers, forces the balancer to run, and performs other actions 
randomly and periodically. However, we need a more extensible and full-featured 
framework for our failover tests, and we find that ChaosMonkey can't suit our 
needs because it has the following drawbacks.

1) Only process-level actions can be simulated; machine-level, hardware-level, 
and network-level actions are not supported.
2) There is no data validation before and after the test, so fatal bugs, such 
as those that cause data inconsistency, may be overlooked.
3) When a failure occurs, we can't reproduce the problem, and it is hard to 
figure out the cause.

Therefore, we have developed a new framework to satisfy the needs of failover 
testing. We extended ChaosMonkey and implemented functions to validate data and 
to replay failed actions. Here are the features we added.

1) Policy/Task/Action abstraction. Separating Task from Policy and Action makes 
it easier to manage and replay a set of actions.
2) Configurable actions. We have implemented some actions that cause machine 
failures, and they use the same interface as the original actions.
3) Data consistency is validated before and after the failover test to ensure 
availability and data correctness.
4) After performing a set of actions, we also check the consistency of the 
table.
5) The set of actions that caused a test failure can be replayed, and this 
reproducibility helps in fixing the exposed bugs.
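
To illustrate how a Task could group actions, validate consistency before and 
after, and keep enough information to replay a failing run, here is a minimal 
sketch. It is not the framework's actual code: the names FailoverSketch, 
Action, Validator, Step and Task are hypothetical, and the validator stands in 
for whatever data and table consistency checks the real framework performs.

```java
import java.util.ArrayList;
import java.util.List;

public class FailoverSketch {

    /** One fault-injection step, e.g. "kill a regionserver" (hypothetical interface). */
    interface Action {
        void perform() throws Exception;
    }

    /** Consistency check run before and after a task (hypothetical interface). */
    interface Validator {
        boolean isConsistent() throws Exception;
    }

    /** An action paired with a name so that a run can be logged and replayed. */
    static final class Step {
        final String name;
        final Action action;

        Step(String name, Action action) {
            this.name = name;
            this.action = action;
        }
    }

    /** An ordered set of steps; the same Task object can be run again to replay it. */
    static final class Task {
        private final List<Step> steps;
        private final List<String> executed = new ArrayList<>();

        Task(List<Step> steps) {
            this.steps = new ArrayList<>(steps);
        }

        /**
         * Validates, performs every step in order, then validates again.
         * Returns true only if both checks pass.
         */
        boolean run(Validator validator) throws Exception {
            executed.clear();
            if (!validator.isConsistent()) {
                return false; // cluster was already inconsistent before the test
            }
            for (Step step : steps) {
                executed.add(step.name);
                step.action.perform();
            }
            return validator.isConsistent();
        }

        /** Names of the steps performed by the most recent run, in order. */
        List<String> executedSteps() {
            return new ArrayList<>(executed);
        }
    }

    public static void main(String[] args) throws Exception {
        List<Step> steps = new ArrayList<>();
        steps.add(new Step("restart-regionserver", () -> { /* placeholder */ }));
        steps.add(new Step("force-balancer", () -> { /* placeholder */ }));
        Task task = new Task(steps);

        boolean passed = task.run(() -> true); // placeholder validator
        System.out.println("passed=" + passed + " steps=" + task.executedSteps());
    }
}
```

In this sketch a Policy would only decide which Task to run and when; because 
the Task owns the ordered steps, a failing Task can be handed back to run() 
unchanged to replay the same sequence.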

Our team has developed this framework and has been running it for a while. Some 
bugs were exposed and fixed by running this test framework. Moreover, we have a 
monitor program that shows the progress of the failover test and makes sure our 
cluster is as stable as we want. Now we are trying to make it more general and 
will open-source it later.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
