[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671879#comment-15671879 ]
Esteban Gutierrez commented on HBASE-9802:
------------------------------------------

[~tobe] do you still have plans to contribute this back?

> A new failover test framework for HBase
> ---------------------------------------
>
>                 Key: HBASE-9802
>                 URL: https://issues.apache.org/jira/browse/HBASE-9802
>             Project: HBase
>          Issue Type: Improvement
>          Components: test
>    Affects Versions: 0.94.3
>            Reporter: chendihao
>            Assignee: chendihao
>            Priority: Minor
>
> Currently HBase uses ChaosMonkey for integration tests and fault injection.
> It restarts region servers, forces the balancer to run, and performs other
> actions randomly and periodically. However, we need a more extensible and
> full-featured framework for our failover tests, and we find that ChaosMonkey
> can't suit our needs because it has the following drawbacks.
> 1) Only process-level actions can be simulated; machine-level,
> hardware-level, and network-level actions are not supported.
> 2) There is no data validation before and after the test, so fatal bugs,
> such as those that cause data inconsistency, may be overlooked.
> 3) When a failure occurs, we can't reproduce the problem, and it is hard to
> figure out the cause.
> Therefore, we have developed a new framework to satisfy the needs of failover
> testing. We extended ChaosMonkey and implemented functions to validate data
> and to replay failed actions. Here are the features we added.
> 1) Policy/Task/Action abstraction. Separating Task from Policy and Action
> makes it easier to manage and replay a set of actions.
> 2) Configurable actions. We have implemented some actions that cause
> machine failure and gave them the same interface as the original actions.
> 3) Data consistency is validated before and after the failover test to
> ensure availability and data correctness.
> 4) After performing a set of actions, we also check the consistency of the
> table.
> 5) The set of actions that caused a test failure can be replayed, and this
> reproducibility helps in fixing the exposed bugs.
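
For readers following the thread, the Policy/Task/Action split and the
validate-then-replay flow described above could look roughly like the sketch
below. This is only an illustration under stated assumptions, not code from
the HBASE-9802 patch or from ChaosMonkey: every name here (Action,
SimpleAction, Task, Validator, Policy) is hypothetical, and the validator is
a stand-in for a real table consistency check.

```java
import java.util.ArrayList;
import java.util.List;

// All names below are hypothetical and chosen for illustration only.

/** One fault to inject, e.g. restarting a region server or cutting a link. */
interface Action {
    String name();
    void perform();
}

/** Minimal Action backed by a Runnable, so faults can be configured freely. */
final class SimpleAction implements Action {
    private final String name;
    private final Runnable body;

    SimpleAction(String name, Runnable body) {
        this.name = name;
        this.body = body;
    }

    public String name() { return name; }
    public void perform() { body.run(); }
}

/** A named, ordered set of actions; keeping the order is what allows replay. */
final class Task {
    private final String name;
    private final List<Action> actions;

    Task(String name, List<Action> actions) {
        this.name = name;
        this.actions = new ArrayList<>(actions);
    }

    String name() { return name; }

    /** Performs every action in order and returns their names as a log. */
    List<String> run() {
        List<String> log = new ArrayList<>();
        for (Action a : actions) {
            a.perform();
            log.add(a.name());
        }
        return log;
    }
}

/** Stand-in for a data consistency check on the cluster. */
interface Validator {
    boolean isConsistent();
}

/** Runs tasks, validates before and after, and keeps failed tasks for replay. */
final class Policy {
    private final Validator validator;
    private final List<Task> failed = new ArrayList<>();

    Policy(Validator validator) { this.validator = validator; }

    /** Returns true if every task left the data consistent. */
    boolean execute(List<Task> tasks) {
        if (!validator.isConsistent()) {
            throw new IllegalStateException("data inconsistent before test");
        }
        for (Task t : tasks) {
            t.run();
            if (!validator.isConsistent()) {
                failed.add(t);   // remember the exact action set that broke things
                return false;    // stop: cluster state is no longer trustworthy
            }
        }
        return true;
    }

    /** Tasks whose actions caused a validation failure; call run() to replay. */
    List<Task> failedTasks() { return new ArrayList<>(failed); }
}
```

In this sketch a failed Task is replayed simply by calling run() on it again,
which repeats the same actions in the same order. Stopping at the first
failed validation is a design choice made here for simplicity; the framework
described in the issue may behave differently.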
> Our team has developed this framework and has been running it for a while.
> Some bugs were exposed and fixed by running this test framework. Moreover,
> we have a monitoring program that shows the progress of the failover test
> and makes sure our cluster is as stable as we want. We are now trying to
> make it more general and will open-source it later.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)