[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799133#comment-13799133 ]
chendihao commented on HBASE-9802:
----------------------------------

Thanks for paying attention to our work. We are now trying to separate the HBase-specific parts from this framework so that it can be reused for HDFS, ZooKeeper and other HA services. As [~ste...@apache.org] said, we want to make it more generic and provide only an extensible framework, so that everyone can implement their own actions to inject failures into their systems. Thanks, [~elserj]; we will learn more about Accumulo.

Currently we use tc (traffic control) to simulate network delay, dd to fill the disk, and other tools to simulate network/disk/CPU/memory failures. It would be helpful if our test servers provided these interfaces. I think we can do this in a general way and share it with the community.

> A new failover test framework for HBase
> ---------------------------------------
>
>                 Key: HBASE-9802
>                 URL: https://issues.apache.org/jira/browse/HBASE-9802
>             Project: HBase
>          Issue Type: Improvement
>          Components: test
>    Affects Versions: 0.94.3
>            Reporter: chendihao
>            Priority: Minor
>
> Currently HBase uses ChaosMonkey for integration testing and fault injection.
> It restarts regionservers, forces the balancer to run and performs other
> actions randomly and periodically. However, we need a more extensible and
> full-featured framework for our failover tests, and we find that ChaosMonkey
> can't suit our needs because it has the following drawbacks.
> 1) Only process-level actions can be simulated; machine-level,
> hardware-level and network-level actions are not supported.
> 2) There is no data validation before and after the test, so fatal bugs, such
> as those that cause data inconsistency, may be overlooked.
> 3) When a failure occurs, we can't reproduce the problem and it is hard to
> figure out the cause.
> Therefore, we have developed a new framework to satisfy the needs of failover
> testing. We extended ChaosMonkey and implemented functions to validate data
> and to replay failed actions. Here are the features we added.
> 1) Policy/Task/Action abstraction: separating Task from Policy and Action
> makes it easier to manage and replay a set of actions.
> 2) Configurable actions. We have implemented some actions that cause
> machine failures, defined with the same interface as the original actions.
> 3) We validate data consistency before and after the failover test to
> ensure availability and data correctness.
> 4) After performing a set of actions, we also check the consistency of the
> table.
> 5) The set of actions that caused a test failure can be replayed, and this
> reproducibility helps in fixing the exposed bugs.
> Our team has developed this framework and has been running it for a while.
> Some bugs were exposed and fixed by running this test framework. Moreover, we
> have a monitor program that shows the progress of the failover test and makes
> sure our cluster is as stable as we want. We are now trying to make it more
> general and will open-source it later.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
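
The Policy/Task/Action split and the validate/replay loop described in the quoted issue can be sketched roughly as follows. This is only a minimal Python illustration, not the framework's actual code (HBase and ChaosMonkey are written in Java). Every name in it (Action, Task, RandomPolicy, run_with_validation, the toy cluster dict and its row counters) is hypothetical and made up for this example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical toy stand-in for real cluster state.
Cluster = Dict[str, int]


@dataclass
class Action:
    """One named fault injection (hypothetical interface)."""
    name: str
    run: Callable[[Cluster], None]


@dataclass
class Task:
    """An ordered, recorded list of actions, kept separate so it can be replayed."""
    actions: List[Action]

    def execute(self, cluster: Cluster) -> None:
        for action in self.actions:
            action.run(cluster)


class RandomPolicy:
    """Decides which actions go into the next task."""

    def __init__(self, candidates: List[Action], seed: int) -> None:
        self.candidates = candidates
        self.rng = random.Random(seed)

    def next_task(self, size: int) -> Task:
        return Task([self.rng.choice(self.candidates) for _ in range(size)])


def consistent(cluster: Cluster) -> bool:
    """Toy data check: everything that was written must still be readable."""
    return cluster["rows_written"] == cluster["rows_readable"]


def run_with_validation(task: Task, cluster: Cluster) -> bool:
    """Validate, run the task, validate again; True means the data survived."""
    if not consistent(cluster):
        raise RuntimeError("cluster is inconsistent before the test")
    task.execute(cluster)
    return consistent(cluster)


def fresh_cluster() -> Cluster:
    return {"rows_written": 100, "rows_readable": 100}


def lose_row(cluster: Cluster) -> None:
    # Simulates a bug that loses data during failover.
    cluster["rows_readable"] -= 1


restart = Action("restart_regionserver", lambda cluster: None)
data_loss = Action("simulated_data_loss", lose_row)

# Run a few random tasks and keep the ones that broke consistency.
policy = RandomPolicy([restart, data_loss], seed=1)
failed_tasks = []
for _ in range(5):
    task = policy.next_task(size=3)
    if not run_with_validation(task, fresh_cluster()):
        failed_tasks.append(task)

# Replay each failed task against a fresh cluster to reproduce the failure.
for task in failed_tasks:
    names = [action.name for action in task.actions]
    print("replaying", names, "->", run_with_validation(task, fresh_cluster()))
```

Keeping the Task as a plain recorded list is what makes replay straightforward in this sketch: the policy's randomness is used only when the task is built, not when it runs, so the same sequence can be executed again against a fresh cluster.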