[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810871#comment-13810871
 ] 

Todd Lipcon commented on HBASE-9802:
------------------------------------

Sounds similar to the "gremlins" project I hacked together a few years ago: 
http://github.com/toddlipcon/gremlins

I'm sure yours will be more extensive (gremlins was a quick hack), but worth 
checking out the above for some ideas.

> A new failover test framework for HBase
> ---------------------------------------
>
>                 Key: HBASE-9802
>                 URL: https://issues.apache.org/jira/browse/HBASE-9802
>             Project: HBase
>          Issue Type: Improvement
>          Components: test
>    Affects Versions: 0.94.3
>            Reporter: chendihao
>            Priority: Minor
>
> Currently HBase uses ChaosMonkey for IT test and fault injection. It will 
> restart regionserver, force balancer and perform other actions randomly and 
> periodically. However, we need a more extensible and full-featured framework 
> for our failover test and we find ChaosMonkey cant' suit our needs since it 
> has the following drawbacks.
> 1) Only process-level actions can be simulated, not support 
> machine-level/hardware-level/network-level actions.
> 2) No data validation before and after the test, the fatal bugs such as that 
> can cause data inconsistent may be overlook.
> 3) When failure occurs, we can't repro the problem and hard to figure out the 
> reason.
> Therefore, we have developed a new framework to satisfy the need of failover 
> test. We extended ChaosMonkey and implement the function to validate data and 
> to replay failed actions. Here are the features we add.
> 1) Policy/Task/Action abstraction, seperating Task from Policy and Action 
> makes it easier to manage and replay a set of actions.
> 2) Make action configurable. We have implemented some actions to cause 
> machine failure and defined the same interface as original actions.
> 3) We should validate the date consistent before and after failover test to 
> ensure the availability and data correctness.
> 4) After performing a set of actions, we also check the consistency of table 
> as well.
> 5) The set of actions that caused test failure can be replayed, and the 
> reproducibility of actions can help fixing the exposed bugs.
> Our team has developed this framework and run for a while. Some bugs were 
> exposed and fixed by running this test framework. Moreover, we have a monitor 
> program which shows the progress of failover test and make sure our cluster 
> is as stable as we want. Now we are trying to make it more general and will 
> opensource it later.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to