[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

chendihao (JIRA) Fri, 18 Oct 2013 07:36:03 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799133#comment-13799133
 ]


chendihao commented on HBASE-9802:
----------------------------------

Thanks for paying attention on our work. Now we're trying to seperate HBase 
things from this framework and reuse for HDFS, zookeeper and other HA services. 
Just like what [[email protected]] has said, we want to make it more generic 
and just provide an extensible framework, then everyone can implement their 
actions to inject failures in their system. 

Thank [~elserj] and we will learn more about Accumulo. Currently we use 
tc(traffic control) to simulate network delay,  dd to make disk full and other 
tools to simulate network/disk/cpu/memory failure. It would be helpful if our 
test servers provide these interfaces to use. I think we can do it generally 
and share with community.

> A new failover test framework for HBase
> ---------------------------------------
>
>                 Key: HBASE-9802
>                 URL: https://issues.apache.org/jira/browse/HBASE-9802
>             Project: HBase
>          Issue Type: Improvement
>          Components: test
>    Affects Versions: 0.94.3
>            Reporter: chendihao
>            Priority: Minor
>
> Currently HBase uses ChaosMonkey for IT test and fault injection. It will 
> restart regionserver, force balancer and perform other actions randomly and 
> periodically. However, we need a more extensible and full-featured framework 
> for our failover test and we find ChaosMonkey cant' suit our needs since it 
> has the following drawbacks.
> 1) Only process-level actions can be simulated, not support 
> machine-level/hardware-level/network-level actions.
> 2) No data validation before and after the test, the fatal bugs such as that 
> can cause data inconsistent may be overlook.
> 3) When failure occurs, we can't repro the problem and hard to figure out the 
> reason.
> Therefore, we have developed a new framework to satisfy the need of failover 
> test. We extended ChaosMonkey and implement the function to validate data and 
> to replay failed actions. Here are the features we add.
> 1) Policy/Task/Action abstraction, seperating Task from Policy and Action 
> makes it easier to manage and replay a set of actions.
> 2) Make action configurable. We have implemented some actions to cause 
> machine failure and defined the same interface as original actions.
> 3) We should validate the date consistent before and after failover test to 
> ensure the availability and data correctness.
> 4) After performing a set of actions, we also check the consistency of table 
> as well.
> 5) The set of actions that caused test failure can be replayed, and the 
> reproducibility of actions can help fixing the exposed bugs.
> Our team has developed this framework and run for a while. Some bugs were 
> exposed and fixed by running this test framework. Moreover, we have a monitor 
> program which shows the progress of failover test and make sure our cluster 
> is as stable as we want. Now we are trying to make it more general and will 
> opensource it later.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

Reply via email to