[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810009#comment-13810009
 ] 

chendihao commented on HBASE-9802:
----------------------------------

>So far, we have found some tools to simulate hardware-level actions. The 
>servers we want to test should provide an interface for us to call them (maybe 
>ssh or an HTTP server that accepts our requests). I am just posting the list 
>and welcome any suggestions.

Network Delay: Use tc, e.g. `tc qdisc add dev eth0 root netem delay 1000ms`, to 
add a network delay, and remove the rule to recover.
Network Unavailable: Use iptables to block a specific port, e.g. 
`iptables -A OUTPUT -p tcp --dport 3306 -j DROP`.
Network Bandwidth Limit: Use tc, e.g. 
`tc qdisc add dev eth0 root tbf rate 5800kbit latency 50ms burst 1540`, to 
limit the bandwidth.
Disk Full: Use dd to create a very large file that fills up the disk, e.g. 
`dd if=/dev/zero of=/$path/tst.img bs=1M count=20K`.
Disk Failure: Maybe use fiu-ctrl or 
`echo offline > /sys/block/sda/device/state` (not tested yet).
Disk Slow: Use fio to read or write heavily and put the disk under stress.
Memory Limit: Implement a program based on 
http://minuteware.net/simulating-high-memory-usage-in-linux/.
CPU Limit: Use cpulimit from https://github.com/opsengine/cpulimit.

Although these tools can be run individually, we plan to integrate them into 
a failure-injection system for Linux. Once that is done, the failover test 
framework can trigger failures periodically. In our environment, however, the 
client running the failover framework can't ssh to the servers directly (for 
security reasons). We are thinking of implementing an HTTP server that accepts 
requests and triggers the failures on the test machine.
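
A minimal sketch of what such an HTTP server could look like, using only the Python standard library. Everything here is an assumption for illustration: the URL paths, the `ALLOWED` whitelist, the port, and the `make_handler`/`dry_run` names are hypothetical and not part of any existing tool. The server only maps fixed paths to fixed commands, so a client can never send an arbitrary command, and by default it just reports the command without running it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical whitelist: request path -> fixed argv. Nothing else can be run.
ALLOWED = {
    "/inject/network_delay": ["tc", "qdisc", "add", "dev", "eth0", "root",
                              "netem", "delay", "1000ms"],
    "/recover/network_delay": ["tc", "qdisc", "del", "dev", "eth0", "root"],
}


def dry_run(argv):
    """Default runner: report the command instead of executing it."""
    return {"command": argv, "executed": False}


def make_handler(runner=dry_run):
    """Build a request handler class that passes whitelisted commands to runner."""

    class FaultHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            argv = ALLOWED.get(self.path)
            if argv is None:
                # Unknown fault: refuse rather than guess.
                self.send_error(404)
                return
            body = json.dumps(runner(argv)).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            # Keep the sketch quiet; a real server would log every request.
            pass

    return FaultHandler


if __name__ == "__main__":
    # Port 8099 is an arbitrary choice for this sketch.
    HTTPServer(("127.0.0.1", 8099), make_handler()).serve_forever()
```

To actually inject a failure, a runner that calls `subprocess.run(argv, check=True)` could be passed to `make_handler`; the server would then need to run with enough privileges for tc and iptables, which is one more reason to keep the whitelist fixed.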

> A new failover test framework for HBase
> ---------------------------------------
>
>                 Key: HBASE-9802
>                 URL: https://issues.apache.org/jira/browse/HBASE-9802
>             Project: HBase
>          Issue Type: Improvement
>          Components: test
>    Affects Versions: 0.94.3
>            Reporter: chendihao
>            Priority: Minor
>
> Currently HBase uses ChaosMonkey for IT tests and fault injection. It 
> restarts regionservers, forces the balancer and performs other actions 
> randomly and periodically. However, we need a more extensible and 
> full-featured framework for our failover tests, and we find ChaosMonkey can't 
> suit our needs since it has the following drawbacks.
> 1) Only process-level actions can be simulated; machine-level, 
> hardware-level and network-level actions are not supported.
> 2) There is no data validation before and after the test, so fatal bugs such 
> as those that cause data inconsistency may be overlooked.
> 3) When a failure occurs, we can't reproduce the problem and it is hard to 
> figure out the reason.
> Therefore, we have developed a new framework to satisfy the needs of failover 
> testing. We extended ChaosMonkey and implemented functions to validate data 
> and to replay failed actions. Here are the features we added.
> 1) Policy/Task/Action abstraction: separating Task from Policy and Action 
> makes it easier to manage and replay a set of actions.
> 2) Configurable actions: we have implemented some actions that cause 
> machine failures, defined with the same interface as the original actions.
> 3) Data consistency is validated before and after the failover test to 
> ensure availability and data correctness.
> 4) After performing a set of actions, we also check the consistency of the 
> table.
> 5) The set of actions that caused a test failure can be replayed, and this 
> reproducibility helps in fixing the exposed bugs.
> Our team has developed this framework and run it for a while. Some bugs were 
> exposed and fixed by running this test framework. Moreover, we have a monitor 
> program which shows the progress of the failover test and makes sure our 
> cluster is as stable as we want. Now we are trying to make it more general 
> and will open-source it later.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
