[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

2016-11-16 Thread Esteban Gutierrez (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671879#comment-15671879
 ] 

Esteban Gutierrez commented on HBASE-9802:
--

[~tobe] do you still have plans to contribute this back?

> A new failover test framework for HBase
> ---------------------------------------
>
> Key: HBASE-9802
> URL: https://issues.apache.org/jira/browse/HBASE-9802
> Project: HBase
> Issue Type: Improvement
> Components: test
> Affects Versions: 0.94.3
> Reporter: chendihao
> Assignee: chendihao
> Priority: Minor
>
> Currently HBase uses ChaosMonkey for integration tests (IT) and fault 
> injection. It restarts region servers, forces the balancer to run, and 
> performs other actions randomly and periodically. However, we need a more 
> extensible and full-featured framework for our failover tests, and we find 
> that ChaosMonkey can't meet our needs because it has the following drawbacks.
> 1) Only process-level actions can be simulated; machine-level, 
> hardware-level, and network-level actions are not supported.
> 2) There is no data validation before and after the test, so fatal bugs, 
> such as those that cause data inconsistency, may be overlooked.
> 3) When a failure occurs, we can't reproduce the problem, and it is hard to 
> figure out the cause.
> Therefore, we have developed a new framework to meet the needs of failover 
> testing. We extended ChaosMonkey and implemented functions to validate data 
> and to replay failed actions. Here are the features we added.
> 1) Policy/Task/Action abstraction. Separating Task from Policy and Action 
> makes it easier to manage and replay a set of actions.
> 2) Configurable actions. We have implemented some actions that cause 
> machine failures, using the same interface as the original actions.
> 3) Data consistency is validated before and after the failover test to 
> ensure availability and data correctness.
> 4) After performing a set of actions, we also check the consistency of the 
> table.
> 5) The set of actions that caused a test failure can be replayed, and this 
> reproducibility helps in fixing the exposed bugs.
> Our team has developed this framework and run it for a while. Some bugs were 
> exposed and fixed by running this test framework. Moreover, we have a monitor 
> program that shows the progress of the failover test and makes sure our 
> cluster is as stable as we want. Now we are trying to make it more general 
> and will open source it later.
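
The framework described in the issue has not been published in this thread, so its real classes are unknown. As a rough illustration only, the Policy/Task/Action split from feature 1 could look like the sketch below. All class and method names here are hypothetical, and the actions are plain callables rather than real fault injectors.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One fault to inject; `run` is a stand-in for the real side effect."""
    name: str
    run: Callable[[], None]


@dataclass
class Task:
    """An ordered set of actions that is executed and replayed as a unit."""
    actions: List[Action]

    def execute(self) -> List[str]:
        executed = []
        for action in self.actions:
            executed.append(action.name)  # record first, so a crash is still logged
            action.run()
        return executed


class Policy:
    """Decides which tasks run and in what order; here simply sequential."""

    def __init__(self, tasks: List[Task]):
        self.tasks = tasks

    def run_all(self) -> List[str]:
        executed = []
        for task in self.tasks:
            executed.extend(task.execute())
        return executed
```

Because a Task owns its list of actions, replaying a failed run in this sketch amounts to calling `execute()` on the same Task again.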



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

2013-10-31 Thread chendihao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810948#comment-13810948
 ] 

chendihao commented on HBASE-9802:
--

[~tlipcon] That's really helpful! Nice to see your project; we will take a 
closer look at it. Thanks again :-)  



[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

2013-10-31 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810871#comment-13810871
 ] 

Todd Lipcon commented on HBASE-9802:


Sounds similar to the "gremlins" project I hacked together a few years ago: 
http://github.com/toddlipcon/gremlins

I'm sure yours will be more extensive (gremlins was a quick hack), but worth 
checking out the above for some ideas.



[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

2013-10-31 Thread chendihao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810009#comment-13810009
 ] 

chendihao commented on HBASE-9802:
--

So far, we have found some tools to simulate hardware failures. The servers 
we want to test should provide an interface for us to call them (maybe ssh or 
an HTTP server that accepts our requests). Here is the list; any suggestions 
are welcome.

Network Delay: Use tc, like `tc qdisc add dev eth0 root netem delay 1000ms`, to 
set or remove network delay.
Network Unavailable: Use iptables to block a specific port, like `iptables -A 
OUTPUT -p tcp --dport 3306 -j DROP `.
Network Bandwidth Limit: Use tc, like `tc qdisc add dev eth0 root tbf rate 
5800kbit latency 50ms burst 1540`, to limit the bandwidth.
Disk Full: Use dd to create a really large file to fill up the disk, like `dd 
if=/dev/zero of=/$path/tst.img bs=1M count=20K `.
Disk Failure: Maybe use fiu-ctrl or `echo offline > 
/sys/block/sda/device/state` (not tested yet).
Disk Slow: Use fio to issue heavy reads or writes that put the disk under stress.
Memory Limit: Implement a program based on 
http://minuteware.net/simulating-high-memory-usage-in-linux/.
CPU Limit: Use cpulimit from https://github.com/opsengine/cpulimit.
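
To make the inject/recover pairing concrete, here is a minimal sketch that only builds the command lines and never executes them. The inject commands are the ones quoted in the list above; the recover commands (`tc qdisc del` and `iptables -D`) are the usual inverses and are an assumption of this sketch, as are the names `FAULTS` and `build_command`.

```python
import shlex

# Inject/recover command templates for two of the faults listed above.
# The recover commands are assumed inverses, not taken from the thread.
FAULTS = {
    "network_delay": {
        "inject": "tc qdisc add dev {dev} root netem delay {delay_ms}ms",
        "recover": "tc qdisc del dev {dev} root netem",
    },
    "network_unavailable": {
        "inject": "iptables -A OUTPUT -p tcp --dport {port} -j DROP",
        "recover": "iptables -D OUTPUT -p tcp --dport {port} -j DROP",
    },
}


def build_command(fault, phase, **params):
    """Return the argv list for a fault phase without executing anything."""
    template = FAULTS[fault][phase]
    return shlex.split(template.format(**params))
```

A caller with root privileges on the test machine could pass the returned list to `subprocess.run`; keeping construction separate from execution makes the mapping easy to review and to log for replay.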

Although these tools can run individually, we plan to integrate them into 
a failure-injection system for Linux. Once that is done, the failover test 
framework can trigger the failures periodically. In our environment, however, 
the client running the failover framework can't ssh to the servers directly 
(for security reasons). We are thinking of implementing an HTTP server that 
accepts requests and triggers the failures on the test machine.
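
The HTTP server mentioned above was only an idea at this point in the thread. One possible shape for it, assuming a whitelist of named faults and a `/inject/<fault>` URL scheme (both invented for this sketch), is shown below with Python's standard `http.server`. The injectors are placeholders that return a message instead of touching the system.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical whitelist: fault name -> callable that would inject it.
# Real callables would shell out to tc, iptables, dd and so on.
INJECTORS = {
    "network_delay": lambda: "network_delay injected",
    "disk_full": lambda: "disk_full injected",
}


def dispatch(path, injectors=INJECTORS):
    """Map a request path like /inject/<fault> to (status, message)."""
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "inject":
        return 400, "expected /inject/<fault>"
    injector = injectors.get(parts[1])
    if injector is None:
        return 404, "unknown fault: " + parts[1]
    return 200, injector()


class AgentHandler(BaseHTTPRequestHandler):
    """Accepts POST requests and runs only whitelisted injectors."""

    def do_POST(self):
        status, message = dispatch(self.path)
        body = message.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the agent until interrupted; call this on the machine under test."""
    HTTPServer(("", port), AgentHandler).serve_forever()
```

Restricting requests to a fixed whitelist, rather than accepting arbitrary commands, fits the security constraint that ruled out direct ssh access.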



[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

2013-10-21 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801353#comment-13801353
 ] 

Enis Soztutar commented on HBASE-9802:
--

Sounds interesting, but some of the listed items are already design points in 
current CM.
bq. 1) Only process-level actions can be simulated, not support 
machine-level/hardware-level/network-level actions.
The Action class is not specific to processes. HW actions can easily be 
implemented as subclasses. 
bq. 2) No data validation before and after the test, the fatal bugs such as 
that can cause data inconsistent may be overlook.
I think it is not the duty of the CM to verify the data, since verification of 
the data is test specific. 
bq. 3) When failure occurs, we can't repro the problem and hard to figure out 
the reason.
Agreed. There have been some discussions about choosing a seed for random 
actions, so that the actions can be replayed. We could also keep a replay log. 

Would love to see a patch. 
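
The seed idea can be illustrated in a few lines. This is a generic sketch, not ChaosMonkey code: the action names are placeholders, and `choose_actions` is a hypothetical helper. Drawing every random choice from a generator created with a recorded seed means the same seed reproduces the same action sequence.

```python
import random

# Placeholder action names; a real run would map these to Action objects.
ACTIONS = ["restart_regionserver", "force_balancer", "kill_master"]


def choose_actions(seed, count, actions=ACTIONS):
    """Pick `count` actions pseudo-randomly; the same seed yields the same sequence."""
    rng = random.Random(seed)  # private generator, unaffected by other random calls
    return [rng.choice(actions) for _ in range(count)]
```

Logging the seed at the start of a run is then enough to replay it, while the returned list itself can serve as a simple replay log.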



[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

2013-10-21 Thread chendihao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800475#comment-13800475
 ] 

chendihao commented on HBASE-9802:
--

We don't use the IT tests much and think they are less aggressive. As 
[~eclark] said, the class IntegrationTestBigLinkedListWithChaosMonkey may 
already verify data, but we treat this framework as an external tool. We 
implemented a DataValidateTool that randomly reads, puts, and deletes data 
(simulating a real client), then reads the values back from HBase and compares 
them with the expected values, which are kept in memory and are reliable. This 
makes it easy for us to validate data whenever we want (before, during, or 
after a failover test) and to ensure availability and data correctness.
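
The DataValidateTool itself is not shown in the thread. The sketch below only illustrates the approach described: apply random puts and deletes, mirror each one in an in-memory map, then read everything back and compare. `DictStore` is a stand-in for the table under test, and every name here is hypothetical.

```python
import random


class DictStore:
    """Stand-in for the table under test; a real run would call an HBase client."""

    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def delete(self, key):
        self.rows.pop(key, None)

    def get(self, key):
        return self.rows.get(key)


def run_validation(store, operations, seed=0, keys=16):
    """Apply random puts/deletes, mirror them in memory, and return mismatched keys."""
    rng = random.Random(seed)
    expected = {}
    for i in range(operations):
        key = "row-%d" % rng.randrange(keys)
        if rng.random() < 0.7:
            value = "v%d" % i
            store.put(key, value)
            expected[key] = value
        else:
            store.delete(key)
            expected.pop(key, None)
    all_keys = ["row-%d" % k for k in range(keys)]
    # A key is consistent when the store and the in-memory mirror agree,
    # including the case where both report the row as absent.
    return [k for k in all_keys if store.get(k) != expected.get(k)]
```

An empty result means every key matched; any returned key points at a lost, stale, or resurrected row.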



[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

2013-10-20 Thread Liang Xie (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800324#comment-13800324
 ] 

Liang Xie commented on HBASE-9802:
--

Dihao, you can assign this jira to yourself:)



[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

2013-10-18 Thread Elliott Clark (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799379#comment-13799379
 ] 

Elliott Clark commented on HBASE-9802:
--

bq.2) No data validation before and after the test, the fatal bugs such as that 
can cause data inconsistent may be overlook.
Can you talk a little more about this? Pretty much all of the IT tests verify 
that the data written is present and consistent. For example, Big Linked List 
writes a circular linked list of 25 million links. It then checks that all of 
the links are there before starting another iteration. The bulk load test does 
about the same, as does Test Load and Verify.
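
The idea behind the linked-list check can be shown with a deliberately simplified, single-process sketch. It is not how the HBase integration test is implemented; a plain dict stands in for the table, and both function names are invented here.

```python
def write_circular_list(table, length):
    """Store `length` nodes where each row holds the key of the next node."""
    for i in range(length):
        table[i] = (i + 1) % length


def verify_circular_list(table, length, start=0):
    """Walk from `start`; the list is intact if we return to it in exactly `length` hops."""
    seen = set()
    node = start
    for _ in range(length):
        if node not in table or node in seen:
            return False  # a missing row or a premature cycle means data was lost
        seen.add(node)
        node = table[node]
    return node == start
```

Because every node must be reached exactly once, a single lost or corrupted row breaks the walk, which is what makes this structure useful for detecting data loss after fault injection.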



[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

2013-10-18 Thread Andrew Purtell (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799297#comment-13799297
 ] 

Andrew Purtell commented on HBASE-9802:
---

Sounds great



[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

2013-10-18 Thread chendihao (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799133#comment-13799133
 ] 

chendihao commented on HBASE-9802:
--

Thanks for paying attention to our work. We are now trying to separate the 
HBase-specific parts from this framework so that it can be reused for HDFS, 
ZooKeeper, and other HA services. Just as [~ste...@apache.org] said, we want to 
make it more generic and just provide an extensible framework, so that everyone 
can implement their own actions to inject failures into their systems. 

Thanks, [~elserj]; we will learn more about Accumulo. Currently we use 
tc (traffic control) to simulate network delay, dd to fill up the disk, and 
other tools to simulate network, disk, CPU, and memory failures. It would be 
helpful if our test servers provided interfaces for these. I think we can do 
this in a general way and share it with the community.



[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

2013-10-18 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799108#comment-13799108
 ] 

Josh Elser commented on HBASE-9802:
---

bq. Accumulo has something similar, though I've not seen it

Accumulo's "continuous ingest" test harness does satisfy the data validation 
requirement (IIRC, the data ingest is a giant linked list and validation occurs 
over the traversal of that list, making sure that we aren't missing any 
pointers). However, figuring out what happened in the case of failure can still 
be rather onerous, coalescing all the logs by hand (after figuring out which 
logs to look at) to get a picture of what happened.

Failure happens at the process level as well (pkill of processes at some 
interval). Being able to do heinous things like drop an entire rack (switch 
failure) would definitely be something worthwhile for Accumulo to also pick up 
(when running said tests over enough nodes). I'm excited to see what you all 
have made.



[jira] [Commented] (HBASE-9802) A new failover test framework for HBase

2013-10-18 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799098#comment-13799098
 ] 

Steve Loughran commented on HBASE-9802:
---

This sounds interesting and potentially very useful beyond just HBase. Hadoop 
YARN applications are the obvious target, as they need to be written to expect 
failure, and if they don't get tested, well, they won't work. I ended up doing 
some basics of this with ssh and reboot operations, but I really wanted 
something that could talk to an OpenWrt base station and actually generate 
real network partitions, rather than just simulations.

# Accumulo has something similar, though I've not seen it
# Would it be possible to make this more generic? Even if it starts off in 
HBase, it could be good to have the option of branching off into its own 
project, and to allow people downstream to use it even earlier.

I'd propose making the core test framework a module that could be picked up and 
used downstream, precisely to get that cross-application testing.
