[jira] [Commented] (HBASE-9802) A new failover test framework for HBase
[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671879#comment-15671879 ]

Esteban Gutierrez commented on HBASE-9802:
------------------------------------------

[~tobe] do you still have plans to contribute this back?

> A new failover test framework for HBase
> ---------------------------------------
>
>                 Key: HBASE-9802
>                 URL: https://issues.apache.org/jira/browse/HBASE-9802
>             Project: HBase
>          Issue Type: Improvement
>          Components: test
>    Affects Versions: 0.94.3
>            Reporter: chendihao
>            Assignee: chendihao
>            Priority: Minor
>
> Currently HBase uses ChaosMonkey for integration (IT) tests and fault
> injection. It restarts region servers, forces the balancer to run, and
> performs other actions randomly and periodically. However, we need a more
> extensible and full-featured framework for our failover tests, and we find
> that ChaosMonkey can't meet our needs because it has the following drawbacks.
> 1) Only process-level actions can be simulated; machine-level,
> hardware-level, and network-level actions are not supported.
> 2) There is no data validation before and after the test, so fatal bugs,
> such as those that cause data inconsistency, may be overlooked.
> 3) When a failure occurs, we can't reproduce the problem, and it is hard to
> figure out the cause.
> Therefore, we have developed a new framework to meet the needs of failover
> testing. We extended ChaosMonkey and implemented functions to validate data
> and to replay failed actions. Here are the features we added.
> 1) Policy/Task/Action abstraction. Separating Task from Policy and Action
> makes it easier to manage and replay a set of actions.
> 2) Configurable actions. We have implemented some actions that cause
> machine failures and gave them the same interface as the original actions.
> 3) We validate data consistency before and after the failover test to
> ensure availability and data correctness.
> 4) After performing a set of actions, we also check the consistency of the
> table.
> 5) The set of actions that caused a test failure can be replayed, and this
> reproducibility helps in fixing the exposed bugs.
> Our team has developed this framework and run it for a while. Some bugs were
> exposed and fixed by running it. Moreover, we have a monitor program that
> shows the progress of the failover test and makes sure our cluster is as
> stable as we want. We are now trying to make the framework more general and
> will open-source it later.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
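The Policy/Task/Action separation in the issue description can be pictured with a small sketch. The framework's real classes were never posted on this issue, so every name below (Action, Task, SequentialPolicy, the check callback) is hypothetical and only illustrates why keeping a Task as an ordered list of actions makes a failing set easy to replay.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    """One failure to inject; `perform` is a hypothetical callback."""
    name: str
    perform: Callable[[], None]


@dataclass
class Task:
    """An ordered, named set of actions that can be run again unchanged."""
    name: str
    actions: List[Action] = field(default_factory=list)

    def run(self) -> List[str]:
        executed = []
        for action in self.actions:
            action.perform()
            executed.append(action.name)
        return executed


class SequentialPolicy:
    """Decides when tasks run; this assumed policy runs them one after another."""

    def __init__(self, tasks: List[Task]):
        self.tasks = list(tasks)
        self.history: List[str] = []

    def run(self, check: Callable[[], bool]) -> Optional[Task]:
        """Run each task, then call `check`; return the first task after
        which the check fails, or None if every check passes."""
        for task in self.tasks:
            task.run()
            self.history.append(task.name)
            if not check():
                return task
        return None
```

Because the policy hands back the failing Task itself, replaying the failure is just a second call to that task's run() method, independent of how the policy scheduled it.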
[jira] [Commented] (HBASE-9802) A new failover test framework for HBase
[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810948#comment-13810948 ]

chendihao commented on HBASE-9802:
----------------------------------

[~tlipcon] That's really helpful! It's nice to see your project, and we will take a close look at it. Thanks again :-)
[jira] [Commented] (HBASE-9802) A new failover test framework for HBase
[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810871#comment-13810871 ]

Todd Lipcon commented on HBASE-9802:
------------------------------------

Sounds similar to the "gremlins" project I hacked together a few years ago: http://github.com/toddlipcon/gremlins

I'm sure yours will be more extensive (gremlins was a quick hack), but it's worth checking out the above for some ideas.
[jira] [Commented] (HBASE-9802) A new failover test framework for HBase
[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13810009#comment-13810009 ]

chendihao commented on HBASE-9802:
----------------------------------

So far, we have found some tools to simulate the hardware-level actions. The servers we want to test should provide an interface for us to call them (maybe ssh, or an HTTP server that accepts our requests). Here is the list; any suggestions are welcome.

Network Delay: Use tc, like `tc qdisc add dev eth0 root netem delay 1000ms`, to add network delay (and to remove it afterwards).
Network Unavailable: Use iptables to block a specific port, like `iptables -A OUTPUT -p tcp --dport 3306 -j DROP`.
Network Bandwidth Limit: Use tc, like `tc qdisc add dev eth0 root tbf rate 5800kbit latency 50ms burst 1540`, to limit the bandwidth.
Disk Full: Use dd to create a very large file that fills up the disk, like `dd if=/dev/zero of=/$path/tst.img bs=1M count=20K`.
Disk Failure: Maybe use fiu-ctrl or `echo offline > /sys/block/sda/device/state` (not tested yet).
Disk Slow: Use fio to issue heavy reads or writes and put the disk under stress.
Memory Limit: Implement a program based on http://minuteware.net/simulating-high-memory-usage-in-linux/.
CPU Limit: Use cpulimit from https://github.com/opsengine/cpulimit.

Although these tools can run individually, we plan to integrate them into a failure-injection system for Linux. Once that is done, the failover test framework can trigger the failures periodically. In our situation, however, the client running the failover framework can't ssh to the servers directly (for security reasons). We are thinking of implementing an HTTP server that accepts requests and triggers the failures on the test machine.
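One way the individual tools in chendihao's comment above might be wrapped behind a single interface is sketched below. The inject commands are copied from that comment; the recover commands, the FAILURES table, and the build_command and run helpers are assumptions added for illustration, not part of the framework described in this issue.

```python
import shlex
import subprocess

# Inject templates come from the comment above. The recover templates are
# assumptions: they undo the inject step with the same tool.
FAILURES = {
    "network_delay": {
        "inject": "tc qdisc add dev {dev} root netem delay {delay_ms}ms",
        "recover": "tc qdisc del dev {dev} root",
    },
    "network_unavailable": {
        "inject": "iptables -A OUTPUT -p tcp --dport {port} -j DROP",
        "recover": "iptables -D OUTPUT -p tcp --dport {port} -j DROP",
    },
    "disk_full": {
        "inject": "dd if=/dev/zero of={path}/tst.img bs=1M count={count}",
        "recover": "rm -f {path}/tst.img",
    },
}


def build_command(failure, phase, **params):
    """Return the argv list for one phase ("inject" or "recover") of a failure.

    The template is split into tokens first and each token is formatted
    afterwards, so a parameter value can never add extra arguments.
    """
    template = FAILURES[failure][phase]
    return [token.format(**params) for token in shlex.split(template)]


def run(failure, phase, dry_run=True, **params):
    """Build the command and execute it only when dry_run is False.

    Real execution needs root privileges on the target machine.
    """
    argv = build_command(failure, phase, **params)
    if not dry_run:
        subprocess.run(argv, check=True)
    return argv
```

An HTTP front end like the one proposed in the comment could map a request such as "inject network_delay on eth0" onto a call to run(); with dry_run left at its default, the helper only returns the command it would have executed.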
[jira] [Commented] (HBASE-9802) A new failover test framework for HBase
[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13801353#comment-13801353 ]

Enis Soztutar commented on HBASE-9802:
--------------------------------------

Sounds interesting, but some of the listed items are already design points in the current CM.

bq. 1) Only process-level actions can be simulated, not support machine-level/hardware-level/network-level actions.
The Action class is not specific to processes. HW actions can easily be implemented as subclasses.

bq. 2) No data validation before and after the test, the fatal bugs such as that can cause data inconsistent may be overlook.
I think it is not the duty of the CM to verify the data, since verification of the data is test-specific.

bq. 3) When failure occurs, we can't repro the problem and hard to figure out the reason.
Agreed. There have been some discussions about choosing a seed for random actions, so that the actions can be replayed. We could also keep a replay log.

Would love to see a patch.
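The two replay ideas in Enis's comment (a seed for the random actions, and a replay log) can be illustrated with a minimal sketch. It is not ChaosMonkey code: the action names and both helper functions are hypothetical, and a real implementation would perform cluster operations where this one only records names.

```python
import random

# Hypothetical action names; the first two echo actions mentioned in the issue.
ACTIONS = ["restart_regionserver", "force_balancer", "restart_master"]


def run_random_actions(seed, count, actions=ACTIONS):
    """Pick `count` actions with a private RNG and return them as a replay log.

    A dedicated random.Random(seed) instance is used so that nothing else in
    the process can disturb the sequence.
    """
    rng = random.Random(seed)
    log = []
    for step in range(count):
        action = rng.choice(actions)
        log.append((step, action))
    return log


def replay(log, perform):
    """Re-issue the logged actions in their original order."""
    for _step, action in log:
        perform(action)
```

Recording the seed is enough to regenerate the same sequence as long as the action list and the selection code stay unchanged; the explicit log keeps working even after either of them changes, which is the argument for keeping both.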
[jira] [Commented] (HBASE-9802) A new failover test framework for HBase
[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800475#comment-13800475 ]

chendihao commented on HBASE-9802:
----------------------------------

We don't use the IT tests much, and we think they are less aggressive. As [~eclark] said, the class IntegrationTestBigLinkedListWithChaosMonkey may already verify data, but we treat this framework as an external tool. We implemented a DataValidateTool that randomly reads, puts, and deletes data (simulating a real client), then reads the values back from HBase and compares them with the expected values, which are kept in memory and treated as reliable. This is an easy way for us to validate data whenever we want (before, during, or after the failover test) and to ensure availability and data correctness.
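The validation approach described in this comment (random client operations mirrored into an in-memory expected state, then compared with what the store returns) might look roughly like the sketch below. The DataValidateTool source was not posted here, so the class names are hypothetical, and an in-memory table stands in for HBase so that the example can run without a cluster.

```python
import random


class InMemoryTable:
    """Stand-in for an HBase table; a real tool would call the HBase client."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def delete(self, key):
        self._rows.pop(key, None)

    def get(self, key):
        return self._rows.get(key)


class DataValidator:
    """Applies random operations and tracks the expected value of every row."""

    def __init__(self, table, seed=0, key_space=100):
        self.table = table
        self.expected = {}  # in-memory source of truth
        self.rng = random.Random(seed)
        self.key_space = key_space
        self.mismatches = []

    def random_operation(self):
        key = "row-%d" % self.rng.randrange(self.key_space)
        op = self.rng.choice(["put", "delete", "read"])
        if op == "put":
            value = "v-%d" % self.rng.randrange(10 ** 6)
            self.table.put(key, value)
            self.expected[key] = value
        elif op == "delete":
            self.table.delete(key)
            self.expected.pop(key, None)
        elif self.table.get(key) != self.expected.get(key):
            self.mismatches.append(key)

    def validate_all(self):
        """Return every key whose stored value differs from the expected one."""
        keys = ["row-%d" % i for i in range(self.key_space)]
        return [k for k in keys if self.table.get(k) != self.expected.get(k)]
```

Calling validate_all() before, during, and after a failover run gives the three checkpoints the comment mentions; an empty list means the table agrees with the in-memory record.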
[jira] [Commented] (HBASE-9802) A new failover test framework for HBase
[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800324#comment-13800324 ]

Liang Xie commented on HBASE-9802:
----------------------------------

Dihao, you can assign this jira to yourself :)
[jira] [Commented] (HBASE-9802) A new failover test framework for HBase
[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799379#comment-13799379 ]

Elliott Clark commented on HBASE-9802:
--------------------------------------

bq. 2) No data validation before and after the test, the fatal bugs such as that can cause data inconsistent may be overlook.

Can you talk a little more about this? Pretty much all of the IT tests verify that the data written is present and consistent. For example, Big Linked List writes a circular linked list of 25 million links. It then checks that all of the links are there before starting another iteration. The bulk load test does about the same, as does Test Load and Verify.
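The linked-list check Elliott describes can be shown at toy scale. This is not how IntegrationTestBigLinkedList is implemented; it is only a sketch of the idea, with a plain dict standing in for the table and both function names invented for the example.

```python
def write_circular_list(store, length):
    """Write `length` nodes; each node stores the key of the next one,
    and the last node points back to node 0."""
    for node in range(length):
        store[node] = (node + 1) % length


def verify_circular_list(store, start, length):
    """Walk the list from `start` and report whether every link is present.

    The walk fails on a missing node, on a cycle that closes too early or
    too late, and on a cycle that never returns to `start`.
    """
    seen = 0
    node = start
    while True:
        if node not in store:
            return False  # a row was lost
        node = store[node]
        seen += 1
        if node == start:
            return seen == length
        if seen > length:
            return False  # stuck in a cycle that skips the start node
```

A single lost row breaks the walk, which is why this style of test is sensitive to data loss during failover.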
[jira] [Commented] (HBASE-9802) A new failover test framework for HBase
[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799297#comment-13799297 ]

Andrew Purtell commented on HBASE-9802:
---------------------------------------

Sounds great
[jira] [Commented] (HBASE-9802) A new failover test framework for HBase
[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799133#comment-13799133 ]

chendihao commented on HBASE-9802:
----------------------------------

Thanks for paying attention to our work. We are now trying to separate the HBase-specific parts from this framework and reuse it for HDFS, ZooKeeper, and other HA services. As [~ste...@apache.org] said, we want to make it more generic and provide just an extensible framework, so that everyone can implement their own actions to inject failures into their systems. Thanks, [~elserj]; we will learn more about Accumulo.

Currently we use tc (traffic control) to simulate network delay, dd to fill up the disk, and other tools to simulate network, disk, CPU, and memory failures. It would be helpful if our test servers provided interfaces for these tools. I think we can make this general and share it with the community.
[jira] [Commented] (HBASE-9802) A new failover test framework for HBase
[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799108#comment-13799108 ]

Josh Elser commented on HBASE-9802:
-----------------------------------

bq. Accumulo has something similar, though I've not seen it

Accumulo's "continuous ingest" test harness does satisfy the data validation requirement (IIRC, the data ingest is a giant linked list, and validation occurs over the traversal of that list, making sure that we aren't missing any pointers). However, figuring out what happened in the case of failure can still be rather onerous: you coalesce all the logs by hand (after figuring out which logs to look at) to get a picture of what happened. Failure happens at the process level as well (pkill of processes at some interval).

Being able to do heinous things like drop an entire rack (switch failure) would definitely be something worthwhile for Accumulo to pick up as well (when running said tests over enough nodes). I'm excited to see what you all have made.
[jira] [Commented] (HBASE-9802) A new failover test framework for HBase
[ https://issues.apache.org/jira/browse/HBASE-9802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799098#comment-13799098 ]

Steve Loughran commented on HBASE-9802:
---------------------------------------

This sounds interesting and potentially very useful beyond just HBase. Hadoop YARN applications are the obvious target, as they need to be written to expect failure, and if they don't get tested, well, they won't work.

I ended up doing some basics of this with ssh and reboot operations, but I really wanted something that could talk to an OpenWrt base station and actually generate real network partitions, rather than just simulations.

# Accumulo has something similar, though I've not seen it
# would it be possible to make this more generic? Even if it starts off in HBase, it could be good to have the option of branching off into its own project, and to allow people downstream to use it even earlier.

I'd propose making the core test framework a module that could be picked up and used downstream, precisely to get that cross-application testing.