To follow up, I've needed to apply these two patches to get my local environment running.
https://issues.apache.org/jira/browse/HBASE-24360
https://issues.apache.org/jira/browse/HBASE-24361

On Tue, May 12, 2020 at 11:52 AM Nick Dimiduk <ndimi...@apache.org> wrote:

Thanks Zach.

> It actually performs even worse in this case in my experience, since
> Chaos Monkey can consider the failure mechanism to have failed (and
> eventually times out) because the process is too quick to recover (or
> the recovery fails because the process is already running). The only way
> I was able to get it to run was to disable the process that
> automatically restarts killed processes in my system.

Interesting observation.

> This brings up a discussion on whether the ITBLL (or whatever process)
> should even continue if either a killing or recovering action failed. I
> would argue that invalidates the entire test, but it might not be
> obvious it failed unless you were watching the logs as it went.

I'm coming to a similar conclusion -- failure in the orchestration layer
should invalidate the test.

On Thu, May 7, 2020 at 5:27 PM Zach York <zyork.contribut...@gmail.com> wrote:

I should note that I was using HBase 2.2.3 to test.

On Thu, May 7, 2020 at 5:26 PM Zach York <zyork.contribut...@gmail.com> wrote:

I recently ran ITBLL with Chaos Monkey [1] against a real HBase
installation (EMR). I initially tried to run it locally, but couldn't get
it working and eventually gave up.

> So I'm curious if this matches others' experience running the monkey.
> For example, do you have an environment more resilient than mine, one
> where an external actor is restarting downed processes without the
> monkey action's involvement?

It actually performs even worse in this case in my experience, since
Chaos Monkey can consider the failure mechanism to have failed (and
eventually times out) because the process is too quick to recover (or the
recovery fails because the process is already running). The only way I
was able to get it to run was to disable the process that automatically
restarts killed processes in my system.

One other thing I hit: the validation for a suspended process was
incorrect, so if Chaos Monkey tried to suspend a process, the run would
fail. I'll put up a JIRA for that.

This brings up a discussion on whether the ITBLL (or whatever process)
should even continue if either a killing or recovering action failed. I
would argue that invalidates the entire test, but it might not be obvious
it failed unless you were watching the logs as it went.

Thanks,
Zach

[1] sudo -u hbase hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList -m serverKilling loop 4 2 1000000 ${RANDOM} 10

On Thu, May 7, 2020 at 5:05 PM Nick Dimiduk <ndimi...@apache.org> wrote:

Hello,

Does anyone have recent experience running Chaos Monkey? Are you running
against an external cluster, or one of the other modes? What monkey
factory are you using? Any property overrides? A non-default
ClusterManager?

I'm trying to run ITBLL with chaos against branch-2.3 and I'm not having
much luck. My environment is an "external" cluster, 4 racks of 4 hosts
each, with the relatively simple "serverKilling" factory and
`rolling.batch.suspend.rs.ratio = 0.0`. So: randomly kill various hosts
on various schedules, plus some balancer play mixed in; no process
suspension.
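For concreteness, the override rides in through the standard monkey
properties plumbing via -monkeyProps. A minimal sketch of that shape (the
properties file and the loop arguments here are illustrative, not my
exact invocation; the trailing positionals are the usual Loop ones: num
iterations, num mappers, nodes per mapper, output dir, num reducers):

  $ cat monkey.properties
  rolling.batch.suspend.rs.ratio=0.0

  $ hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList \
      -m serverKilling -monkeyProps monkey.properties \
      loop 1 4 1000000 /tmp/itbll-out 10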
Running for any length of time (~30 minutes), the chaos monkey eventually
terminates between a majority and all of the hosts in the cluster. My
logs are peppered with warnings such as the one below; there are other
variants. As far as I can tell, actions are intended to cause some harm
and then restore state after themselves. In practice, the harm is
successful but the restoration rarely is. Mostly these actions are
"safeguarded" by this 60-sec timeout. The result is a methodical
termination of the cluster.

So I'm curious if this matches others' experience running the monkey. For
example, do you have an environment more resilient than mine, one where
an external actor is restarting downed processes without the monkey
action's involvement? Is the monkey designed to run only in such an
environment? These timeouts are configurable; are you cranking them way
up?

Any input you have would be greatly appreciated. This is my last major
action item blocking initial 2.3.0 release candidates.

Thanks,
Nick

20/05/05 21:19:29 WARN policies.Policy: Exception occurred during
performing action: java.io.IOException: did timeout 60000ms waiting for
region server to start: host-a.example.com
  at org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:163)
  at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:228)
  at org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.gracefulRestartRs(RestartActionBaseAction.java:70)
  at org.apache.hadoop.hbase.chaos.actions.GracefulRollingRestartRsAction.perform(GracefulRollingRestartRsAction.java:61)
  at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:50)
  at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
  at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
  at java.base/java.lang.Thread.run(Thread.java:834)
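P.S. For anyone who does want to crank those timeouts: if I'm reading
o.a.h.h.chaos.actions.Action right, the kill/start waits are read from
the same monkey properties file, so something along these lines should
work (key names from memory -- verify against Action.java on your branch
before relying on them):

  # monkey.properties -- assumed key names, double-check Action.java
  hbase.chaosmonkey.action.startrstimeout=300000
  hbase.chaosmonkey.action.killrstimeout=300000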