To follow up, I've needed to apply these two patches to get my local
environment running.

https://issues.apache.org/jira/browse/HBASE-24360
https://issues.apache.org/jira/browse/HBASE-24361
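
For anyone following along, applying them locally looks roughly like the
sketch below. The patch file names are placeholders; use whatever
attachment or PR each JIRA actually carries:

  # on the branch under test
  git checkout branch-2.3
  # placeholder file names -- download the real patches from the JIRAs above
  git apply HBASE-24360.patch
  git apply HBASE-24361.patch
  # rebuild so the hbase-it artifacts pick up the fixes
  mvn clean install -DskipTests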

On Tue, May 12, 2020 at 11:52 AM Nick Dimiduk <ndimi...@apache.org> wrote:

> Thanks Zach.
>
> > It actually performs even worse in this case in my experience, since
> > Chaos monkey can consider the failure mechanism to have failed (and
> > eventually times out) because the process is too quick to recover (or
> > the recovery fails because the process is already running). The only
> > way I was able to get it to run was to disable the process that
> > automatically restarts killed processes in my system.
>
> Interesting observation.
>
> > This brings up a discussion on whether the ITBLL (or whatever process)
> > should even continue if either a killing or recovering action failed.
> > I would argue that invalidates the entire test, but it might not be
> > obvious it failed unless you were watching the logs as it went.
>
> I'm coming to a similar conclusion -- failure in the orchestration layer
> should invalidate the test.
>
> On Thu, May 7, 2020 at 5:27 PM Zach York <zyork.contribut...@gmail.com>
> wrote:
>
>> I should note that I was using HBase 2.2.3 to test.
>>
>> On Thu, May 7, 2020 at 5:26 PM Zach York <zyork.contribut...@gmail.com>
>> wrote:
>>
>> > I recently ran ITBLL with Chaos monkey[1] against a real HBase
>> > installation (EMR). I initially tried to run it locally, but couldn't
>> > get it working and eventually gave up.
>> >
>> > > So I'm curious if this matches others' experience running the monkey.
>> > > For example, do you have an environment more resilient than mine, one
>> > > where an external actor is restarting downed processes without the
>> > > monkey action's involvement?
>> >
>> > It actually performs even worse in this case in my experience, since
>> > Chaos monkey can consider the failure mechanism to have failed (and
>> > eventually times out) because the process is too quick to recover (or
>> > the recovery fails because the process is already running). The only
>> > way I was able to get it to run was to disable the process that
>> > automatically restarts killed processes in my system.
>> >
>> > One other thing I hit was that the validation for a suspended process
>> > was incorrect, so if chaos monkey tried to suspend a process the run
>> > would fail. I'll put up a JIRA for that.
>> >
>> > This brings up a discussion on whether the ITBLL (or whatever process)
>> > should even continue if either a killing or recovering action failed.
>> > I would argue that invalidates the entire test, but it might not be
>> > obvious it failed unless you were watching the logs as it went.
>> >
>> > Thanks,
>> > Zach
>> >
>> >
>> > [1] sudo -u hbase hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList \
>> >       -m serverKilling loop 4 2 1000000 ${RANDOM} 10
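>> >
>> > (A note on the positional arguments to "loop", from my reading of the
>> > Loop usage string -- worth double-checking against the usage output of
>> > your own build:
>> >
>> >   # loop <num iterations> <num mappers> <num nodes per mapper> \
>> >   #      <output dir> <num reducers>
>> >   # i.e. 4 iterations, 2 mappers, 1M nodes per mapper,
>> >   #      a random output dir, and 10 reducers
>> > )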
>> >
>> > On Thu, May 7, 2020 at 5:05 PM Nick Dimiduk <ndimi...@apache.org>
>> wrote:
>> >
>> >> Hello,
>> >>
>> >> Does anyone have recent experience running Chaos Monkey? Are you
>> >> running against an external cluster, or one of the other modes? What
>> >> monkey factory are you using? Any property overrides? A non-default
>> >> ClusterManager?
>> >>
>> >> I'm trying to run ITBLL with chaos against branch-2.3 and I'm not
>> >> having much luck. My environment is an "external" cluster, 4 racks of
>> >> 4 hosts each, the relatively simple "serverKilling" factory with
>> >> `rolling.batch.suspend.rs.ratio = 0.0`. So, randomly kill various
>> >> hosts on various schedules, plus some balancer play mixed in; no
>> >> process suspension.
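>> >>
>> >> (For concreteness, the setup looks roughly like the sketch below. Treat
>> >> it as a sketch: in particular, the -monkeyProps flag name is from
>> >> memory, so verify it against the IntegrationTestBase usage text before
>> >> relying on it.
>> >>
>> >>   # monkey.properties -- property overrides for the serverKilling factory
>> >>   rolling.batch.suspend.rs.ratio=0.0
>> >>
>> >>   # run ITBLL with the serverKilling monkey and the override above
>> >>   hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList \
>> >>     -m serverKilling -monkeyProps monkey.properties \
>> >>     loop 4 2 1000000 ${RANDOM} 10
>> >> )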
>> >>
>> >> Running for any length of time (~30 minutes), the chaos monkey
>> >> eventually terminates between a majority and all of the hosts in the
>> >> cluster. My logs are peppered with warnings such as the one below.
>> >> There are other variants. As far as I can tell, actions are intended
>> >> to cause some harm and then restore state after themselves. In
>> >> practice, the harm is successful but restoration rarely succeeds.
>> >> Mostly these actions are "safeguarded" by this 60-sec timeout. The
>> >> result is a methodical termination of the cluster.
>> >>
>> >> So I'm curious if this matches others' experience running the monkey.
>> >> For example, do you have an environment more resilient than mine, one
>> >> where an external actor is restarting downed processes without the
>> >> monkey action's involvement? Is the monkey designed to run only in
>> >> such an environment? These timeouts are configurable; are you
>> >> cranking them way up?
>> >>
>> >> Any input you have would be greatly appreciated. This is my last major
>> >> action item blocking initial 2.3.0 release candidates.
>> >>
>> >> Thanks,
>> >> Nick
>> >>
>> >> 20/05/05 21:19:29 WARN policies.Policy: Exception occurred during
>> >> performing action: java.io.IOException: did timeout 60000ms waiting for
>> >> region server to start: host-a.example.com
>> >>         at org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:163)
>> >>         at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:228)
>> >>         at org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.gracefulRestartRs(RestartActionBaseAction.java:70)
>> >>         at org.apache.hadoop.hbase.chaos.actions.GracefulRollingRestartRsAction.perform(GracefulRollingRestartRsAction.java:61)
>> >>         at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:50)
>> >>         at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
>> >>         at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
>> >>         at java.base/java.lang.Thread.run(Thread.java:834)
>> >>
>> >
>>
>
