Re: Intentionally introducing failures into Juju

Menno Smits Wed, 13 Aug 2014 17:38:56 -0700

I like the idea of being able to trigger failures stochastically. I'll
integrate this into whatever we settle on for Juju's failure injection.
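
To make the idea concrete, here is a minimal sketch of two ways a failure
could be triggered only some of the time: with a fixed probability, or for
the first N calls out of every M. It is not the mgo/txn implementation and
not part of the wrench package. The names newInjector and firstNOfEvery,
and the seed parameter, are assumptions made up for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newInjector returns a function that reports true with probability
// chance (0.0 = never, 1.0 = always). The seed is explicit so that a
// test run can be reproduced. Hypothetical helper, not an existing API.
func newInjector(seed int64, chance float64) func() bool {
	rng := rand.New(rand.NewSource(seed))
	return func() bool {
		// Float64 returns a value in [0.0, 1.0).
		return rng.Float64() < chance
	}
}

// firstNOfEvery returns a function that reports true for the first n
// calls in every cycle of period calls, e.g. the first 2 out of every 3.
// Hypothetical helper, not an existing API.
func firstNOfEvery(n, period int) func() bool {
	calls := 0
	return func() bool {
		fail := calls%period < n
		calls++
		return fail
	}
}

func main() {
	shouldFail := newInjector(1, 0.25)
	injected := 0
	for i := 0; i < 1000; i++ {
		if shouldFail() {
			injected++
		}
	}
	fmt.Printf("injected %d failures in 1000 attempts\n", injected)

	next := firstNOfEvery(2, 3)
	for i := 0; i < 6; i++ {
		fmt.Print(next(), " ")
	}
	fmt.Println()
}
```

With either form, a retry loop around the breakpoint eventually gets
through, so an external check can confirm that the system recovered.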



On 14 August 2014 02:29, Gustavo Niemeyer <gustavo.nieme...@canonical.com>
wrote:

> Ah, and one more thing: when developing the chaos-injection mechanism
> in the mgo/txn package, I also added a "chance" parameter for both
> killing and slowing down at a given breakpoint. It sounds like it
> would be useful for juju's mechanism too. If you kill every time, it's
> hard to tell whether the system would know how to retry properly.
> Killing or slowing down just sometimes, or perhaps the first 2 times
> out of every 3, for example, would enable the system to recover
> itself, and an external agent to ensure it continues to work properly.
>
> On Wed, Aug 13, 2014 at 11:25 AM, Gustavo Niemeyer
> <gustavo.nieme...@canonical.com> wrote:
> > That's a nice direction, Menno.
> >
> > The main thing that comes to mind is that it sounds quite inconvenient
> > to turn the feature on. It may sound otherwise because it's so easy to
> > drop files at arbitrary places on our local machines, but when dealing
> > with a distributed system that knows how to spin up its own resources,
> > suddenly the "just write a file" approach becomes surprisingly tedious
> > and race-prone.
> >
> > What about:
> >
> >     juju inject-failure [--unit=unit] [--service=service] <failure name>
> >     juju deploy [--inject-failure=name] ...
> >
> >
> >
> > On Wed, Aug 13, 2014 at 7:17 AM, Menno Smits <menno.sm...@canonical.com> wrote:
> >> There's been some discussion recently about adding some feature to Juju
> >> to allow developers or CI tests to intentionally trigger otherwise hard
> >> to induce failures in specific parts of Juju. The idea is that sometimes
> >> we need some kind of failure to happen in a CI test or when manually
> >> testing but those failures can often be hard to make happen.
> >>
> >> For example, for the changes to Juju's upgrade mechanics that I'm working
> >> on at the moment, I would like to ensure that an upgrade is cleanly
> >> aborted if one of the state servers in an HA environment refuses to start
> >> the upgrade. This logic is well unit tested but there's nothing like
> >> seeing it actually work in a real environment to build confidence.
> >> However, it isn't easy to make a state server misbehave in this way.
> >>
> >> To help with this kind of testing scenario, I've created a new top-level
> >> package called "wrench" which lets us "drop a wrench in the works" so to
> >> speak. It's very simple with one main API which can be called from
> >> judiciously chosen points in Juju's execution to decide whether some
> >> failure should be triggered.
> >>
> >> The module looks for files in $jujudatadir/wrench (typically
> >> /var/lib/juju/wrench) on the local machine. If I wanted to trigger the
> >> upgrade failure described above I could drop a file in that directory on
> >> one of the state servers named say "machine-agent" with the content:
> >>
> >> refuse-upgrade
> >>
> >> Then in some part of jujud's upgrade code there could be a check like:
> >>
> >> if wrench.IsActive("machine-agent", "refuse-upgrade") {
> >>      // trigger the failure
> >> }
> >>
> >> The idea is this check would be left in the code to aid CI tests and
> >> future manual tests.
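
Inline note: a minimal sketch of how a check like this could work, assuming
one file per category with one failure name per line. This is not the code
in the pull request. The function isActive, its explicit dir parameter, and
the demo helper are hypothetical and exist only for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isActive reports whether the file dir/category contains a line equal
// to feature. A missing or unreadable file means "not active", so the
// check is a no-op on machines without wrench files.
func isActive(dir, category, feature string) bool {
	f, err := os.Open(filepath.Join(dir, category))
	if err != nil {
		return false
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if strings.TrimSpace(scanner.Text()) == feature {
			return true
		}
	}
	return false
}

// demo writes a wrench file into a temporary directory and queries it
// for one failure name that is listed and one that is not.
func demo() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "wrench")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "machine-agent")
	if err := os.WriteFile(path, []byte("refuse-upgrade\n"), 0600); err != nil {
		return false, false, err
	}
	listed := isActive(dir, "machine-agent", "refuse-upgrade")
	unlisted := isActive(dir, "machine-agent", "some-other-failure")
	return listed, unlisted, nil
}

func main() {
	listed, unlisted, err := demo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("refuse-upgrade active:", listed)
	fmt.Println("some-other-failure active:", unlisted)
}
```

Reading the file on every call keeps the sketch simple and lets a tester
add or remove a wrench file while the agent is running.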
> >>
> >> You can see the incomplete wrench package here:
> >> https://github.com/juju/juju/pull/508
> >>
> >> There are a few issues to nut out.
> >>
> >> 1. It needs to be difficult/impossible for someone to accidentally or
> >> maliciously activate this feature, especially in production
> >> environments. I have almost finished (but not pushed to GitHub) some
> >> changes to the wrench package which make it strict about the ownership
> >> and permissions on the wrench files. This should make it harder for the
> >> wrong person to drop files into the wrench directory.
> >>
> >> The idea has also been floated to only enable this functionality in
> >> non-stable builds. This certainly gives a good level of protection but
> >> I'm slightly wary of this approach because it makes it impossible for CI
> >> to take advantage of the wrench feature when testing stable release
> >> builds. I'm happy to be convinced that the benefit is worth the cost.
> >>
> >> Other ideas on how to better handle this are very welcome.
> >>
> >> 2. The wrench functionality needs to be disabled during unit test runs
> >> because we don't want any wrench files a developer may have lying around
> >> to affect Juju's behaviour during test runs. The wrench package has a
> >> global on/off switch so I plan on switching it off in BaseSuite's setup
> >> or similar.
> >>
> >> 3. The name is a bikeshedding magnet :)  Other names that have been
> >> bandied about for this feature are "chaos" and "spanner". I don't care
> >> too much so if there's a strong consensus for another name let's use
> >> that. I chose "wrench" over "spanner" because I believe that's the more
> >> common usage in the US and because Spanner is a DB from Google. Let's not
> >> get carried away...
> >>
> >> All comments, ideas and concerns welcome.
> >>
> >> - Menno
> >>
> >>
> >>
> >> --
> >> Juju-dev mailing list
> >> Juju-dev@lists.ubuntu.com
> >> Modify settings or unsubscribe at:
> >> https://lists.ubuntu.com/mailman/listinfo/juju-dev
> >>
> >
> > --
> > gustavo @ http://niemeyer.net
>
>
>
> --
> gustavo @ http://niemeyer.net
>

-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev
