I like the idea of being able to trigger failures stochastically. I'll integrate this into whatever we settle on for Juju's failure injection.
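
A minimal sketch of what chance-based triggering could look like. The Injector type, SetChance and ShouldFail are hypothetical names made up for illustration; none of them come from the wrench package or from mgo/txn. Each named failure gets a probability, and the check fires only that fraction of the time, so retry paths get a chance to succeed.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Injector decides whether a named failure point should fire.
// Hypothetical type for illustration; not part of the wrench package.
// It is not safe for concurrent use without extra locking.
type Injector struct {
	rng    *rand.Rand
	chance map[string]float64 // failure name -> probability in [0, 1]
}

// NewInjector returns an Injector with its own seeded random source,
// so a given seed reproduces the same sequence of decisions.
func NewInjector(seed int64) *Injector {
	return &Injector{
		rng:    rand.New(rand.NewSource(seed)),
		chance: make(map[string]float64),
	}
}

// SetChance registers the probability with which the named failure fires.
func (in *Injector) SetChance(name string, p float64) {
	in.chance[name] = p
}

// ShouldFail reports whether the named failure fires on this call.
// Failures that were never registered do not fire.
func (in *Injector) ShouldFail(name string) bool {
	p, ok := in.chance[name]
	if !ok {
		return false
	}
	// Float64 returns a value in [0, 1), so p == 0 never fires
	// and p == 1 always fires.
	return in.rng.Float64() < p
}

func main() {
	in := NewInjector(1)
	in.SetChance("refuse-upgrade", 0.5)

	fired := 0
	for i := 0; i < 1000; i++ {
		if in.ShouldFail("refuse-upgrade") {
			fired++
		}
	}
	fmt.Printf("refuse-upgrade fired %d of 1000 times\n", fired)
}
```

A deterministic pattern such as "the first 2 times out of every 3" would need a per-name counter instead of a random draw; the seeded source above at least keeps a random run reproducible.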
On 14 August 2014 02:29, Gustavo Niemeyer <gustavo.nieme...@canonical.com> wrote:
> Ah, and one more thing: when developing the chaos-injection mechanism
> in the mgo/txn package, I also added a "chance" parameter for
> either killing or slowing down a given breakpoint. It sounds like it
> would be useful for juju's mechanism too. If you kill every time, it's
> hard to tell whether the system would know how to retry properly.
> Killing or slowing down just sometimes, or perhaps the first 2 times
> out of every 3, for example, would enable the system to recover
> itself, and an external agent to ensure it continues to work properly.
>
> On Wed, Aug 13, 2014 at 11:25 AM, Gustavo Niemeyer
> <gustavo.nieme...@canonical.com> wrote:
> > That's a nice direction, Menno.
> >
> > The main thing that comes to mind is that it sounds quite inconvenient
> > to turn the feature on. It may sound otherwise because it's so easy to
> > drop files at arbitrary places in our local machines, but when dealing
> > with a distributed system that knows how to spawn its own resources
> > up, suddenly the "just write a file" becomes surprisingly boring and
> > race prone.
> >
> > What about:
> >
> >     juju inject-failure [--unit=unit] [--service=service] <failure name>
> >     juju deploy [--inject-failure=name] ...
> >
> > On Wed, Aug 13, 2014 at 7:17 AM, Menno Smits <menno.sm...@canonical.com> wrote:
> >> There's been some discussion recently about adding some feature to
> >> Juju to allow developers or CI tests to intentionally trigger
> >> otherwise hard to induce failures in specific parts of Juju. The idea
> >> is that sometimes we need some kind of failure to happen in a CI test
> >> or when manually testing but those failures can often be hard to make
> >> happen.
> >>
> >> For example, for changes to Juju's upgrade mechanics that I'm working
> >> on at the moment I would like to ensure that an upgrade is cleanly
> >> aborted if one of the state servers in a HA environment refuses to
> >> start the upgrade. This logic is well unit tested but there's nothing
> >> like seeing it actually work in a real environment to build
> >> confidence - however, it isn't easy to make a state server misbehave
> >> in this way.
> >>
> >> To help with this kind of testing scenario, I've created a new
> >> top-level package called "wrench" which lets us "drop a wrench in the
> >> works" so to speak. It's very simple with one main API which can be
> >> called from judiciously chosen points in Juju's execution to decide
> >> whether some failure should be triggered.
> >>
> >> The module looks for files in $jujudatadir/wrench (typically
> >> /var/lib/juju/wrench) on the local machine. If I wanted to trigger the
> >> upgrade failure described above I could drop a file in that directory
> >> on one of the state servers named say "machine-agent" with the
> >> content:
> >>
> >>     refuse-upgrade
> >>
> >> Then in some part of jujud's upgrade code there could be a check like:
> >>
> >>     if wrench.IsActive("machine-agent", "refuse-upgrade") {
> >>         // trigger the failure
> >>     }
> >>
> >> The idea is this check would be left in the code to aid CI tests and
> >> future manual tests.
> >>
> >> You can see the incomplete wrench package here:
> >> https://github.com/juju/juju/pull/508
> >>
> >> There are a few issues to nut out.
> >>
> >> 1. It needs to be difficult/impossible for someone to accidentally or
> >> maliciously activate this feature, especially in production
> >> environments. I have almost finished (but not pushed to GitHub) some
> >> changes to the wrench package which make it strict about the ownership
> >> and permissions on the wrench files. This should make it harder for
> >> the wrong person to drop files into the wrench directory.
> >>
> >> The idea has also been floated to only enable this functionality in
> >> non-stable builds. This certainly gives a good level of protection but
> >> I'm slightly wary of this approach because it makes it impossible for
> >> CI to take advantage of the wrench feature when testing stable release
> >> builds. I'm happy to be convinced that the benefit is worth the cost.
> >>
> >> Other ideas on how to better handle this are very welcome.
> >>
> >> 2. The wrench functionality needs to be disabled during unit test runs
> >> because we don't want any wrench files a developer may have lying
> >> around to affect Juju's behaviour during test runs. The wrench package
> >> has a global on/off switch so I plan on switching it off in
> >> BaseSuite's setup or similar.
> >>
> >> 3. The name is a bikeshedding magnet :) Other names that have been
> >> bandied about for this feature are "chaos" and "spanner". I don't care
> >> too much so if there's a strong consensus for another name let's use
> >> that. I chose "wrench" over "spanner" because I believe that's the
> >> more common usage in the US and because Spanner is a DB from Google.
> >> Let's not get carried away...
> >>
> >> All comments, ideas and concerns welcome.
> >>
> >> - Menno
> >>
> >> --
> >> Juju-dev mailing list
> >> Juju-dev@lists.ubuntu.com
> >> Modify settings or unsubscribe at:
> >> https://lists.ubuntu.com/mailman/listinfo/juju-dev
> >
> > --
> > gustavo @ http://niemeyer.net
>
> --
> gustavo @ http://niemeyer.net