Author: stevel
Date: Wed Aug 13 18:14:07 2014
New Revision: 1617787

URL: http://svn.apache.org/r1617787
Log:
SLIDER-202: Chaos monkey documentation

Added:
    incubator/slider/site/trunk/content/docs/slider_specs/chaosmonkey.md
Modified:
    incubator/slider/site/trunk/content/docs/slider_specs/index.md
    incubator/slider/site/trunk/content/release_notes/release-0.50.0.md

Added: incubator/slider/site/trunk/content/docs/slider_specs/chaosmonkey.md
URL: 
http://svn.apache.org/viewvc/incubator/slider/site/trunk/content/docs/slider_specs/chaosmonkey.md?rev=1617787&view=auto
==============================================================================
--- incubator/slider/site/trunk/content/docs/slider_specs/chaosmonkey.md (added)
+++ incubator/slider/site/trunk/content/docs/slider_specs/chaosmonkey.md Wed Aug 13 18:14:07 2014
@@ -0,0 +1,157 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+  
+   http://www.apache.org/licenses/LICENSE-2.0
+  
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+  
+# Slider Chaos Monkey
+
+Slider includes a built-in [Chaos Monkey](http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html).
+This is a service which runs inside the Slider Application Master and randomly
+kills containers, or even the Application Master itself, without any warning.
+
+This Chaos Monkey is intended for testing. Netflix's original design runs
+against all their production applications hosted in Amazon's cloud. The Slider
+Chaos Monkey may also be used in production, if desired, though it is not something
+we would recommend, especially for any application where the cost of
+reacting to or recovering from failures is tangible.
+
+If used in production, note that YARN needs to be configured
+to tolerate the failure rate, and that the Slider failure threshold (`yarn.container.failure.threshold`)
+and window need to be configured to tolerate the increased failure rate.
+
+
+The Chaos Monkey works as follows:
+
+1. The monkey wakes up at a configured interval: it "comes out to play".
+1. For each chaos action, the monkey then generates a random number.
+1. If the probability of the chaos action is greater than this random number,
+the action is performed.
+
+As an example, if the probability of killing the AM is set to 50%, and the
+monkey interval is set to one hour, then one would expect over a 48 hour period
+for the AM to have failed approximately 24 times. As the check is random, it is unlikely
+to be exactly this value, nor will the interval between failures be exactly two hours.
+
+A monkey interval of 30 minutes and an AM kill probability of 25% would result
+in the aggregate failure rate being approximately the same, but the interval between failures
+would be different.
+
+
+## Configuring the Chaos Monkey
+
+The Chaos Monkey is configured on a per-application basis, by setting options
+in the `global` section of the internal resources file, `internal.json`.
+
+Any option which takes a probability uses units of hundredths of a percent;
+that is, `10000` units are equivalent to a probability of `1`: an operation will always
+take place. A value of `100` is translated to 1%, a probability of 0.01.
+
+This unit is used to allow very small percentages to be expressed without resorting
+to floating point numbers.
+
+The monkey must first be enabled, after which the interval between checks must be set.
+Available actions can then have their individual probability set.
+
+### Enabling the Monkey
+
+The option `internal.chaos.monkey.enabled` enables or disables the monkey; it
+must equal `"true"` for the monkey to be enabled and other options read.
+
+### Interval
+
+The interval between checks to see if *any* chaos action is to be triggered
+is set by the following options, whose values are added together to produce the total interval:
+
+
+* `internal.chaos.monkey.rate.days`
+* `internal.chaos.monkey.rate.hours`
+* `internal.chaos.monkey.rate.minutes`
+* `internal.chaos.monkey.rate.seconds`
+
+
+### Application Master Kill
+
+The probability of the AM being killed on a monkey check is:
+
+    internal.chaos.monkey.probability.amfailure
+
+When the monkey triggers this action, the AM kills itself. YARN is expected to
+detect this and react by creating a new application master, while leaving the
+running application itself to continue uninterrupted.
+ 
+For the AM to recover from failures, YARN must be configured to support application
+retries.
+
+As a restarted AM resets all its internal state, the Chaos Monkey itself will be
+restarted with a new interval *which begins from the moment the AM is restarted*.
+
+
+### Container Kill
+
+The probability of a container being killed in a single monkey "play" is:
+
+    internal.chaos.monkey.probability.containerfailure
+    
+When the monkey triggers this action, the current list of active YARN containers
+being used by the application is enumerated, then one of the containers is selected
+at random to be killed.
+
+The Slider Application Master is expected to notice this event and respond by
+requesting and re-instantiating a replacement container.
+  
+The Slider Application should be configured in its `resources.json` file to
+tolerate a failure rate.
+
+If there are no containers hosting application components at the time
+the chaos monkey performs its actions, then no container will be killed.
+
+### Example
+
+
+A disabled Chaos Monkey:
+
+    {
+      "internal.chaos.monkey.enabled": "false"
+    }
+
+As this is the default, it does not need to be declared.
+
+    {
+      "internal.chaos.monkey.enabled": "true",
+      "internal.chaos.monkey.rate.hours": "1",
+      "internal.chaos.monkey.rate.minutes": "30",
+      "internal.chaos.monkey.probability.containerfailure": "1000",
+      "internal.chaos.monkey.probability.amfailure": "5"
+    }
+
+This configuration:
+
+1. Enables the Chaos Monkey.
+1. Sets the interval to 1h 30m (90 minutes).
+1. Sets the probability of a container failure to 10%.
+1. Sets the probability of an application master failure to 0.05%.
+
+With these values, over a 24 hour period there are 16 checks, so the expected number
+of containers killed is `16 * 1000 / 10000`: `1.6`.
+
+That is, at least one container is likely to have been killed over the day.
+
+The probability of the AM failing is significantly lower, approximately:
+
+    16 * 0.0005 = 0.008 = 0.8%
+
+If any probability is set to zero, such as
+
+    "internal.chaos.monkey.probability.amfailure": "0"
+
+then that check is never made: here the AM will never be killed by
+the Chaos Monkey.
\ No newline at end of file
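
As a rough cross-check of the arithmetic in the `chaosmonkey.md` example above, the following Python sketch converts the configured units into probabilities and works out the figures for a 24 hour period with a 90 minute interval. It is only an illustration of the description in the document, not Slider code: every function name here is hypothetical, and the comparison in `should_trigger` assumes the random number is drawn uniformly from the range 0 to 1.

```python
import random

# Per the document: 10000 units are equivalent to a probability of 1.
UNITS_PER_CERTAINTY = 10000


def to_probability(units: int) -> float:
    """Convert a configured value (hundredths of a percent) to a probability."""
    return units / UNITS_PER_CERTAINTY


def checks_in_period(period_minutes: int, interval_minutes: int) -> int:
    """Number of complete monkey checks that fit into the period."""
    return period_minutes // interval_minutes


def expected_triggers(units: int, checks: int) -> float:
    """Expected number of times an action fires over the given number of checks."""
    return checks * to_probability(units)


def chance_of_at_least_one(units: int, checks: int) -> float:
    """Probability that the action fires at least once over the checks."""
    return 1.0 - (1.0 - to_probability(units)) ** checks


def should_trigger(units: int, rng: random.Random) -> bool:
    """One check (assumed form): fire if the probability exceeds a random number in [0, 1)."""
    return to_probability(units) > rng.random()


if __name__ == "__main__":
    checks = checks_in_period(24 * 60, 90)
    print("checks per day:", checks)
    print("expected container kills:", expected_triggers(1000, checks))
    print("chance of >= 1 container kill:", chance_of_at_least_one(1000, checks))
    print("expected AM kills:", expected_triggers(5, checks))
    print("chance of >= 1 AM kill:", chance_of_at_least_one(5, checks))
```

The distinction between `expected_triggers` and `chance_of_at_least_one` matters for the container case: 1.6 is an expected count, while the chance of at least one kill in the day is a probability below 1. For the AM the two figures are almost identical because the per-check probability is so small.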

Modified: incubator/slider/site/trunk/content/docs/slider_specs/index.md
URL: 
http://svn.apache.org/viewvc/incubator/slider/site/trunk/content/docs/slider_specs/index.md?rev=1617787&r1=1617786&r2=1617787&view=diff
==============================================================================
--- incubator/slider/site/trunk/content/docs/slider_specs/index.md (original)
+++ incubator/slider/site/trunk/content/docs/slider_specs/index.md Wed Aug 13 18:14:07 2014
@@ -50,4 +50,5 @@ The entry points to leverage Slider are:
 - [Guidelines for Clients and Client Applications](canonical_scenarios.html)
 - [Specifications for Configuration](application_configuration.html) Default application configuration?
 - [Documentation for "General Developer Guidelines"](/developing/index.html)
+- [Configuring the Slider Chaos Monkey](chaosmonkey.html)
                

Modified: incubator/slider/site/trunk/content/release_notes/release-0.50.0.md
URL: 
http://svn.apache.org/viewvc/incubator/slider/site/trunk/content/release_notes/release-0.50.0.md?rev=1617787&r1=1617786&r2=1617787&view=diff
==============================================================================
--- incubator/slider/site/trunk/content/release_notes/release-0.50.0.md (original)
+++ incubator/slider/site/trunk/content/release_notes/release-0.50.0.md Wed Aug 13 18:14:07 2014
@@ -25,7 +25,11 @@ Download: []()
 
 ## Key changes
 
-1.
+1. Slider now has an integral Chaos Monkey [SLIDER-202](https://issues.apache.org/jira/browse/SLIDER-202).
+This can be configured to start through options in `internal.json`; it will kill
+a random container or the AM itself based on configured properties. This is intended
+for use in testing, though it may be used in production if desired, and if the
+application and YARN cluster are configured to tolerate the failures.
 
 ## Incompatible Changes
 

