Author: stevel
Date: Tue Aug 12 13:19:35 2014
New Revision: 1617468

URL: http://svn.apache.org/r1617468
Log:
SLIDER-77 document windowed failure policy

Added:
    incubator/slider/site/trunk/content/release_notes/release-0.50.0.md
Modified:
    incubator/slider/site/trunk/content/developing/building.md
    incubator/slider/site/trunk/content/docs/manpage.md
    incubator/slider/site/trunk/content/docs/slider_specs/resource_specification.md

Modified: incubator/slider/site/trunk/content/developing/building.md
URL: 
http://svn.apache.org/viewvc/incubator/slider/site/trunk/content/developing/building.md?rev=1617468&r1=1617467&r2=1617468&view=diff
==============================================================================
--- incubator/slider/site/trunk/content/developing/building.md (original)
+++ incubator/slider/site/trunk/content/developing/building.md Tue Aug 12 13:19:35 2014
@@ -146,7 +146,7 @@ then
     git checkout -b apache/0.98
 or
 
-    git checkout tags/0.98.1
+    git checkout tags/0.98.4
     
 If you have already been building versions of HBase, remove the existing
 set of artifacts for safety:
@@ -165,7 +165,7 @@ property of`/pom.xml`:
 This will create an hbase `tar.gz` file in the directory `hbase-assembly/target/`
 in the hbase source tree. 
 
-    export HBASE_VERSION=0.98.1
+    export HBASE_VERSION=0.98.4
     
     pushd hbase-assembly/target
     gunzip hbase-$HBASE_VERSION-bin.tar.gz 

Modified: incubator/slider/site/trunk/content/docs/manpage.md
URL: 
http://svn.apache.org/viewvc/incubator/slider/site/trunk/content/docs/manpage.md?rev=1617468&r1=1617467&r2=1617468&view=diff
==============================================================================
--- incubator/slider/site/trunk/content/docs/manpage.md (original)
+++ incubator/slider/site/trunk/content/docs/manpage.md Tue Aug 12 13:19:35 2014
@@ -332,11 +332,28 @@ Examples
     slider freeze instance2 --force --message "maintenance session"
 
 
-### `list <name>`
+### `list [name] [--live] [--history]`
 
-List running Slider application instances visible to the user.
+List Slider application instances visible to the user. Historical instances
+are instances which YARN is aware of, but which are not currently running.
+Live instances are application instances which are running; there can be
+at most one of these with a specific name.
 
-If an instance name is given and there is no running instance with that name, an error is returned. 
+If no instance name is specified, all instances matching the criteria are listed.
+
+1. `--live` indicates that live instances are to be listed.
+1. `--history` indicates that historical instances are to be listed.
+
+The default is to list all instances (equivalent to `--live --history`).
+
+If an instance name is given, then that instance must exist.
+
+1. If `--live` is set then the application must be live. This is the default
+policy.
+1. If `--history` is set then the instance must be in the historical list.
+ 
+If there is no running instance with that name, an error is returned.
+ 
 
 Example
 

Modified: incubator/slider/site/trunk/content/docs/slider_specs/resource_specification.md
URL: 
http://svn.apache.org/viewvc/incubator/slider/site/trunk/content/docs/slider_specs/resource_specification.md?rev=1617468&r1=1617467&r2=1617468&view=diff
==============================================================================
--- incubator/slider/site/trunk/content/docs/slider_specs/resource_specification.md (original)
+++ incubator/slider/site/trunk/content/docs/slider_specs/resource_specification.md Tue Aug 12 13:19:35 2014
@@ -34,6 +34,8 @@ Sample:
       "metadata" : {
       },
       "global" : {
+        "yarn.container.failure.threshold":"10",
+        "yarn.container.failure.window.hours":"1"
       },
       "components" : {
         "HBASE_MASTER" : {
@@ -51,3 +53,68 @@ Sample:
       }
     }
 
+## Container Failure Policy
+
+YARN containers hosting component instances may fail. This can happen because of:
+
+1. A problem in the configuration of the instance.
+1. A problem in the app package.
+1. A problem (hardware, software or networking) in the server hosting the container.
+1. Conflict for a resource (usually a network port) between the component instance
+and another running program.
+1. The server or the network connection to it failing.
+1. The server being taken down for maintenance.
+
+Slider reacts to a failed container by requesting a new container from YARN,
+preferably on a host that has already hosted an instance of that role. Once
+the container is allocated, Slider will redeploy an instance of the component.
+Because YARN may not have the resources to allocate the container immediately,
+it may take some time for the replacement to be instantiated.
+
+Slider tracks failures in an attempt to differentiate problems in the
+application package or its configuration from those of the underlying servers.
+If a component fails *too many times* then Slider considers the application
+itself as failing, and halts.
+
+This leads to the question: what is too many times?
+
+The limits are defined in `resources.json`:
+1. The duration of a failure window, a time period in which failures are counted.
+This duration can span days.
+1. The maximum number of failures of any component in this time period.
+
+
+The parameters defining the failure policy are as follows.
+
+* `yarn.container.failure.threshold`
+
+The threshold for failures. If set to "0" there are no limits on
+the number of times containers may fail.
+
+
+* `yarn.container.failure.window.days`, `yarn.container.failure.window.hours`
+and `yarn.container.failure.window.minutes`
+
+These properties define the duration of the window; they are all combined
+so the window is, in minutes:
+
+    minutes + 60 * (hours + days * 24)
+
+The initial window is measured from the
+start of the application master. Once the duration of that window
+is exceeded, all failure counts are reset, and the window begins again.
+ 
+If the AM itself fails, the failure counts are reset and the window is
+restarted.
+
+### Recommended values
+
+We recommend having a duration of a few hours for the window, and a failure limit
+*larger than the number of instances of that component*.
+
+Why? 
+
+This will cover the loss of a large portion of the hardware of the cluster by
+trying to reinstantiate all the components. Yet, if a component does fail
+repeatedly, eventually Slider will conclude that there is a problem and fail
+with the exit code 73, `EXIT_DEPLOYMENT_FAILED`.
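
The window arithmetic and reset rule documented in the resource_specification.md hunk above can be sketched in Python. This is a minimal sketch under stated assumptions, not Slider's implementation: `window_minutes`, `FailureWindow` and `record_failure` are hypothetical names. The sketch also assumes that the reset is applied lazily when the next failure is recorded, that a window of 0 minutes means counts are never reset, and that the limit is exceeded only once a component's count rises above the threshold.

```python
from dataclasses import dataclass, field


def window_minutes(days: int = 0, hours: int = 0, minutes: int = 0) -> int:
    """Combine the three window properties into one duration in minutes."""
    return minutes + 60 * (hours + days * 24)


@dataclass
class FailureWindow:
    """Hypothetical per-component failure counter (illustration only)."""
    threshold: int         # yarn.container.failure.threshold; 0 means no limit
    window: int            # window length in minutes; 0 assumed to mean "never reset"
    window_start: int = 0  # minutes since the application master started
    failures: dict = field(default_factory=dict)

    def record_failure(self, component: str, now: int) -> bool:
        """Count one failure at minute `now`; return True if the limit is exceeded."""
        if self.window > 0 and now - self.window_start >= self.window:
            # The window has elapsed: reset every count and begin a new window.
            self.failures.clear()
            elapsed_windows = (now - self.window_start) // self.window
            self.window_start += elapsed_windows * self.window
        count = self.failures.get(component, 0) + 1
        self.failures[component] = count
        # Assumption: "too many" means strictly more failures than the threshold.
        return self.threshold > 0 and count > self.threshold
```

With the sample values from the hunk (threshold 10, window of 1 hour), `window_minutes(hours=1)` gives 60, and under the assumptions above the eleventh failure of one component inside the same hour is the first to exceed the limit.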

Added: incubator/slider/site/trunk/content/release_notes/release-0.50.0.md
URL: 
http://svn.apache.org/viewvc/incubator/slider/site/trunk/content/release_notes/release-0.50.0.md?rev=1617468&view=auto
==============================================================================
--- incubator/slider/site/trunk/content/release_notes/release-0.50.0.md (added)
+++ incubator/slider/site/trunk/content/release_notes/release-0.50.0.md Tue Aug 12 13:19:35 2014
@@ -0,0 +1,51 @@
+<!---
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+  
+# Apache Slider Release 0.50.0 (incubating)
+
+August 2014
+
+This release is built against Apache Hadoop 2.4.1.
+Download: []()
+
+
+## Key changes
+
+1.
+
+## Incompatible Changes
+
+### [SLIDER-77](https://issues.apache.org/jira/browse/SLIDER-77): use a window for tracking container failures.
+
+Previously a simple threshold, `"internal.container.failure.threshold"`, set the
+limit for the number of container failures tolerated for the life of an application.
+
+This has now been reworked to support:
+1. a time-bounded window for failures
+1. placement in `resources.json` as `"yarn.container.failure.threshold"`
+1. reset/changing during cluster flex
+1. configuration in a combination of days, hours and minutes
+
+This is a major change, and is necessary to support long-lived applications with
+a slow failure rate, while still detecting and reacting to the situation where
+many containers are failing in a short period of time.
+
+Because the property name has changed, any cluster where the threshold had been changed from
+the default (which is still five) will not pick up that setting. Please use the
+new name and set the value in the global section of `resources.json`.
+
+## Other changes

