YARN-2994. Document work-preserving RM restart. Contributed by Jian He.

Project: http://git-wip-us.apache.org/repos/asf/hadoop/repo
Commit: http://git-wip-us.apache.org/repos/asf/hadoop/commit/eb58025c
Tree: http://git-wip-us.apache.org/repos/asf/hadoop/tree/eb58025c
Diff: http://git-wip-us.apache.org/repos/asf/hadoop/diff/eb58025c

Branch: refs/heads/HDFS-EC
Commit: eb58025cadf8bf82963f99b64a7352a2e6b45f2f
Parents: a45ef2b
Author: Tsuyoshi Ozawa <oz...@apache.org>
Authored: Fri Feb 13 13:08:13 2015 +0900
Committer: Zhe Zhang <z...@apache.org>
Committed: Mon Feb 16 10:29:48 2015 -0800

----------------------------------------------------------------------
 hadoop-yarn-project/CHANGES.txt                 |   2 +
 .../src/site/apt/ResourceManagerRestart.apt.vm  | 182 ++++++++++++++-----
 2 files changed, 138 insertions(+), 46 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hadoop/blob/eb58025c/hadoop-yarn-project/CHANGES.txt
----------------------------------------------------------------------
diff --git a/hadoop-yarn-project/CHANGES.txt b/hadoop-yarn-project/CHANGES.txt
index 41e5411..622072f 100644
--- a/hadoop-yarn-project/CHANGES.txt
+++ b/hadoop-yarn-project/CHANGES.txt
@@ -83,6 +83,8 @@ Release 2.7.0 - UNRELEASED
      YARN-2616 [YARN-913] Add CLI client to the registry to list, view
      and manipulate entries. (Akshay Radia via stevel)
 
+    YARN-2994. Document work-preserving RM restart. (Jian He via ozawa)
+
   IMPROVEMENTS
 
     YARN-3005. [JDK7] Use switch statement for String instead of if-else

http://git-wip-us.apache.org/repos/asf/hadoop/blob/eb58025c/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm
----------------------------------------------------------------------
diff --git a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm
index 30a3a64..a08c19d 100644
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/apt/ResourceManagerRestart.apt.vm
@@ -11,12 +11,12 @@
 ~~ limitations under the License. See accompanying LICENSE file.
 
   ---
-  ResourceManger Restart
+  ResourceManager Restart
   ---
   ---
   ${maven.build.timestamp}
 
-ResourceManger Restart
+ResourceManager Restart
 
 %{toc|section=1|fromDepth=0}
 
@@ -32,23 +32,26 @@ ResourceManger Restart
 
   ResourceManager Restart feature is divided into two phases:
 
-  ResourceManager Restart Phase 1: Enhance RM to persist application/attempt state
+  ResourceManager Restart Phase 1 (Non-work-preserving RM restart):
+  Enhance RM to persist application/attempt state
   and other credentials information in a pluggable state-store. RM will reload
   this information from state-store upon restart and re-kick the previously
   running applications. Users are not required to re-submit the applications.
 
-  ResourceManager Restart Phase 2:
-  Focus on re-constructing the running state of ResourceManger by reading back
-  the container statuses from NodeMangers and container requests from ApplicationMasters
+  ResourceManager Restart Phase 2 (Work-preserving RM restart):
+  Focus on re-constructing the running state of ResourceManager by combining
+  the container statuses from NodeManagers and container requests from ApplicationMasters
  upon restart. The key difference from phase 1 is that previously running applications
   will not be killed after RM restarts, and so applications won't lose its work
   because of RM outage.
 
+* {Feature}
+
+** Phase 1: Non-work-preserving RM restart
+
  As of Hadoop 2.4.0 release, only ResourceManager Restart Phase 1 is implemented which
   is described below.
 
-* {Feature}
-
   The overall concept is that RM will persist the application metadata
   (i.e. ApplicationSubmissionContext) in
  a pluggable state-store when client submits an application and also saves the final status
@@ -62,13 +65,13 @@ ResourceManger Restart
   applications if they were already completed (i.e. failed, killed, finished)
   before RM went down.
 
-  NodeMangers and clients during the down-time of RM will keep polling RM until 
+  NodeManagers and clients during the down-time of RM will keep polling RM until 
   RM comes up. When RM becomes alive, it will send a re-sync command to
-  all the NodeMangers and ApplicationMasters it was talking to via heartbeats.
-  Today, the behaviors for NodeMangers and ApplicationMasters to handle this command
+  all the NodeManagers and ApplicationMasters it was talking to via heartbeats.
+  As of Hadoop 2.4.0 release, the behaviors for NodeManagers and ApplicationMasters to handle this command
  are: NMs will kill all its managed containers and re-register with RM. From the
  RM's perspective, these re-registered NodeManagers are similar to the newly joining NMs. 
-  AMs(e.g. MapReduce AM) today are expected to shutdown when they receive the re-sync command.
+  AMs (e.g. the MapReduce AM) are expected to shut down when they receive the re-sync command.
  After RM restarts and loads all the application metadata, credentials from state-store
  and populates them into memory, it will create a new
  attempt (i.e. ApplicationMaster) for each application that was not yet completed
@@ -76,13 +79,33 @@ ResourceManger Restart
  applications' work is lost in this manner since they are essentially killed by
   RM via the re-sync command on restart.
 
-* {Configurations}
-
-  This section describes the configurations involved to enable RM Restart feature.
+** Phase 2: Work-preserving RM restart
+
+  As of Hadoop 2.6.0, we further enhanced the RM restart feature to ensure that
+  applications running on the YARN cluster are not killed when RM restarts.
+
+  Beyond the groundwork done in Phase 1 to ensure the persistence of application
+  state and reload that state on recovery, Phase 2 primarily focuses on
+  re-constructing the entire running state of the YARN cluster, the majority of
+  which is the state of the central scheduler inside RM, which keeps track of all
+  containers' life-cycle, applications' headroom and resource requests, queues'
+  resource usage, etc. In this way, RM doesn't need to kill the AM and re-run the
+  application from scratch as is done in Phase 1. Applications can simply re-sync
+  back with RM and resume from where they left off.
+
+  RM recovers its running state by taking advantage of the container statuses
+  sent from all NMs. NM will not kill the containers when it re-syncs with the
+  restarted RM. It continues managing the containers and sends the container
+  statuses across to RM when it re-registers. RM reconstructs the container
+  instances and the associated applications' scheduling status by absorbing
+  these containers' information. In the meantime, AM needs to re-send the
+  outstanding resource requests to RM because RM may lose the unfulfilled
+  requests when it shuts down. Application writers using the AMRMClient library
+  to communicate with RM do not need to worry about re-sending resource requests
+  to RM on re-sync, as it is automatically taken care of by the library itself.
 
-  * Enable ResourceManager Restart functionality.
+* {Configurations}
 
-    To enable RM Restart functionality, set the following property in <<conf/yarn-site.xml>> to true:
+** Enable RM Restart.
 
 *--------------------------------------+--------------------------------------+
 || Property                            || Value                                |
@@ -92,9 +115,10 @@ ResourceManger Restart
 
*--------------------------------------+--------------------------------------+ 
 
 
-  * Configure the state-store that is used to persist the RM state.
+** Configure the state-store for persisting the RM state.
 
-*--------------------------------------+--------------------------------------+
+
+*--------------------------------------*--------------------------------------+
 || Property                            || Description                        |
 *--------------------------------------+--------------------------------------+
 | <<<yarn.resourcemanager.store.class>>> | |
@@ -103,14 +127,36 @@ ResourceManger Restart
 | | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore>>> |
 | | , a ZooKeeper based state-store implementation and  |
 | | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>> |
-| | , a Hadoop FileSystem based state-store implementation like HDFS. |
+| | , a Hadoop FileSystem based state-store implementation like HDFS and local FS. |
+| | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore>>>, |
+| | a LevelDB based state-store implementation. |
 | | The default value is set to |
 | | <<<org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore>>>. |
 
*--------------------------------------+--------------------------------------+ 
 
-    * Configurations when using Hadoop FileSystem based state-store implementation.
+** How to choose the state-store implementation.
+
+    <<ZooKeeper based state-store>>: Users are free to pick any storage to set
+    up RM restart, but must use the ZooKeeper based state-store to support RM
+    HA. The reason is that only the ZooKeeper based state-store supports a
+    fencing mechanism to avoid a split-brain situation where multiple RMs assume
+    they are active and can edit the state-store at the same time.
+
+    <<FileSystem based state-store>>: HDFS and local FS based state-stores are
+    supported. The fencing mechanism is not supported.
 
-      Configure the URI where the RM state will be saved in the Hadoop FileSystem state-store.
+    <<LevelDB based state-store>>: The LevelDB based state-store is considered
+    more lightweight than the HDFS and ZooKeeper based state-stores. LevelDB
+    supports better atomic operations, fewer I/O ops per state update, and far
+    fewer total files on the filesystem. The fencing mechanism is not supported.
+
+** Configurations for Hadoop FileSystem based state-store implementation.
+
+    Both HDFS and local FS based state-store implementations are supported. The
+    type of file system to be used is determined by the scheme of the URI, e.g.
+    <<<hdfs://localhost:9000/rmstore>>> uses HDFS as the storage and
+    <<<file:///tmp/yarn/rmstore>>> uses local FS as the storage. If no scheme
+    (<<<hdfs://>>> or <<<file://>>>) is specified in the URI, the type of storage
+    to be used is determined by <<<fs.defaultFS>>> defined in <<<core-site.xml>>>.
+
+    Configure the URI where the RM state will be saved in the Hadoop FileSystem state-store.
 
 *--------------------------------------+--------------------------------------+
 || Property                            || Description                        |
@@ -123,8 +169,8 @@ ResourceManger Restart
 | | <<conf/core-site.xml>> will be used. |
 
*--------------------------------------+--------------------------------------+ 
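+
+    For illustration, the URI described above could be set as in the following
+    snippet. This is a sketch: the <<<yarn.resourcemanager.fs.state-store.uri>>>
+    property name is assumed here and should be verified against your Hadoop
+    version.
+
++---+
+  <property>
+    <!-- Assumed property name; URI where the RM state is saved on HDFS -->
+    <name>yarn.resourcemanager.fs.state-store.uri</name>
+    <value>hdfs://localhost:9000/rmstore</value>
+  </property>
++---+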
 
-      Configure the retry policy state-store client uses to connect with the Hadoop
-      FileSystem.
+    Configure the retry policy state-store client uses to connect with the Hadoop
+    FileSystem.
 
 *--------------------------------------+--------------------------------------+
 || Property                            || Description                        |
@@ -137,9 +183,9 @@ ResourceManger Restart
 | | Default value is (2000, 500) |
 
*--------------------------------------+--------------------------------------+ 
 
-    * Configurations when using ZooKeeper based state-store implementation.
+** Configurations for ZooKeeper based state-store implementation.
   
-      Configure the ZooKeeper server address and the root path where the RM state is stored.
+    Configure the ZooKeeper server address and the root path where the RM state is stored.
 
 *--------------------------------------+--------------------------------------+
 || Property                            || Description                        |
@@ -154,7 +200,7 @@ ResourceManger Restart
 | | Default value is /rmstore. |
 *--------------------------------------+--------------------------------------+
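+
+    For illustration, the root znode path could be set explicitly as below. This
+    is a sketch: the <<<yarn.resourcemanager.zk-state-store.parent-path>>>
+    property name is assumed here, and the value shown is simply the default
+    mentioned above.
+
++---+
+  <property>
+    <!-- Assumed property name; root znode under which the RM state is stored -->
+    <name>yarn.resourcemanager.zk-state-store.parent-path</name>
+    <value>/rmstore</value>
+  </property>
++---+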
 
-      Configure the retry policy state-store client uses to connect with the ZooKeeper server.
+    Configure the retry policy state-store client uses to connect with the ZooKeeper server.
 
 *--------------------------------------+--------------------------------------+
 || Property                            || Description                        |
@@ -175,7 +221,7 @@ ResourceManger Restart
 | | value is 10 seconds |
 *--------------------------------------+--------------------------------------+
 
-      Configure the ACLs to be used for setting permissions on ZooKeeper znodes.
+    Configure the ACLs to be used for setting permissions on ZooKeeper znodes.
 
 *--------------------------------------+--------------------------------------+
 || Property                            || Description                        |
@@ -184,25 +230,69 @@ ResourceManger Restart
 | | ACLs to be used for setting permissions on ZooKeeper znodes. Default value is <<<world:anyone:rwcda>>> |
 *--------------------------------------+--------------------------------------+
 
-  * Configure the max number of application attempt retries.
+** Configurations for LevelDB based state-store implementation.
+
+*--------------------------------------+--------------------------------------+
+|| Property                            || Description                        |
+*--------------------------------------+--------------------------------------+
+| <<<yarn.resourcemanager.leveldb-state-store.path>>> | |
+| | Local path where the RM state will be stored. |
+| | Default value is <<<${hadoop.tmp.dir}/yarn/system/rmstore>>> |
+*--------------------------------------+--------------------------------------+
+
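+    A minimal illustrative override of the LevelDB state-store path, using the
+    property from the table above; the path value is an example chosen for this
+    sketch, not a recommendation (if unset, the default shown above is used):
+
++---+
+  <property>
+    <!-- Example local path; default is ${hadoop.tmp.dir}/yarn/system/rmstore -->
+    <name>yarn.resourcemanager.leveldb-state-store.path</name>
+    <value>/var/lib/hadoop-yarn/rmstore</value>
+  </property>
++---+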
+
+** Configurations for work-preserving RM recovery.
 
 *--------------------------------------+--------------------------------------+
 || Property                            || Description                        |
 *--------------------------------------+--------------------------------------+
-| <<<yarn.resourcemanager.am.max-attempts>>> | |
-| | The maximum number of application attempts. It's a global |
-| | setting for all application masters. Each application master can specify |
-| | its individual maximum number of application attempts via the API, but the |
-| | individual number cannot be more than the global upper bound. If it is, |
-| | the RM will override it. The default number is set to 2, to |
-| | allow at least one retry for AM. |
-*--------------------------------------+--------------------------------------+
-
-    This configuration's impact is in fact beyond RM restart scope. It controls
-    the max number of attempts an application can have. In RM Restart Phase 1,
-    this configuration is needed since as described earlier each time RM restarts,
-    it kills the previously running attempt (i.e. ApplicationMaster) and
-    creates a new attempt. Therefore, each occurrence of RM restart causes the
-    attempt count to increase by 1. In RM Restart phase 2, this configuration is not
-    needed since the previously running ApplicationMaster will
-    not be killed and the AM will just re-sync back with RM after RM restarts.
+| <<<yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms>>> | |
+| | Set the amount of time RM waits before allocating new |
+| | containers on RM work-preserving recovery. Such a wait period gives RM a chance |
+| | to settle down resyncing with NMs in the cluster on recovery, before assigning |
+| | new containers to applications. |
+*--------------------------------------+--------------------------------------+
+
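+    For example, to have RM wait roughly 10 seconds after a work-preserving
+    recovery before it starts assigning new containers, the property from the
+    table above could be set as below (the 10000 ms value is an illustrative
+    choice, not a documented default):
+
++---+
+  <property>
+    <!-- Illustrative value: wait 10s after recovery before scheduling new containers -->
+    <name>yarn.resourcemanager.work-preserving-recovery.scheduling-wait-ms</name>
+    <value>10000</value>
+  </property>
++---+
+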
+* {Notes}
+
+  The ContainerId string format is changed if RM restarts with work-preserving
+  recovery enabled. It used to be in the format:
+
+   Container_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g. Container_1410901177871_0001_01_000005.
+
+  It is now changed to:
+
+   Container_<<e\{epoch\}>>_\{clusterTimestamp\}_\{appId\}_\{attemptId\}_\{containerId\}, e.g. Container_<<e17>>_1410901177871_0001_01_000005.
+ 
+  Here, the additional epoch number is a monotonically increasing integer which
+  starts from 0 and is increased by 1 each time RM restarts. If the epoch number
+  is 0, it is omitted and the containerId string format stays the same as before.
+
+* {Sample configurations}
+
+   Below is a minimum set of configurations for enabling RM work-preserving restart using the ZooKeeper based state-store.
+
++---+
+  <property>
+    <description>Enable RM to recover state after starting. If true, then 
+    yarn.resourcemanager.store.class must be specified</description>
+    <name>yarn.resourcemanager.recovery.enabled</name>
+    <value>true</value>
+  </property>
+
+  <property>
+    <description>The class to use as the persistent store.</description>
+    <name>yarn.resourcemanager.store.class</name>
+    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
+  </property>
+
+  <property>
+    <description>Comma separated list of Host:Port pairs. Each corresponds to a ZooKeeper server
+    (e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002") to be used by the RM for storing RM state.
+    This must be supplied when using org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
+    as the value for yarn.resourcemanager.store.class</description>
+    <name>yarn.resourcemanager.zk-address</name>
+    <value>127.0.0.1:2181</value>
+  </property>
++---+
\ No newline at end of file
