GitHub user tillrohrmann opened a pull request: https://github.com/apache/flink/pull/1213
[FLINK-2790] [yarn] [ha] Adds high availability support for Yarn Adds high availability support for Yarn by exploiting Yarn's functionality to restart a failed application master. Depending on the Hadoop version the behaviour is an increasing superset of functionalities of the preceding version's behaviour ###2.3.0 <= version < 2.4.0 * Set the number of application attempts to the configuration value `yarn.application-attempts`. This means that the application can be restarted `yarn.application-attempts` time before yarn fails the application. In case of an application master, all other task manager containers will also be killed. ### 2.4.0 <= version < 2.6.0 * Additionally, enables that containers will be kept across application attempts. This avoids the killing of TaskManager containers in the case of an application master failure. This has the advantage that the startup time is faster and that the user does not have to wait for obtaining the container resources again. ### 2.6.0 <= version * Sets the attempts failure validity interval to the akka timeout value. The attempts failure validity interval says that an application is only killed after the system has seen the maximum number of application attempts during one interval. This avoids that a long lasting job will deplete it's application attempts. This PR also refactors the different Yarn components to allow the start-up of testing actors within Yarn. Furthermore, the `JobManager` start up logic is slightly extended to allow code reuse in the `ApplicationMasterBase`. The HA functionality is tested via the `YARNHighAvailabilityITCase` where an application master is multiple times killed. Each time it's checked that the single TaskManager successfully reconnects to the newly started `YarnJobManager`. In case of version `2.3.0`, the `TaskManager` is restarted. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tillrohrmann/flink yarnHA Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/1213.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1213 ---- commit 1a18172ae69eb576638704f8e143a921aa8630d5 Author: Till Rohrmann <trohrm...@apache.org> Date: 2015-09-01T14:35:48Z [FLINK-2790] [yarn] [ha] Adds high availability support for Yarn commit 5359676556d16610303d4f36fcbe5320ef4e6643 Author: Till Rohrmann <trohrm...@apache.org> Date: 2015-09-23T15:42:57Z Refactors JobManager's start actors method to be reusable commit d6a47cd8ad265c5122d1a67c09773cbc5a491261 Author: Till Rohrmann <trohrm...@apache.org> Date: 2015-09-24T12:55:18Z Yarn refactoring to introduce yarn testing functionality commit f9578f136dd41cd9829d712f7c62a59c9ea8e194 Author: Till Rohrmann <trohrm...@apache.org> Date: 2015-09-28T16:21:30Z Added support for testing yarn cluster. Extracted JobManager's and TaskManager's testing messages into stackable traits. commit dbfa16438ad9d7d61e8d1a582c8cd1de9352078e Author: Till Rohrmann <trohrm...@apache.org> Date: 2015-09-29T15:05:01Z Implemented YarnHighAvailabilityITCase using Akka messages for synchronization. ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---