GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/1213

    [FLINK-2790] [yarn] [ha] Adds high availability support for Yarn

    Adds high availability support for Yarn by exploiting Yarn's functionality 
to restart a failed application master. Depending on the Hadoop version the 
behaviour is an increasing superset of functionalities of the preceding 
version's behaviour
    
    ###2.3.0 <= version < 2.4.0
    
    * Set the number of application attempts to the configuration value 
`yarn.application-attempts`. This means that the application can be restarted 
`yarn.application-attempts` time before yarn fails the application. In case of 
an application master, all other task manager containers will also be killed.
    
    ### 2.4.0 <= version < 2.6.0
    
    * Additionally, enables that containers will be kept across application 
attempts. This avoids the killing of TaskManager containers in the case of an 
application master failure. This has the advantage that the startup time is 
faster and that the user does not have to wait for obtaining the container 
resources again.
    
    ### 2.6.0 <= version
    
    * Sets the attempts failure validity interval to the akka timeout value. 
The attempts failure validity interval says that an application is only killed 
after the system has seen the maximum number of application attempts during one 
interval. This avoids that a long lasting job will deplete it's application 
attempts.
    
    This PR also refactors the different Yarn components to allow the start-up 
of testing actors within Yarn. Furthermore, the `JobManager` start up logic is 
slightly extended to allow code reuse in the `ApplicationMasterBase`.
    
    The HA functionality is tested via the `YARNHighAvailabilityITCase` where 
an application master is multiple times killed. Each time it's checked that the 
single TaskManager successfully reconnects to the newly started 
`YarnJobManager`. In case of version `2.3.0`, the `TaskManager` is restarted.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink yarnHA

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1213.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1213
    
----
commit 1a18172ae69eb576638704f8e143a921aa8630d5
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2015-09-01T14:35:48Z

    [FLINK-2790] [yarn] [ha] Adds high availability support for Yarn

commit 5359676556d16610303d4f36fcbe5320ef4e6643
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2015-09-23T15:42:57Z

    Refactors JobManager's start actors method to be reusable

commit d6a47cd8ad265c5122d1a67c09773cbc5a491261
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2015-09-24T12:55:18Z

    Yarn refactoring to introduce yarn testing functionality

commit f9578f136dd41cd9829d712f7c62a59c9ea8e194
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2015-09-28T16:21:30Z

    Added support for testing yarn cluster. Extracted JobManager's and 
TaskManager's testing messages into stackable traits.

commit dbfa16438ad9d7d61e8d1a582c8cd1de9352078e
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2015-09-29T15:05:01Z

    Implemented YarnHighAvailabilityITCase using Akka messages for 
synchronization.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

Reply via email to