[jira] [Commented] (FLINK-4141) TaskManager failures not always recover when killed during an ApplicationMaster failure in HA mode on Yarn

ASF GitHub Bot (JIRA) Fri, 01 Jul 2016 08:00:37 -0700

    [ https://issues.apache.org/jira/browse/FLINK-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359090#comment-15359090 ]


ASF GitHub Bot commented on FLINK-4141:
---------------------------------------

GitHub user mxm opened a pull request:

    https://github.com/apache/flink/pull/2190

    [FLINK-4141] remove leaderUpdated() method from ResourceManager

    This removes the leaderUpdated() method from the ResourceManager
    framework. Furthermore, it lets the RM client thread communicate
    directly with the ResourceManager actor. This is safe because the two
    are always spawned together. If the ResourceManager actor fails,
    messages from the RM client thread are dropped. If the RM client
    thread fails, the JobManager is informed.
    
    The leaderUpdated() method was used to signal the ResourceManager
    framework that a new leader had been elected. However, the method was
    not called on every leader change, only when a new leader was
    elected. Because the Yarn RM client thread used leader-tagged messages
    to communicate with the main Flink ResourceManager actor, all messages
    from the async Yarn RM client thread
    (YarnResourceManagerCallbackHandler) were dropped in the window
    between the failure of the old leader and the election of a new one.
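
The difference between the two delivery paths can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Flink's actual code: the class ToyResourceManager and the methods newLeaderElected, leaderLost, receiveTagged and receiveDirect are hypothetical names invented for illustration, and the exact place where Flink compared leader tags may differ. The sketch assumes a leader-tagged message is discarded whenever its tag does not match a currently known leader, while a direct message skips that check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Toy model of leader-tagged vs. direct delivery; hypothetical, not Flink's classes. */
public class LeaderTaggedDemo {

    /** Hypothetical stand-in for the ResourceManager actor. */
    static class ToyResourceManager {
        private UUID leaderSessionId;                 // null while no leader is known
        final List<String> handled = new ArrayList<>();

        /** A new leader was elected (the only event the old hook reported). */
        void newLeaderElected(UUID id) { leaderSessionId = id; }

        /** The old leader failed and no new leader has been elected yet. */
        void leaderLost() { leaderSessionId = null; }

        /** Assumed old path: a tagged message is dropped unless the tag matches. */
        boolean receiveTagged(UUID tag, String msg) {
            if (leaderSessionId == null || !leaderSessionId.equals(tag)) {
                return false;                         // silently dropped
            }
            handled.add(msg);
            return true;
        }

        /** Assumed new path: the client thread addresses the actor directly. */
        boolean receiveDirect(String msg) {
            handled.add(msg);
            return true;
        }
    }

    public static void main(String[] args) {
        ToyResourceManager rm = new ToyResourceManager();
        UUID oldLeader = UUID.randomUUID();
        rm.newLeaderElected(oldLeader);

        // Window between the old leader's failure and the next election.
        rm.leaderLost();
        boolean tagged = rm.receiveTagged(oldLeader, "containerCompleted");
        boolean direct = rm.receiveDirect("containerCompleted");

        System.out.println("tagged delivered: " + tagged);
        System.out.println("direct delivered: " + direct);
    }
}
```

In this model, a container-completed notification sent during the leaderless window is lost on the tagged path and handled on the direct path, which corresponds to the killed task managers never being restarted.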

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mxm/flink FLINK-4141

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2190.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2190
    
----
commit c758121b9e5e2d7de8318bd529aa5da88ed424c6
Author: Maximilian Michels <[email protected]>
Date:   2016-07-01T14:27:18Z

    [FLINK-4141] remove leaderUpdated() method from ResourceManager
    

----


> TaskManager failures not always recover when killed during an 
> ApplicationMaster failure in HA mode on Yarn
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-4141
>                 URL: https://issues.apache.org/jira/browse/FLINK-4141
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.0.3
>            Reporter: Stefan Richter
>
> High availability on Yarn often fails to recover in the following test 
> scenario:
> 1. Kill application master process.
> 2. Then, while application master is recovering, randomly kill several task 
> managers (with some delay).
> After the application master has recovered, not all of the killed task managers 
> are brought back, and no further attempts are made to restart them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
