[ https://issues.apache.org/jira/browse/FLINK-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359090#comment-15359090 ]
ASF GitHub Bot commented on FLINK-4141:
---------------------------------------

GitHub user mxm opened a pull request:

    https://github.com/apache/flink/pull/2190

    [FLINK-4141] remove leaderUpdated() method from ResourceManager

    This removes the leaderUpdated method from the framework. Further, it
    lets the RM client thread communicate directly with the ResourceManager
    actor. This is fine since the two are always spawned together. Failures
    of the ResourceManager actor will lead to dropped messages from the RM
    client thread. Failures of the RM client thread are reported to the
    JobManager.

    The leaderUpdated() method was used to signal the ResourceManager
    framework that a new leader was elected. However, the method was not
    called on every leader change, only when a new leader was elected. As a
    result, all messages from the async Yarn RM client thread
    (YarnResourceManagerCallbackHandler) were dropped for the period after
    the old leader had failed and before a new leader had been elected. The
    Yarn RM client thread used leader-tagged messages to communicate with
    the main Flink ResourceManager actor.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mxm/flink FLINK-4141

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2190.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2190

----
commit c758121b9e5e2d7de8318bd529aa5da88ed424c6
Author: Maximilian Michels <m...@apache.org>
Date:   2016-07-01T14:27:18Z

    [FLINK-4141] remove leaderUpdated() method from ResourceManager

    This removes the leaderUpdated method from the framework. Further, it
    lets the RM client thread communicate directly with the ResourceManager
    actor. This is fine since the two are always spawned together. Failures
    of the ResourceManager actor will lead to dropped messages from the RM
    client thread. Failures of the RM client thread are reported to the
    JobManager.

    The leaderUpdated() method was used to signal the ResourceManager
    framework that a new leader was elected. However, the method was not
    called on every leader change, only when a new leader was elected. As a
    result, all messages from the async Yarn RM client thread
    (YarnResourceManagerCallbackHandler) were dropped for the period after
    the old leader had failed and before a new leader had been elected. The
    Yarn RM client thread used leader-tagged messages to communicate with
    the main Flink ResourceManager actor.

----

> TaskManager failures not always recover when killed during an ApplicationMaster failure in HA mode on Yarn
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-4141
>                 URL: https://issues.apache.org/jira/browse/FLINK-4141
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.0.3
>            Reporter: Stefan Richter
>
> High availability on Yarn often fails to recover in the following test
> scenario:
> 1. Kill application master process.
> 2. Then, while application master is recovering, randomly kill several task
> managers (with some delay).
> After the application master has recovered, not all the killed task managers
> are brought back and no further attempts are made to restart them.
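
For readers unfamiliar with leader-tagged messaging, the failure mode described in the pull request can be illustrated with a small, simplified model. This is only a sketch and not Flink's actual code: the class `LeaderTagSketch` and its methods `setLeaderSessionId`, `receiveLeaderTagged`, and `receiveDirect` are hypothetical names, and it assumes that a tagged message is accepted only when its tag equals the currently known leader session ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/**
 * Hypothetical, simplified stand-in for the ResourceManager actor.
 * It is not Flink's implementation; it only models the message filtering.
 */
final class LeaderTagSketch {
    // Assumption: null means "old leader failed, no new leader elected yet".
    private UUID leaderSessionId;
    private final List<String> handled = new ArrayList<>();

    void setLeaderSessionId(UUID id) {
        leaderSessionId = id;
    }

    /** Old path: a tagged message is dropped unless the tag matches the current leader. */
    boolean receiveLeaderTagged(UUID tag, String message) {
        if (leaderSessionId == null || !leaderSessionId.equals(tag)) {
            return false; // dropped, e.g. a container-completed notification is lost
        }
        handled.add(message);
        return true;
    }

    /** New path: the co-located client thread sends directly, without a leader tag. */
    void receiveDirect(String message) {
        handled.add(message);
    }

    List<String> handled() {
        return handled;
    }

    public static void main(String[] args) {
        LeaderTagSketch rm = new LeaderTagSketch();
        UUID oldLeader = UUID.randomUUID();

        rm.setLeaderSessionId(oldLeader);
        rm.receiveLeaderTagged(oldLeader, "container allocated");

        // The leader is lost and no successor has been elected yet.
        rm.setLeaderSessionId(null);
        boolean accepted = rm.receiveLeaderTagged(oldLeader, "container completed (tagged)");
        rm.receiveDirect("container completed (direct)");

        System.out.println("tagged message accepted during gap: " + accepted);
        System.out.println("handled: " + rm.handled());
    }
}
```

In this model the tagged notification sent during the leaderless gap never reaches the handler, while the direct message does, which mirrors why the pull request lets the RM client thread talk to the ResourceManager actor without leader tagging.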