[
https://issues.apache.org/jira/browse/FLINK-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385711#comment-15385711
]
ASF GitHub Bot commented on FLINK-4152:
---------------------------------------
Github user tillrohrmann commented on the issue:
https://github.com/apache/flink/pull/2257
Hi @mxm, I've changed the implementation such that we no longer need the
`containersLaunched` map in the `YarnFlinkResourceManager`. Instead, we no longer
clear the `registeredWorkers` map in the `FlinkResourceManager` when the
`JobManager` loses leadership. Thus, the `registeredWorkers` field denotes the
successfully started task managers (and the containers they are running in).
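
To illustrate the bookkeeping described above, here is a minimal sketch, not the actual Flink code. The class `WorkerRegistrySketch`, its method names, and the use of plain strings for resource and container IDs are assumptions made for illustration; only the `registeredWorkers` name comes from the comment.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the resource manager's worker bookkeeping. */
class WorkerRegistrySketch {
    // Maps a worker's resource ID to the container it runs in (assumed layout).
    private final Map<String, String> registeredWorkers = new HashMap<>();

    // Address of the current leader; null while no JobManager holds leadership.
    private String leaderAddress;

    void onNewLeader(String address) {
        leaderAddress = address;
    }

    void registerWorker(String resourceId, String containerId) {
        registeredWorkers.put(resourceId, containerId);
    }

    void onLeadershipLost() {
        // Only the leader reference is dropped. The map is deliberately NOT
        // cleared, so it keeps describing the task managers (and their
        // containers) that were started successfully.
        leaderAddress = null;
    }

    Map<String, String> getRegisteredWorkers() {
        return Collections.unmodifiableMap(registeredWorkers);
    }

    String getLeaderAddress() {
        return leaderAddress;
    }
}
```

With this shape, a separate map of launched containers would be redundant, because the surviving entries already record which containers host running task managers.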
Additionally, I reintroduced the reconnect resource manager functionality in
the job manager. This should make sure that the resource manager is eventually
notified about newly registered resources. In the current implementation,
however, the resource manager will always accept the register resource
messages. So the reconnect resource manager message is sent only if a register
resource message gets lost and thus triggers a timeout exception.
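
The reconnect behaviour described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the pull request: `ResourceManagerGatewaySketch`, `JobManagerSketch`, and their methods are hypothetical names, and a synchronous call that throws `TimeoutException` stands in for the asynchronous message exchange.

```java
import java.util.concurrent.TimeoutException;

/** Hypothetical view of the resource manager as seen from the job manager. */
interface ResourceManagerGatewaySketch {
    // Throws TimeoutException when no acknowledgement arrives in time.
    void registerResource(String resourceId) throws TimeoutException;
}

/** Hypothetical job manager side of the exchange. */
class JobManagerSketch {
    private final ResourceManagerGatewaySketch gateway;
    private int reconnectsTriggered = 0;

    JobManagerSketch(ResourceManagerGatewaySketch gateway) {
        this.gateway = gateway;
    }

    /** Returns true if the resource manager acknowledged the registration. */
    boolean notifyResourceRegistered(String resourceId) {
        try {
            // In the behaviour described, the resource manager always accepts
            // this message, so the normal path ends here.
            gateway.registerResource(resourceId);
            return true;
        } catch (TimeoutException e) {
            // The message was lost: this counter stands in for sending the
            // reconnect resource manager message.
            reconnectsTriggered++;
            return false;
        }
    }

    int getReconnectsTriggered() {
        return reconnectsTriggered;
    }
}
```

In this sketch the timeout is the only trigger for a reconnect, which mirrors the limitation noted above.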
Would be great if you could take another look at the changes.
> TaskManager registration exponential backoff doesn't work
> ---------------------------------------------------------
>
> Key: FLINK-4152
> URL: https://issues.apache.org/jira/browse/FLINK-4152
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination, TaskManager, YARN Client
> Reporter: Robert Metzger
> Assignee: Till Rohrmann
> Attachments: logs.tgz
>
>
> While testing Flink 1.1 I've found that the TaskManagers are logging many
> messages when registering at the JobManager.
> This is the log file:
> https://gist.github.com/rmetzger/0cebe0419cdef4507b1e8a42e33ef294
> It's logging more than 3000 messages in less than a minute. I don't think that
> this is the expected behavior.