[
https://issues.apache.org/jira/browse/FLINK-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375296#comment-15375296
]
Till Rohrmann commented on FLINK-4152:
--------------------------------------
[~mxm] The restarted registration attempts are only the observable symptom of
a different problem.
The actual problem is that the {{YarnFlinkResourceManager}} forgets about the
registered task managers if the job manager loses its leadership. Each task
manager has a resource ID with which it registers at the resource manager. The
{{YarnFlinkResourceManager}} has two states for allocated resources:
{{containersInLaunch}} and {{registeredWorkers}}. A container can only go from
{{containersInLaunch}} to {{registeredWorkers}}. This also works for the
initial registration. However, when the job manager loses its leadership and
the {{registeredWorkers}} list is cleared, there is no longer a container in
launch associated with the respective resource ID. Consequently, when the old
task manager is being re-registered by the new leader, the registration is
rejected.
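
The state handling described above can be illustrated with a small model. This is only a sketch and not the actual {{YarnFlinkResourceManager}} code: the class name, the method names other than {{jobManagerLostLeadership}}, and the use of plain string IDs are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of the two resource states.
// It is NOT the real YarnFlinkResourceManager implementation.
class ResourceManagerModel {
    // resource ID -> container name (both maps are illustrative)
    private final Map<String, String> containersInLaunch = new HashMap<>();
    private final Map<String, String> registeredWorkers = new HashMap<>();

    // Hypothetical hook: a container was allocated and is being started.
    void containerAllocated(String resourceId, String container) {
        containersInLaunch.put(resourceId, container);
    }

    // The only transition: containersInLaunch -> registeredWorkers.
    boolean registerTaskManager(String resourceId) {
        String container = containersInLaunch.remove(resourceId);
        if (container == null) {
            // No container in launch for this resource ID: reject.
            return false;
        }
        registeredWorkers.put(resourceId, container);
        return true;
    }

    // Models the behaviour described in the comment: workers are forgotten.
    void jobManagerLostLeadership() {
        registeredWorkers.clear();
    }
}
```

In this model, a task manager that registered successfully is rejected when it tries to register again after {{jobManagerLostLeadership}} has run, because its resource ID is in neither map.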
This rejection is then sent to the task manager. Upon receiving a rejection,
the task manager reschedules another registration attempt after waiting for
some time. The problem here is that the old registration attempts are not
cancelled. Consequently, multiple registration attempts run concurrently.
That is why you observe so many registration attempt messages in the log.
I think the symptom can be fixed by cancelling all currently active
registration attempts when you want to restart the registration.
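
A minimal sketch of that idea, assuming a plain {{ScheduledExecutorService}} instead of the task manager's actual scheduling mechanism. The class {{RegistrationRetry}} and its methods are hypothetical names introduced for illustration only.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: keeps at most one scheduled registration attempt alive.
class RegistrationRetry {
    private final ScheduledExecutorService scheduler;
    private final Runnable attempt;
    private ScheduledFuture<?> pending; // the currently scheduled attempt, if any

    RegistrationRetry(ScheduledExecutorService scheduler, Runnable attempt) {
        this.scheduler = scheduler;
        this.attempt = attempt;
    }

    // Cancel the outstanding attempt before scheduling a new one, so that
    // restarts never pile up into several concurrent retry chains.
    synchronized ScheduledFuture<?> restart(long delayMillis) {
        if (pending != null) {
            pending.cancel(false);
        }
        pending = scheduler.schedule(attempt, delayMillis, TimeUnit.MILLISECONDS);
        return pending;
    }

    // Stop retrying altogether, for example after a successful registration.
    synchronized void stop() {
        if (pending != null) {
            pending.cancel(false);
            pending = null;
        }
    }
}
```

Calling {{restart}} twice in a row leaves only the second attempt scheduled; the first one is cancelled and never fires if it had not started yet.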
It is a bit unclear to me what the expected behaviour of the
{{YarnFlinkResourceManager}} should be. In the {{jobManagerLostLeadership}} method
where the {{registeredWorkers}} list is cleared, a comment says "all currently
registered TaskManagers are put under "awaiting registration"". But there is no
such state. Furthermore, I'm not sure whether registered TaskManagers have to
re-register if only the job manager has failed.
Thus, I see two solutions: either do not clear {{registeredWorkers}}, or
introduce a new state "awaiting registration" that holds all formerly
registered task managers so that they can be re-registered.
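
The second option could look roughly like the following sketch. It is a simplified model under assumptions, not a patch against the real resource manager: the class name, the {{awaitingRegistration}} map, and all method names other than {{jobManagerLostLeadership}} are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model with a third state for formerly registered workers.
class ResourceManagerWithAwaitingState {
    // resource ID -> container name (illustrative types)
    private final Map<String, String> containersInLaunch = new HashMap<>();
    private final Map<String, String> awaitingRegistration = new HashMap<>();
    private final Map<String, String> registeredWorkers = new HashMap<>();

    void containerAllocated(String resourceId, String container) {
        containersInLaunch.put(resourceId, container);
    }

    // Accept a registration from a fresh container or from a worker
    // that was registered before the leadership change.
    boolean registerTaskManager(String resourceId) {
        String container = containersInLaunch.remove(resourceId);
        if (container == null) {
            container = awaitingRegistration.remove(resourceId);
        }
        if (container == null) {
            return false; // unknown resource ID: reject
        }
        registeredWorkers.put(resourceId, container);
        return true;
    }

    // Instead of forgetting the workers, park them until they re-register.
    void jobManagerLostLeadership() {
        awaitingRegistration.putAll(registeredWorkers);
        registeredWorkers.clear();
    }
}
```

With this variant, a previously registered task manager is accepted again after a leadership change, while an unknown resource ID is still rejected.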
Maybe [~mxm] can give some input.
> TaskManager registration exponential backoff doesn't work
> ---------------------------------------------------------
>
> Key: FLINK-4152
> URL: https://issues.apache.org/jira/browse/FLINK-4152
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination, TaskManager, YARN Client
> Reporter: Robert Metzger
> Assignee: Till Rohrmann
> Attachments: logs.tgz
>
>
> While testing Flink 1.1 I've found that the TaskManagers are logging many
> messages when registering at the JobManager.
> This is the log file:
> https://gist.github.com/rmetzger/0cebe0419cdef4507b1e8a42e33ef294
> It's logging more than 3000 messages in less than a minute. I don't think that
> this is the expected behavior.