[ https://issues.apache.org/jira/browse/MESOS-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839811#comment-17839811 ]
Benjamin Mahler commented on MESOS-7187:
----------------------------------------
Added a mitigation for the bug I commented on above:
https://github.com/apache/mesos/pull/558
It does not fix the overall issue here, because there is still no connection
construct, but it prevents the agent from getting stuck sending TASK_DROPPED
for all incoming tasks.
> Master can neglect to update agent metadata in a re-registration corner case.
> -----------------------------------------------------------------------------
>
> Key: MESOS-7187
> URL: https://issues.apache.org/jira/browse/MESOS-7187
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Mahler
> Priority: Major
> Labels: tech-debt
>
> If the agent is re-registering with the master for the first time, the master
> will drop any re-registration messages that arrive while the registry
> operation is in progress.
> These dropped messages can carry different metadata (e.g. version,
> capabilities), which is lost along with them. Since the master doesn't
> distinguish between different instances of the agent (both share the same
> UPID and there is no instance-identifying information), the master can't tell
> whether a message is a retry from the original instance of the agent or a
> re-registration from a new instance of the agent.
> The following is an example:
> (1) Master restarts.
> (2) Agent re-registers with OLD_VERSION / OLD_CAPABILITIES.
> (3) While registry operation is in progress, agent is upgraded and
> re-registers with NEW_VERSION / NEW_CAPABILITIES.
> (4) Registry operation completes; the new agent receives the re-registration
> acknowledgement message and so does not retry.
> (5) Now the master's memory reflects OLD_VERSION / OLD_CAPABILITIES for the
> agent, and it remains inconsistent until a later re-registration occurs.
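
The five steps above can be replayed with a small toy model. This is only a
sketch of the race as described in the issue, not the Mesos implementation:
the Master class, its method names and the "agent-1" identifier are all
hypothetical, and the real master is C++ code built on libprocess.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Toy model of the master's bookkeeping (hypothetical, not Mesos code)."""
    agents: dict = field(default_factory=dict)         # agent_id -> (version, capabilities)
    reregistering: dict = field(default_factory=dict)  # agent_id -> metadata being admitted

    def reregister(self, agent_id, version, capabilities):
        if agent_id in self.reregistering:
            # A registry operation is already in flight for this agent, so the
            # message is dropped together with whatever metadata it carried.
            return False
        self.reregistering[agent_id] = (version, frozenset(capabilities))
        return True

    def registry_operation_done(self, agent_id):
        # Only the metadata from the first message is ever stored.
        self.agents[agent_id] = self.reregistering.pop(agent_id)


master = Master()  # (1) master restarts with empty in-memory state
master.reregister("agent-1", "OLD_VERSION", {"OLD_CAPABILITIES"})  # (2)
accepted = master.reregister("agent-1", "NEW_VERSION", {"NEW_CAPABILITIES"})  # (3) dropped
master.registry_operation_done("agent-1")  # (4) the ack reaches the upgraded agent
stale_version, stale_capabilities = master.agents["agent-1"]  # (5) still the OLD_* values
```

Because both messages carry the same agent identity and nothing distinguishes
the two instances, the toy master has no basis for preferring the second
message over the first.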