[ https://issues.apache.org/jira/browse/MESOS-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neil Conway reassigned MESOS-6483:
----------------------------------

    Assignee: Neil Conway

> Check failure when a 1.1 master marking a 0.28 agent as unreachable
> -------------------------------------------------------------------
>
>                 Key: MESOS-6483
>                 URL: https://issues.apache.org/jira/browse/MESOS-6483
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Megha
>            Assignee: Neil Conway
>
> When upgrading directly from Mesos 0.28 to a version newer than 1.0, there
> is a scenario that can cause the
> CHECK(frameworks.recovered.contains(frameworkId)) in
> Master::_markUnreachable(..) to fail. The following sequence of events can
> happen:
> 1) The master is upgraded first to the new version, while the agent, let's
> say X, is still at Mesos 0.28.
> 2) Agent X (at Mesos 0.28) attempts to re-register with the master (at,
> let's say, 1.1) and does not send the frameworks (frameworkInfos) in the
> ReRegisterSlave message, since that field was not available in the older
> Mesos version.
> 3) Among the frameworks on agent X is a framework Y that didn’t re-register
> after the master’s failover. Since the master builds frameworks.recovered
> from the frameworkInfos that agents provide, framework Y is in neither the
> recovered nor the registered frameworks.
> 4) After re-registering, agent X fails the master’s health check and is
> marked unreachable by the master. The
> CHECK(frameworks.recovered.contains(frameworkId)) then fails for framework
> Y, since it is in neither recovered nor registered but has tasks running on
> agent X.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
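
The sequence of events in the quoted description can be sketched with a small, self-contained model. This is not the Mesos source. The types Frameworks, ReregisterMessage and Master, and the functions reregisterAgent, markUnreachableInvariantHolds and scenario, are hypothetical stand-ins invented for illustration, and the real master tracks far more state. The sketch only assumes what the description states: frameworks.recovered is built from the frameworkInfos that agents send, and marking an agent unreachable expects every framework with tasks on that agent to be either registered or recovered.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, heavily simplified model of the master's framework bookkeeping.
struct Frameworks {
  std::set<std::string> registered;  // frameworks that re-registered after failover
  std::set<std::string> recovered;   // frameworks learned from agents' frameworkInfos
};

// Hypothetical stand-in for the re-registration message sent by an agent.
struct ReregisterMessage {
  std::vector<std::string> taskFrameworkIds;  // framework ID of each running task
  std::vector<std::string> frameworkInfos;    // left empty by a 0.28 agent
};

struct Master {
  Frameworks frameworks;
  // Agent ID -> framework IDs of the tasks running on that agent.
  std::map<std::string, std::vector<std::string>> agentTasks;

  // Step 2: record the agent's tasks and recover any frameworks it reports.
  void reregisterAgent(const std::string& agentId, const ReregisterMessage& m) {
    agentTasks[agentId] = m.taskFrameworkIds;
    for (const std::string& id : m.frameworkInfos) {
      if (frameworks.registered.count(id) == 0) {
        frameworks.recovered.insert(id);
      }
    }
  }

  // Step 4: returns false where the CHECK in the description would fail,
  // i.e. a task's framework is neither registered nor recovered.
  bool markUnreachableInvariantHolds(const std::string& agentId) const {
    auto it = agentTasks.find(agentId);
    if (it == agentTasks.end()) {
      return true;  // unknown agent: nothing to check in this model
    }
    for (const std::string& frameworkId : it->second) {
      const bool known = frameworks.registered.count(frameworkId) > 0 ||
                         frameworks.recovered.count(frameworkId) > 0;
      if (!known) {
        return false;
      }
    }
    return true;
  }
};

// Runs steps 1-4 for agent "X" with a task of framework "Y" that never
// re-registered. The flag models whether the agent is new enough to
// include frameworkInfos in its re-registration message.
bool scenario(bool agentSendsFrameworkInfos) {
  Master master;  // freshly failed-over master: no registered frameworks
  ReregisterMessage message;
  message.taskFrameworkIds = {"Y"};
  if (agentSendsFrameworkInfos) {
    message.frameworkInfos = {"Y"};
  }
  master.reregisterAgent("X", message);
  return master.markUnreachableInvariantHolds("X");
}
```

In this model, scenario(false) corresponds to the 0.28 agent in the report and returns false, which is the point at which the real CHECK would abort the master. scenario(true) corresponds to an agent that sends frameworkInfos and returns true.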