[ https://issues.apache.org/jira/browse/MESOS-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Park updated MESOS-7487:
--------------------------------
    Description: 
Before 1.3.0, the master did not send a {{FrameworkInfo}} in the {{UpdateFrameworkMessage}}. In general, this means that a pre-1.3.0 agent will not have the {{FrameworkInfo}} updated when a framework changes its {{FrameworkInfo}}. Specifically, if a framework upgrades to add the {{PARTITION_AWARE}} capability, the 1.1.x and 1.2.x agents will not be aware of the update and will incorrectly report {{TASK_LOST}} in some cases.

Note that the run task path is okay since the master sends the new {{FrameworkInfo}}. The instances that are incorrect have the following check:

{code}
if (!protobuf::frameworkHasCapability(
    framework->info, // This is the one in agent memory!
    FrameworkInfo::Capability::PARTITION_AWARE))
{code}

One solution is to backport the changes to {{UpdateFrameworkMessage}} to 1.1.x and 1.2.x, but only update the capabilities portion of the {{FrameworkInfo}}.

If we update the entire {{FrameworkInfo}}, a 1.1.x agent will run into an issue where it doesn't know how to deal with changes to {{FrameworkInfo.roles}}. Frameworks changing their roles is a 1.3.x feature. Note that a 1.2.x agent can handle the role changes correctly because of {{Resource.allocation_info}}, which was introduced with multi-role support in 1.2.x.

Refer to MESOS-7460 for the potential issue with backporting to 1.1.x.


> A framework upgrading into PARTITION_AWARE capability will continue to receive TASK_LOST on old agents.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-7487
>                 URL: https://issues.apache.org/jira/browse/MESOS-7487
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: Michael Park
>
> Before 1.3.0, the master did not send a {{FrameworkInfo}} in the
> {{UpdateFrameworkMessage}}.
> In general, this means that a pre-1.3.0 agent will not have the
> {{FrameworkInfo}} updated when a framework changes its {{FrameworkInfo}}.
> Specifically, if a framework upgrades to add the {{PARTITION_AWARE}}
> capability, the 1.1.x and 1.2.x agents will not be aware of the update
> and will incorrectly report {{TASK_LOST}} in some cases.
> Note that the run task path is okay since the master sends the new
> {{FrameworkInfo}}.
> The instances that are incorrect have the following check:
> {code}
> if (!protobuf::frameworkHasCapability(
>     framework->info, // This is the one in agent memory!
>     FrameworkInfo::Capability::PARTITION_AWARE))
> {code}
> One solution is to backport the changes to {{UpdateFrameworkMessage}} to
> 1.1.x and 1.2.x, but only update the capabilities portion of the
> {{FrameworkInfo}}.
> If we update the entire {{FrameworkInfo}}, a 1.1.x agent will run into an
> issue where it doesn't know how to deal with changes to
> {{FrameworkInfo.roles}}. Frameworks changing their roles is a 1.3.x feature.
> Note that a 1.2.x agent can handle the role changes correctly because of
> {{Resource.allocation_info}}, which was introduced with multi-role support
> in 1.2.x.
> Refer to MESOS-7460 for the potential issue with backporting to 1.1.x.
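
For illustration only, the capabilities-only update proposed in the description might look roughly like the sketch below. It does not use the real Mesos types: {{FrameworkInfo}} and {{Capability}} here are simplified stand-in structs, and {{updateCapabilitiesOnly}} and {{exampleUpgrade}} are hypothetical names, not functions from the Mesos code base. The point is just that the agent's in-memory copy picks up the new capabilities while {{roles}} stays as the agent last saw it.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Mesos protobuf types. The real
// FrameworkInfo is a protobuf message with many more fields.
enum class Capability { PARTITION_AWARE, MULTI_ROLE };

struct FrameworkInfo {
  std::vector<std::string> roles;
  std::vector<Capability> capabilities;
};

// Simplified analogue of the capability check quoted in the issue.
bool frameworkHasCapability(const FrameworkInfo& info, Capability capability) {
  return std::find(info.capabilities.begin(),
                   info.capabilities.end(),
                   capability) != info.capabilities.end();
}

// Hypothetical helper: copy only the capabilities from the master's
// updated FrameworkInfo into the copy held in agent memory. Roles (and
// everything else) are deliberately left untouched.
void updateCapabilitiesOnly(FrameworkInfo& cached, const FrameworkInfo& updated) {
  cached.capabilities = updated.capabilities;
}

// Walks through the upgrade scenario from the description. Returns true
// if the agent's copy sees PARTITION_AWARE afterwards and its roles are
// unchanged.
bool exampleUpgrade() {
  FrameworkInfo cached; // What the agent holds in memory.
  cached.roles.push_back("role-a");

  FrameworkInfo updated; // What the master would send after the upgrade.
  updated.roles.push_back("role-a");
  updated.roles.push_back("role-b");
  updated.capabilities.push_back(Capability::PARTITION_AWARE);

  // Before the update, the stale copy fails the check.
  assert(!frameworkHasCapability(cached, Capability::PARTITION_AWARE));

  updateCapabilitiesOnly(cached, updated);

  return frameworkHasCapability(cached, Capability::PARTITION_AWARE) &&
         cached.roles.size() == 1 && cached.roles[0] == "role-a";
}
```

Copying the whole struct instead ({{cached = updated}}) would also bring in the new role, which is the case the description says a 1.1.x agent cannot handle.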