[
https://issues.apache.org/jira/browse/MESOS-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050266#comment-17050266
]
Andrei Sekretenko edited comment on MESOS-7639 at 3/3/20 2:46 PM:
------------------------------------------------------------------
After converting ACCEPT to synchronous authorization (MESOS-10056), the
particular scenario described in this ticket is no longer possible, because
nothing is pending between 'accept()' and '_accept()' anymore.
Closing this ticket.
was (Author: asekretenko):
After converting ACCEPT to synchronous authorization, the particular scenario
described in this ticket is no longer possible, because nothing is pending
between 'accept()' and '_accept()' anymore.
Closing this ticket.
> Oversubscription could crash the master due to CHECK failure in the allocator
> -----------------------------------------------------------------------------
>
> Key: MESOS-7639
> URL: https://issues.apache.org/jira/browse/MESOS-7639
> Project: Mesos
> Issue Type: Bug
> Components: allocation
> Reporter: Yan Xu
> Priority: Major
>
> As I described in MESOS-7566, the following scenario is possible when the
> agent sends updated oversubscribed resources to the master:
> - The agent's {{UpdateSlaveMessage}} reduces the oversubscribed resources.
> - {{Master::updateSlave}} upon receiving the update would first call
> {{HierarchicalAllocatorProcess::updateSlave}}, followed by
> {{allocator->recoverResources}}.
> - {{HierarchicalAllocatorProcess::updateSlave}} would update
> {{roleSorter.total_}} to reduce the total, so the total could go below the
> allocation.
> - In the subsequent {{allocator->recoverResources}} call the attempt to
> remove outstanding allocation may fail to reduce it to below the total
> because some allocation may not be in outstanding offers. It could be in
> offered resources pending between {{Master::accept}} and {{Master::_accept}}.
> So the end result could still be {{total < allocation}}.
> - Then when {{Master::_accept}} is executed, it calls
> {{allocator->updateAllocation}}, in which the {{total < allocation}}
> condition could trigger the CHECK failure that crashes the master.
> The gist is that there are resources that are neither in the master's
> {{offers}} nor tracked in the allocator when {{Master::updateSlave}} is called.
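
The accounting in the scenario above can be replayed with a toy model. This is
only a sketch and not Mesos code: ToyAllocator, ToyMaster and replayScenario
are hypothetical names, resources are reduced to a single integer, and
updateSlave is simplified to recover every outstanding offer. It shows how
resources held between accept() and _accept() escape recovery and leave
total < allocation, and why the window disappears once nothing is pending
between the two calls.

```cpp
#include <cassert>

// Hypothetical toy model of the accounting; none of these types exist in Mesos.
struct ToyAllocator {
  int total = 0;      // stands in for roleSorter.total_
  int allocated = 0;  // what the allocator believes is allocated

  // Stand-in for the CHECK reached from updateAllocation.
  bool invariantHolds() const { return allocated <= total; }
};

struct ToyMaster {
  ToyAllocator allocator;
  int outstandingOffers = 0;  // resources tracked in the master's offers
  int inFlightAccept = 0;     // resources between accept() and _accept()

  void offer(int amount) {
    allocator.allocated += amount;
    outstandingOffers += amount;
  }

  // accept(): the offer leaves the master's offers while authorization is
  // still pending, so these resources are no longer visible as an offer.
  void accept(int amount) {
    assert(amount <= outstandingOffers);
    outstandingOffers -= amount;
    inFlightAccept += amount;
  }

  // updateSlave(): shrink the total first, then recover only what is still
  // in outstanding offers (simplification: all of them are recovered).
  void updateSlave(int newTotal) {
    assert(newTotal >= 0);
    allocator.total = newTotal;
    allocator.allocated -= outstandingOffers;
    outstandingOffers = 0;
  }
};

// Replays the scenario and reports whether the invariant still holds at the
// point where _accept() would call updateAllocation.
bool replayScenario(bool acceptPending) {
  ToyMaster master;
  master.allocator.total = 10;
  master.offer(4);
  master.offer(6);
  if (acceptPending) {
    master.accept(6);  // 6 units are now in neither place
  }
  master.updateSlave(5);  // oversubscribed resources shrink
  return master.allocator.invariantHolds();
}
```

With acceptPending set to true, 6 units remain allocated against a total of 5,
so the check fails. With it set to false, which is loosely analogous to the
situation after MESOS-10056, all 10 units are recovered and the check holds.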
--
This message was sent by Atlassian Jira
(v8.3.4#803005)