[ 
https://issues.apache.org/jira/browse/MESOS-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17050266#comment-17050266
 ] 

Andrei Sekretenko edited comment on MESOS-7639 at 3/3/20 2:46 PM:
------------------------------------------------------------------

After converting ACCEPT to synchronous authorization (MESOS-10056), the 
particular scenario described in this ticket is no longer possible, because 
nothing is pending between 'accept()' and '_accept()' anymore.

Closing this ticket.


was (Author: asekretenko):
After converting ACCEPT to synchronous authorization, the particular scenario 
described in this ticket is no longer possible, because nothing is pending 
between 'accept()' and '_accept()' anymore.

Closing this ticket.

> Oversubscription could crash the master due to CHECK failure in the allocator
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7639
>                 URL: https://issues.apache.org/jira/browse/MESOS-7639
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation
>            Reporter: Yan Xu
>            Priority: Major
>
> As I described in MESOS-7566, the following scenario is possible when the 
> agent sends updated oversubscribed resources to the master:
> - The agent's {{UpdateSlaveMessage}} reduces the oversubscribed resources.
> - {{Master::updateSlave}} upon receiving the update would first call 
> {{HierarchicalAllocatorProcess::updateSlave}}, followed by 
> {{allocator->recoverResources}}.
> - {{HierarchicalAllocatorProcess::updateSlave}} would reduce 
> {{roleSorter.total_}}, so the total could drop below the 
> allocation.
> - In the subsequent {{allocator->recoverResources}} call the attempt to 
> remove outstanding allocation may fail to reduce it to below the total 
> because some allocation may not be in outstanding offers. It could be in 
> offered resources pending between {{Master::accept}} and {{Master::_accept}}. 
> So the end result could still be {{total < allocation}}.
> - Then, when {{Master::_accept}} is executed, it calls 
> {{allocator->updateAllocation}}, in which the {{total < allocation}} 
> condition could trigger the CHECK failure.
> The gist is that there are resources that are neither in the master's 
> {{offers}} nor tracked in the allocator when {{Master::updateSlave}} is called.
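
The sequence in the description above can be illustrated with a small toy
model. This is only a sketch under loud assumptions: it is not Mesos code,
resources are plain integers, and every name in it (ToyAllocator,
run_scenario, the method names) is hypothetical. The RuntimeError stands in
for the CHECK failure in the allocator.

```python
# Toy model of the race described in the ticket. NOT Mesos code:
# all names are hypothetical and resources are plain integers.


class ToyAllocator:
    def __init__(self, total):
        self.total = total    # stands in for roleSorter.total_
        self.allocated = 0    # what the allocator believes is allocated

    def allocate(self, amount):
        # Hand out resources as an offer and count them as allocated.
        self.allocated += amount
        return amount

    def update_slave(self, new_total):
        # The agent reported fewer oversubscribed resources. The total is
        # lowered without looking at the current allocation.
        self.total = new_total

    def recover_resources(self, amount):
        # Resources from a rescinded offer go back to the allocator.
        self.allocated -= amount

    def update_allocation(self):
        # Stand-in for the CHECK that crashes the master.
        if self.total < self.allocated:
            raise RuntimeError("CHECK failed: total < allocation")


def run_scenario():
    allocator = ToyAllocator(total=10)

    # One offer is still outstanding in the master's offers.
    outstanding_offers = [allocator.allocate(4)]

    # Another offer was already taken out of the master's offers by
    # accept(), but _accept() has not run yet (pending authorization).
    in_flight_accept = allocator.allocate(6)

    # The agent reduces its oversubscribed resources.
    allocator.update_slave(new_total=5)

    # The master can only recover what is still in its outstanding offers.
    for offered in outstanding_offers:
        allocator.recover_resources(offered)
    outstanding_offers.clear()

    # _accept() finally runs for the in-flight offer: allocated is 6
    # (the in_flight_accept amount) while the total is only 5.
    try:
        allocator.update_allocation()
    except RuntimeError as error:
        return str(error)
    return "ok"
```

In this model the in-flight offer is invisible to the recovery step, so the
allocation stays above the reduced total. If nothing is pending between the
two accept steps, as stated in the comment above, the in-flight state does
not exist and the recovery step sees every offered resource.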



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
