Yan Xu created MESOS-7639:
-----------------------------

             Summary: Oversubscription could crash the master due to CHECK failure in the allocator
                 Key: MESOS-7639
                 URL: https://issues.apache.org/jira/browse/MESOS-7639
             Project: Mesos
          Issue Type: Bug
            Reporter: Yan Xu


As I described in MESOS-7566, the following scenario is possible when the agent 
sends updated oversubscribed resources to the master:

- The agent's {{UpdateSlaveMessage}} reduces the oversubscribed resources.
- {{Master::updateSlave}} upon receiving the update would first call 
{{HierarchicalAllocatorProcess::updateSlave}}, followed by 
{{allocator->recoverResources}}.
- {{HierarchicalAllocatorProcess::updateSlave}} would update 
{{roleSorter.total_}} to reduce the total, so the total could go below the 
allocation.
- The subsequent {{allocator->recoverResources}} call attempts to remove the 
outstanding allocation, but it may fail to bring the allocation below the total 
because some of the allocation may not be in outstanding offers: it could be in 
offered resources pending between {{Master::accept}} and {{Master::_accept}}. 
So the end result could still be {{total < allocation}}.
- When {{Master::_accept}} is then executed, it calls 
{{allocator->updateAllocation}}, in which the {{total < allocation}} condition 
could trigger the CHECK failure and crash the master.

The gist is that there are resources that are neither in the master's {{offers}} 
nor tracked in the allocator when {{Master::updateSlave}} is called.
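
The sequence above can be sketched as a toy model. This is only an 
illustration of the ordering problem, not the actual Mesos code: a single 
integer stands in for {{Resources}}, the {{ToyAllocator}} and {{ToyMaster}} 
types, the {{inFlightAccepts}} counter and the numbers used are all 
hypothetical, and it assumes that {{Master::updateSlave}} recovers everything 
still held in outstanding offers.

```cpp
#include <cassert>

// Toy model: one scalar (e.g. CPUs) stands in for Mesos `Resources`.
// All types and numbers here are hypothetical and only mirror the steps
// described in this report.
struct ToyAllocator {
  int total = 0;       // Stands in for roleSorter.total_.
  int allocation = 0;  // Everything the allocator counts as allocated.

  // Models HierarchicalAllocatorProcess::updateSlave lowering the total.
  void updateSlave(int newTotal) { total = newTotal; }

  // Models allocator->recoverResources for resources taken back from offers.
  void recoverResources(int amount) {
    assert(amount <= allocation);
    allocation -= amount;
  }

  // Models the condition that updateAllocation is assumed to CHECK.
  bool invariantHolds() const { return allocation <= total; }
};

struct ToyMaster {
  ToyAllocator allocator;
  int outstandingOffers = 0;  // Resources tracked in the master's `offers`.
  int inFlightAccepts = 0;    // Between Master::accept and Master::_accept.

  // Models Master::accept: the resources leave `offers` but remain
  // allocated until _accept runs.
  void accept(int amount) {
    outstandingOffers -= amount;
    inFlightAccepts += amount;
  }

  // Models Master::updateSlave: lower the total first, then recover only
  // what is still sitting in outstanding offers.
  void updateSlave(int newTotal) {
    allocator.updateSlave(newTotal);
    allocator.recoverResources(outstandingOffers);
    outstandingOffers = 0;
  }
};

// Returns whether `allocation <= total` still holds by the time _accept
// would call updateAllocation.
bool reproduceScenario() {
  ToyMaster master;
  master.allocator.total = 10;      // Includes oversubscribed resources.
  master.allocator.allocation = 8;  // All 8 were offered out.
  master.outstandingOffers = 8;

  master.accept(5);       // 5 units are now in flight, 3 remain offered.
  master.updateSlave(4);  // The agent shrinks oversubscription: total = 4.

  // Only the 3 offered units were recovered, so allocation is 5 and total is 4.
  return master.allocator.invariantHolds();
}
```

In this sketch {{reproduceScenario()}} returns false: the 5 in-flight units are 
never recovered, so the allocation stays above the reduced total, which is the 
state in which the real {{updateAllocation}} would hit its CHECK.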



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
