[ https://issues.apache.org/jira/browse/MESOS-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16097047#comment-16097047 ]
Dmitriy Shirchenko commented on MESOS-7639:
-------------------------------------------

A small update: we saw another instance of this crash. Since we run a patched version, I am including the relevant code along with the log.

{code}
F0721 21:43:29.141577 7454 master.cpp:9218] CHECK_SOME(resources): Invalid RESERVE Operation: cpus(*):24; mem(*):122880; ports(*):[31000-32000]; disk(*):849596; cpus(*)(allocated: aurora){REV}:12 does not contain ports(aurora, aurora, {instance_key: foo/foo/foo.foo/0})(allocated: aurora):[31139-31139, 31773-31773, 31827-31827]
{code}

The crash happened on the CHECK_SOME line:

{code}
void Slave::apply(const Offer::Operation& operation)
{
  Try<Resources> resources = totalResources.apply(operation);
  CHECK_SOME(resources);

  totalResources = resources.get();
  checkpointedResources = totalResources.filter(needCheckpointing);
}
{code}

The context is that a large job was being updated with RESERVE resources.

[~bmahler], please let me know what else I can provide. Sorry, this may not be enough for you to go on.

> Oversubscription could crash the master due to CHECK failure in the allocator
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7639
>                 URL: https://issues.apache.org/jira/browse/MESOS-7639
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Yan Xu
>
> As I described in MESOS-7566, the following scenario is possible when the
> agent sends updated oversubscribed resources to the master:
> - The agent's {{UpdateSlaveMessage}} reduces the oversubscribed resources.
> - {{Master::updateSlave}}, upon receiving the update, would first call
> {{HierarchicalAllocatorProcess::updateSlave}}, followed by
> {{allocator->recoverResources}}.
> - {{HierarchicalAllocatorProcess::updateSlave}} would update
> {{roleSorter.total_}} to reduce the total, so the total could go below the
> allocation.
> - In the subsequent {{allocator->recoverResources}} call, the attempt to
> remove outstanding allocation may fail to reduce it to below the total
> because some allocation may not be in outstanding offers. It could be in
> offered resources pending between {{Master::accept}} and {{Master::_accept}}.
> So the end result could still be {{total < allocation}}.
> - Then, when {{Master::_accept}} is executed, it calls
> {{allocator->updateAllocation}}, in which the {{total < allocation}}
> condition could trigger such a crash.
> The gist is that there are resources that are neither in the master's {{offers}}
> nor tracked in the allocator when {{Master::updateSlave}} is called.
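
To make the sequence in the description easier to follow, here is a minimal sketch of the race using a single scalar quantity in place of real Resources objects. It is not Mesos code: AgentModel, accept, updateSlave and invariantHolds are hypothetical names, and the rescind logic is a simplifying assumption about what the master does. It only shows how an allocation pending between Master::accept and Master::_accept can leave total < allocation after the total shrinks.

```cpp
#include <algorithm>

// Hypothetical scalar model of one agent's oversubscribed resources.
// None of these names exist in Mesos; they only mirror the description.
struct AgentModel {
  double total = 0;     // what the sorter believes the agent has in total
  double offered = 0;   // allocation still sitting in outstanding offers
  double inFlight = 0;  // allocation accepted but not yet processed,
                        // i.e. between Master::accept and Master::_accept

  double allocated() const { return offered + inFlight; }
};

// Analog of Master::accept: the offer leaves the master's offer set,
// but the resources remain allocated until _accept runs.
void accept(AgentModel& agent, double amount) {
  amount = std::min(amount, agent.offered);
  agent.offered -= amount;
  agent.inFlight += amount;
}

// Analog of the update path: shrink the total first, then recover
// resources from outstanding offers only. In-flight allocation cannot
// be recovered here because it is no longer in any outstanding offer.
void updateSlave(AgentModel& agent, double newTotal) {
  agent.total = newTotal;
  double excess = std::max(0.0, agent.allocated() - agent.total);
  agent.offered -= std::min(agent.offered, excess);
}

// The condition the allocator relies on: allocation never exceeds total.
bool invariantHolds(const AgentModel& agent) {
  return agent.allocated() <= agent.total;
}
```

In this model, an agent with total 12 and 12 offered keeps the invariant if the total drops to 6 with nothing in flight. If 8 of the 12 were accepted first, the same update leaves 8 allocated against a total of 6, which is the state in which the later allocation update would trip a CHECK.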