[ https://issues.apache.org/jira/browse/MESOS-6317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Mahler updated MESOS-6317:
-----------------------------------
    Summary: Race in master/allocator when updating oversubscribed resources of an agent.  (was: Race in master update slave.)

> Race in master/allocator when updating oversubscribed resources of an agent.
> ----------------------------------------------------------------------------
>
>                 Key: MESOS-6317
>                 URL: https://issues.apache.org/jira/browse/MESOS-6317
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Guangya Liu
>            Assignee: Guangya Liu
>             Fix For: 1.1.0
>
>
> Currently, when the master handles {{updateSlave}}, it first rescinds offers
> and then calls updateSlave in the allocator. There is a race here: a batch
> allocation may run between the two steps. In that case the order is
> rescind offers -> batch allocation -> update slave, which causes problems
> when the oversubscribed resources have decreased.
> Suppose the oversubscribed resources decrease from 2 to 1. After the offers
> are rescinded, the batch allocation allocates the old 2 oversubscribed
> resources again, and only then does update slave set the total
> oversubscribed resources to 1. The agent host is therefore overcommitted for
> a while, because tasks can still use 2 oversubscribed resources rather than
> 1. Once the tasks using the 2 oversubscribed resources finish, the state
> returns to normal.
> So we should swap the order of rescinding offers and updateSlave in the
> master to avoid resource overcommit.
> If we update the slave first and then rescind offers, the order becomes
> update slave -> batch allocation -> rescind offers, which causes no problem
> when decreasing resources.
> Suppose the oversubscribed resources decrease from 2 to 1. Update slave sets
> the total oversubscribed resources to 1 immediately, so the batch allocation
> does not allocate any oversubscribed resources, since more are already
> allocated than the new total. Rescinding offers then rescinds all offers
> that use oversubscribed resources.
> This does not lead to the agent host being overcommitted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
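
The two orderings described in the issue can be compared with a small toy model. This is only a sketch under stated assumptions: `ToyAgent`, `rescindOffers`, `updateTotal`, `batchAllocate`, and `acceptOffers` are hypothetical names invented for illustration, not the Mesos master or allocator API, and resources are reduced to plain integers. It assumes that all oversubscribed resources are out in offers when the update arrives, and that a framework accepts whatever is still offered at the end of the sequence.

```cpp
#include <algorithm>
#include <cassert>

// Toy model of one agent's oversubscribed resources.
// Hypothetical types and functions; this is NOT the Mesos API.
struct ToyAgent {
  int total;    // oversubscribed total as the allocator currently sees it
  int offered;  // offered to frameworks, not yet accepted
  int used;     // in use by running tasks
};

// Master step: rescind all outstanding offers.
void rescindOffers(ToyAgent& a) { a.offered = 0; }

// Allocator step: learn the new oversubscribed total.
void updateTotal(ToyAgent& a, int newTotal) { a.total = newTotal; }

// Batch allocation: offer whatever the allocator believes is still free.
void batchAllocate(ToyAgent& a) {
  assert(a.total >= 0);
  int available = std::max(0, a.total - a.offered - a.used);
  a.offered += available;
}

// Assumption: a framework accepts every outstanding offer and launches tasks.
void acceptOffers(ToyAgent& a) {
  a.used += a.offered;
  a.offered = 0;
}

// Order from the bug report: rescind -> batch allocation -> update.
// Returns by how much task usage exceeds the updated total.
int overcommitRescindFirst(int oldTotal, int newTotal) {
  ToyAgent a{oldTotal, oldTotal, 0};  // everything is out in offers
  rescindOffers(a);
  batchAllocate(a);          // the racing batch allocation still sees oldTotal
  updateTotal(a, newTotal);
  acceptOffers(a);
  return std::max(0, a.used - a.total);
}

// Proposed order: update -> batch allocation -> rescind.
int overcommitUpdateFirst(int oldTotal, int newTotal) {
  ToyAgent a{oldTotal, oldTotal, 0};
  updateTotal(a, newTotal);
  batchAllocate(a);          // nothing is free: offered already exceeds total
  rescindOffers(a);
  acceptOffers(a);
  return std::max(0, a.used - a.total);
}
```

With the numbers from the issue, `overcommitRescindFirst(2, 1)` returns 1 (tasks hold 2 while the total is 1), and `overcommitUpdateFirst(2, 1)` returns 0.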