[ https://issues.apache.org/jira/browse/MESOS-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Bannier updated MESOS-8524:
------------------------------------
    Comment: was deleted

(was: Review: https://reviews.apache.org/r/65506/)

> When `UPDATE_SLAVE` messages are received, offers might not be rescinded due to a race
> ---------------------------------------------------------------------------------------
>
>                 Key: MESOS-8524
>                 URL: https://issues.apache.org/jira/browse/MESOS-8524
>             Project: Mesos
>          Issue Type: Bug
>          Components: allocation, master
>    Affects Versions: 1.5.0
>         Environment: Master + Agent running with enabled {{RESOURCE_PROVIDER}} capability
>            Reporter: Jan Schlicht
>            Assignee: Benjamin Bannier
>            Priority: Major
>              Labels: mesosphere
>
> When an agent with enabled {{RESOURCE_PROVIDER}} capability (re-)registers
> with the master, it sends an {{UPDATE_SLAVE}} message after being
> (re-)registered. In the master, the agent is added (back) to the allocator as
> soon as it is (re-)registered, i.e. before {{UPDATE_SLAVE}} is sent. This
> triggers an allocation, and offers might get sent out to frameworks. When
> {{UPDATE_SLAVE}} is handled in the master, these offers have to be rescinded,
> as they are based on an outdated agent state.
> Internally, the allocator defers an offer callback in the master
> ({{Master::offer}}). In rare cases, an {{UPDATE_SLAVE}} message might arrive at
> the same time, and its handler in the master is called before the offer
> callback (but after the actual allocation took place). In this case the
> (outdated) offer is still sent to frameworks and never rescinded.
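The ordering described above can be reproduced in a small standalone model. This is only a sketch under stated assumptions, not Mesos code: `Offer`, `Master`, `agentVersion`, `dropStaleOffers` and `staleOffersAfterRace` are hypothetical names invented for illustration, the master's event queue is reduced to a plain `std::deque`, and the version check is just one conceivable mitigation, not necessarily the fix that was adopted for this issue.

```cpp
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified model; none of these types exist in Mesos.
struct Offer {
  int agentVersion;  // Agent state version the allocation was based on.
};

struct Master {
  int agentVersion = 1;            // Current agent state version.
  std::vector<Offer> outstanding;  // Offers sent and not yet rescinded.
  bool dropStaleOffers = false;    // Hypothetical mitigation toggle.

  // Stands in for the UPDATE_SLAVE handler: rescind all outstanding offers
  // for the agent, then record the new agent state.
  void updateSlave() {
    outstanding.clear();
    ++agentVersion;
  }

  // Stands in for the deferred offer callback.
  void offer(const Offer& o) {
    if (dropStaleOffers && o.agentVersion != agentVersion) {
      return;  // The allocation predates the latest agent update.
    }
    outstanding.push_back(o);
  }
};

// Replays the racy interleaving and returns how many offers based on an
// outdated agent state remain outstanding afterwards.
int staleOffersAfterRace(bool dropStaleOffers) {
  Master master;
  master.dropStaleOffers = dropStaleOffers;

  // The allocation happens against the pre-update agent state.
  const Offer allocated{master.agentVersion};

  // The update handler ends up in the queue ahead of the offer callback.
  std::deque<std::function<void()>> mailbox;
  mailbox.push_back([&master] { master.updateSlave(); });
  mailbox.push_back([&master, allocated] { master.offer(allocated); });

  while (!mailbox.empty()) {
    std::function<void()> event = std::move(mailbox.front());
    mailbox.pop_front();
    event();
  }

  int stale = 0;
  for (const Offer& o : master.outstanding) {
    if (o.agentVersion != master.agentVersion) {
      ++stale;
    }
  }
  return stale;
}
```

In this model, `staleOffersAfterRace(false)` leaves one outdated offer outstanding, because the rescind step runs before the offer exists. With the hypothetical check enabled, `staleOffersAfterRace(true)` leaves none.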
> Here are the relevant log lines; this was discovered while working on
> https://reviews.apache.org/r/65045/:
> {noformat}
> I0201 14:17:47.041093 242208768 hierarchical.cpp:1517] Performed allocation for 1 agents in 704915ns
> I0201 14:17:47.041738 242745344 master.cpp:7235] Received update of agent 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 at slave(540)@172.18.8.20:60469 (172.18.8.20) with total oversubscribed resources {}
> I0201 14:17:47.042778 242745344 master.cpp:8808] Sending 1 offers to framework 53c557e7-3161-449b-bacc-a4f8c02e78e7-0000 (default) at scheduler-798f476b-b099-443e-bd3b-9e7333f29672@172.18.8.20:60469
> I0201 14:17:47.043102 243281920 sched.cpp:921] Scheduler::resourceOffers took 40444ns
> I0201 14:17:47.043427 243818496 hierarchical.cpp:712] Grew agent 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 by disk[MOUNT]:200 (total), { } (used)
> I0201 14:17:47.043643 243818496 hierarchical.cpp:669] Agent 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 (172.18.8.20) updated with total resources disk[MOUNT]:200; cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)