Jan Schlicht created MESOS-8524:
-----------------------------------

             Summary: When `UPDATE_SLAVE` messages are received, offers might 
not be rescinded due to a race 
                 Key: MESOS-8524
                 URL: https://issues.apache.org/jira/browse/MESOS-8524
             Project: Mesos
          Issue Type: Bug
          Components: allocation, master
    Affects Versions: 1.5.0
         Environment: Master + Agent running with the {{RESOURCE_PROVIDER}} 
capability enabled
            Reporter: Jan Schlicht


When an agent with the {{RESOURCE_PROVIDER}} capability enabled (re-)registers 
with the master, it sends an {{UPDATE_SLAVE}} message after being 
(re-)registered. In the master, the agent is added (back) to the allocator as 
soon as it is (re-)registered, i.e. before {{UPDATE_SLAVE}} is sent. This 
triggers an allocation, and offers might get sent out to frameworks. When 
{{UPDATE_SLAVE}} is handled in the master, these offers have to be rescinded, 
as they're based on an outdated agent state.
Internally, the allocator defers an offer callback in the master 
({{Master::offer}}). In rare cases an {{UPDATE_SLAVE}} message might arrive at 
the same time, and its handler in the master is called before the offer 
callback (but after the actual allocation took place). In this case the 
(outdated) offer is still sent to frameworks and never rescinded.
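
The ordering problem can be illustrated with a small toy model. This is only a 
sketch, not Mesos code: the {{Master}} class, its event queue, the integer 
"agent version" and the {{simulate}} helper are all hypothetical stand-ins for 
the master actor, its mailbox, the agent's resource state and a test driver.

```python
from collections import deque


class Master:
    """Toy model (not the Mesos API): handles queued events one at a time."""

    def __init__(self):
        self.queue = deque()      # pending events, handled in FIFO order
        self.outstanding = {}     # offer id -> agent version the offer is based on
        self.agent_version = 0    # bumped whenever UPDATE_SLAVE is handled
        self.next_offer_id = 0

    def offer(self, snapshot):
        # Deferred offer callback: records (sends) an offer built from the
        # agent state that the allocation saw earlier.
        self.outstanding[self.next_offer_id] = snapshot
        self.next_offer_id += 1

    def update_slave(self):
        # UPDATE_SLAVE handler: rescinds the offers that are outstanding
        # right now, then records the new agent state.
        self.outstanding.clear()
        self.agent_version += 1

    def run(self):
        while self.queue:
            self.queue.popleft()()

    def stale_offers(self):
        # Offers still outstanding although based on an outdated agent state.
        return [offer_id for offer_id, version in self.outstanding.items()
                if version != self.agent_version]


def simulate(update_first):
    master = Master()
    snapshot = master.agent_version   # the allocation happens here

    def offer_callback():
        master.offer(snapshot)

    if update_first:
        # Racy order: UPDATE_SLAVE is handled before the offer callback.
        master.queue.extend([master.update_slave, offer_callback])
    else:
        # Expected order: the offer is sent first, then rescinded.
        master.queue.extend([offer_callback, master.update_slave])
    master.run()
    return master.stale_offers()
```

In this model {{simulate(update_first=False)}} leaves no stale offers, while 
{{simulate(update_first=True)}} leaves one outdated offer outstanding, because 
the rescind step ran before the offer existed. The latter corresponds to the 
order of the log lines below.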

Here are the relevant log lines; this was discovered while working on 
https://reviews.apache.org/r/65045/:
{noformat}
I0201 14:17:47.041093 242208768 hierarchical.cpp:1517] Performed allocation for 
1 agents in 704915ns
I0201 14:17:47.041738 242745344 master.cpp:7235] Received update of agent 
53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 at slave(540)@172.18.8.20:60469 
(172.18.8.20) with total oversubscribed resources {}
I0201 14:17:47.042778 242745344 master.cpp:8808] Sending 1 offers to framework 
53c557e7-3161-449b-bacc-a4f8c02e78e7-0000 (default) at 
scheduler-798f476b-b099-443e-bd3b-9e7333f29672@172.18.8.20:60469
I0201 14:17:47.043102 243281920 sched.cpp:921] Scheduler::resourceOffers took 
40444ns
I0201 14:17:47.043427 243818496 hierarchical.cpp:712] Grew agent 
53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 by disk[MOUNT]:200 (total), {  } (used)
I0201 14:17:47.043643 243818496 hierarchical.cpp:669] Agent 
53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 (172.18.8.20) updated with total 
resources disk[MOUNT]:200; cpus:2; mem:1024; disk:1024; ports:[31000-32000]
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
