[ 
https://issues.apache.org/jira/browse/MESOS-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126305#comment-16126305
 ] 

Huadong Liu commented on MESOS-7882:
------------------------------------

I was able to reproduce the problem. My setup has two Mesos agents:
{noformat}
af584a07-7b1c-4955-861e-63585af8bb5d-S0: 10.255.55.153
af584a07-7b1c-4955-861e-63585af8bb5d-S1: 10.255.52.14
{noformat}

The modified example framework holds received offers for 30 seconds
and launches tasks only on S0:
{noformat}
diff --git a/src/examples/python/test_framework.py b/src/examples/python/test_framework.py
     def resourceOffers(self, driver, offers):
+        time.sleep(30)
         for offer in offers:
+            if 'af584a07-7b1c-4955-861e-63585af8bb5d-S1' == offer.slave_id.value:
+                print("ignore offers from af584a07-7b1c-4955-861e-63585af8bb5d-S1")
+                continue
             tasks = []
{noformat} 

Start test-framework and, while the framework is sleeping, post a maintenance
schedule for S1 from another terminal.
{noformat}
~/mesos/build$ ./src/examples/python/test-framework 10.255.52.14:5050
I0814 11:48:21.296404  4182 sched.cpp:232] Version: 1.3.0
I0814 11:48:21.301652  4222 sched.cpp:336] New master detected at master@10.255.52.14:5050
I0814 11:48:21.302145  4222 sched.cpp:352] No credentials provided. Attempting to register without authentication
I0814 11:48:21.306299  4224 sched.cpp:759] Framework registered with af584a07-7b1c-4955-861e-63585af8bb5d-0014
Registered with framework ID af584a07-7b1c-4955-861e-63585af8bb5d-0014

---------------------
$ cat schedule.json
{
  "windows" : [
    {
      "machine_ids" : [
        { "ip" : "10.255.52.14" }
      ],
      "unavailability" : {
        "start" : { "nanoseconds" : 1502734375000000000 },
        "duration" : { "nanoseconds" : 3600000000000 }
      }
    }
  ]
}
$ curl http://10.255.52.14:5050/maintenance/schedule -H "Content-type: application/json" -X POST -d @schedule.json
----------------

Received offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 with cpus: 3.0 and mem: 2927.0
Launching task 0 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
Launching task 1 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
Launching task 2 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
ignore offers from af584a07-7b1c-4955-861e-63585af8bb5d-S1
ignore offers from af584a07-7b1c-4955-861e-63585af8bb5d-S1
Received offer af584a07-7b1c-4955-861e-63585af8bb5d-O156 with cpus: 3.0 and mem: 2927.0
Launching task 3 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O156
Launching task 4 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O156
W0814 11:49:51.406801  4218 sched.cpp:1371] Attempting to accept an unknown offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
Task 0 is in state TASK_LOST
{noformat}
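
The schedule posted above expresses both values in nanoseconds: the start value 1502734375000000000 is the Unix timestamp 1502734375 (seconds), and the duration 3600000000000 is one hour. A minimal Python sketch of how such a payload could be generated is below. The helper name `build_schedule` is hypothetical and not part of Mesos; only the JSON field names are taken from schedule.json above.

```python
import json

NANOS_PER_SECOND = 10 ** 9


def build_schedule(ips, start_epoch_seconds, duration_seconds):
    """Build a maintenance schedule with one window covering the given IPs.

    Hypothetical helper: it only mirrors the structure of schedule.json
    shown in this comment and converts seconds to nanoseconds.
    """
    return {
        "windows": [
            {
                "machine_ids": [{"ip": ip} for ip in ips],
                "unavailability": {
                    "start": {
                        "nanoseconds": start_epoch_seconds * NANOS_PER_SECOND
                    },
                    "duration": {
                        "nanoseconds": duration_seconds * NANOS_PER_SECOND
                    },
                },
            }
        ]
    }


if __name__ == "__main__":
    # Same values as the schedule.json used in the repro.
    schedule = build_schedule(["10.255.52.14"], 1502734375, 3600)
    print(json.dumps(schedule, indent=2))
```

The printed JSON can be saved to a file and posted with the same curl command used in the repro.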

Mesos master logs while this is happening:
{noformat}
I0814 11:48:21.302987  1530 master.cpp:2596] Received SUBSCRIBE call for framework 'Test Framework (Python)' at scheduler-6d672749-4414-4266-adfc-2b7ff5694d5b@10.255.52.14:45893
I0814 11:48:21.303450  1530 master.cpp:2672] Subscribing framework Test Framework (Python) with checkpointing enabled and capabilities [  ]
I0814 11:48:21.304566  1529 hierarchical.cpp:275] Added framework af584a07-7b1c-4955-861e-63585af8bb5d-0014
I0814 11:48:21.306139  1530 master.cpp:6517] Sending 2 offers to framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 (Test Framework (Python)) at scheduler-6d672749-4414-4266-adfc-2b7ff5694d5b@10.255.52.14:45893
I0814 11:48:25.076035  1533 http.cpp:391] HTTP POST for /master/maintenance/schedule from 10.255.55.153:37186 with User-Agent='curl/7.47.0'
I0814 11:48:25.077271  1533 registrar.cpp:461] Applied 1 operations in 272915ns; attempting to update the registry
I0814 11:48:25.078277  1533 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 39
I0814 11:48:25.079033  1533 replica.cpp:537] Replica received write request for position 39 from __req_res__(44)@10.255.52.14:5050
I0814 11:48:25.082299  1531 replica.cpp:691] Replica received learned notice for position 39 from @0.0.0.0:0
I0814 11:48:25.085546  1531 registrar.cpp:506] Successfully updated the registry in 8.176128ms
I0814 11:48:25.085726  1535 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 40
I0814 11:48:25.086496  1528 master.cpp:5645] Removing unavailability of agent af584a07-7b1c-4955-861e-63585af8bb5d-S1 at slave(1)@10.255.52.14:5051 (10.255.52.14)
I0814 11:48:25.086550  1530 replica.cpp:537] Replica received write request for position 40 from __req_res__(45)@10.255.52.14:5050
I0814 11:48:25.087936  1530 replica.cpp:691] Replica received learned notice for position 40 from @0.0.0.0:0
I0814 11:48:25.088673  1528 master.cpp:5645] Removing unavailability of agent af584a07-7b1c-4955-861e-63585af8bb5d-S0 at slave(1)@10.255.55.153:5051 (10.255.55.153)
I0814 11:48:25.089725  1528 master.cpp:6517] Sending 1 offers to framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 (Test Framework (Python)) at scheduler-6d672749-4414-4266-adfc-2b7ff5694d5b@10.255.52.14:45893
I0814 11:48:25.090461  1529 master.cpp:6517] Sending 1 offers to framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 (Test Framework (Python)) at scheduler-6d672749-4414-4266-adfc-2b7ff5694d5b@10.255.52.14:45893
W0814 11:49:51.408465  1534 master.cpp:3494] Ignoring accept of offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 since it is no longer valid
W0814 11:49:51.408888  1534 master.cpp:3505] ACCEPT call used invalid offers '[ af584a07-7b1c-4955-861e-63585af8bb5d-O153 ]': Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 is no longer valid
I0814 11:49:51.409276  1534 master.cpp:5772] Sending status update TASK_LOST for task 0 of framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 'Task launched with invalid offers: Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 is no longer valid'
I0814 11:49:51.409920  1534 master.cpp:5772] Sending status update TASK_LOST for task 1 of framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 'Task launched with invalid offers: Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 is no longer valid'
I0814 11:49:51.410332  1534 master.cpp:5772] Sending status update TASK_LOST for task 2 of framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 'Task launched with invalid offers: Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 is no longer valid'
{noformat}
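
In the logs above, offer O153 is the one the framework used to launch tasks, so it presumably came from S0 (the modified framework skips S1). It was nevertheless reported as no longer valid, although only S1 (10.255.52.14) was in the posted schedule. The Python sketch below is a toy model contrasting that observed behavior with the behavior the issue asks for. It is not the Mesos master code (which is C++); the function names and the placeholder ID `offer-on-S1` are hypothetical.

```python
def rescind_all(outstanding_offers, scheduled_agents):
    """Observed behavior: every outstanding offer is rescinded,
    regardless of which agents are in the maintenance schedule."""
    return set(outstanding_offers)


def rescind_scheduled_only(outstanding_offers, scheduled_agents):
    """Requested behavior: rescind only offers whose agent is in
    the posted maintenance schedule."""
    return {
        offer_id
        for offer_id, agent_id in outstanding_offers.items()
        if agent_id in scheduled_agents
    }


if __name__ == "__main__":
    # Maps offer ID -> agent ID. "O153" is taken from the logs;
    # "offer-on-S1" is a placeholder because the S1 offer ID is not logged.
    offers = {"O153": "S0", "offer-on-S1": "S1"}
    scheduled = {"S1"}

    print("observed: ", sorted(rescind_all(offers, scheduled)))
    print("requested:", sorted(rescind_scheduled_only(offers, scheduled)))
```

Under the requested behavior, O153 would stay valid and the three tasks launched with it would not be reported as TASK_LOST.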

> Mesos master rescinds all the in-flight offers from all the registered agents 
> when a new maintenance schedule is posted for a subset of slaves
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-7882
>                 URL: https://issues.apache.org/jira/browse/MESOS-7882
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.3.0
>         Environment: Ubuntu 14.04 (trusty)
> Mesos master branch.
> SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded
>            Reporter: Sagar Sadashiv Patwardhan
>            Priority: Minor
>
> We are running Mesos 1.1.0 in production. We use a custom autoscaler for 
> scaling our Mesos cluster up and down. While scaling down the cluster, the 
> autoscaler makes a POST request to the Mesos master's /maintenance/schedule 
> endpoint with a set of slaves to move to maintenance mode. This forces the 
> Mesos master to rescind all the in-flight offers from *all the slaves* in the 
> cluster. If our scheduler accepts one of these offers, we get a 
> TASK_LOST status update back for that task. We also see log lines such as 
> these (https://gist.github.com/sagar8192/8858e7cb59a23e8e1762a27571824118) 
> in the Mesos master logs.
> After reading the code (refs: 
> https://github.com/apache/mesos/blob/master/src/master/master.cpp#L6772), it 
> appears that offers are getting rescinded for all the slaves. I am not sure 
> what the expected behavior is here, but it would make more sense if only 
> resources from slaves marked for maintenance were reclaimed.
> *Experiment:*
> To verify that this is actually happening, I checked out the master branch 
> (sha: a31dd52ab71d2a529b55cd9111ec54acf7550ded) and added some log lines 
> (https://gist.github.com/sagar8192/42ca055720549c5ff3067b1e6c7c68b3). 
> I built the binary and started a Mesos master and 2 agent processes. I used a 
> basic Python framework that launches Docker containers on these slaves. 
> I verified that there was no existing schedule for any slave using `curl 
> 10.40.19.239:5050/maintenance/status`, then posted a maintenance schedule 
> for one of the slaves 
> (https://gist.github.com/sagar8192/fb65170240dd32a53f27e6985c549df0) 
> after starting the Mesos framework.
> *Logs:*
> mesos-master: 
> https://gist.github.com/sagar8192/91888419fdf8284e33ebd58351131203
> mesos-slave1: 
> https://gist.github.com/sagar8192/3a83364b1f5ffc63902a80c728647f31
> mesos-slave2: 
> https://gist.github.com/sagar8192/1b341ef2271dde11d276974a27109426
> Mesos framework: 
> https://gist.github.com/sagar8192/bcd4b37dba03bde0a942b5b972004e8a
> I think Mesos should rescind offers and inverse offers only for those slaves 
> that are marked for maintenance (draining mode).



