[ https://issues.apache.org/jira/browse/MESOS-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126305#comment-16126305 ]
Huadong Liu commented on MESOS-7882:
------------------------------------

I was able to repro the problem. My setup has two mesos agents:
{noformat}
af584a07-7b1c-4955-861e-63585af8bb5d-S0: 10.255.55.153
af584a07-7b1c-4955-861e-63585af8bb5d-S1: 10.255.52.14
{noformat}

The modified example framework holds received offers for 30 seconds and only launches tasks on S0.
{noformat}
diff --git a/src/examples/python/test_framework.py b/src/examples/python/test_framework.py
     def resourceOffers(self, driver, offers):
+        time.sleep(30)
         for offer in offers:
+            if 'af584a07-7b1c-4955-861e-63585af8bb5d-S1' == offer.slave_id.value:
+                print("ignore offers from af584a07-7b1c-4955-861e-63585af8bb5d-S1")
+                continue
             tasks = []
{noformat}

Start test-framework, then post a maintenance schedule for S1 from another terminal while the framework is sleeping.
{noformat}
~/mesos/build$ ./src/examples/python/test-framework 10.255.52.14:5050
I0814 11:48:21.296404 4182 sched.cpp:232] Version: 1.3.0
I0814 11:48:21.301652 4222 sched.cpp:336] New master detected at master@10.255.52.14:5050
I0814 11:48:21.302145 4222 sched.cpp:352] No credentials provided.
Attempting to register without authentication
I0814 11:48:21.306299 4224 sched.cpp:759] Framework registered with af584a07-7b1c-4955-861e-63585af8bb5d-0014
Registered with framework ID af584a07-7b1c-4955-861e-63585af8bb5d-0014
---------------------
$ cat schedule.json
{
  "windows" : [
    {
      "machine_ids" : [
        { "ip" : "10.255.52.14" }
      ],
      "unavailability" : {
        "start" : { "nanoseconds" : 1502734375000000000 },
        "duration" : { "nanoseconds" : 3600000000000 }
      }
    }
  ]
}
$ curl http://10.255.52.14:5050/maintenance/schedule -H "Content-type: application/json" -X POST -d @schedule.json
----------------
Received offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 with cpus: 3.0 and mem: 2927.0
Launching task 0 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
Launching task 1 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
Launching task 2 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
ignore offers from af584a07-7b1c-4955-861e-63585af8bb5d-S1
ignore offers from af584a07-7b1c-4955-861e-63585af8bb5d-S1
Received offer af584a07-7b1c-4955-861e-63585af8bb5d-O156 with cpus: 3.0 and mem: 2927.0
Launching task 3 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O156
Launching task 4 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O156
W0814 11:49:51.406801 4218 sched.cpp:1371] Attempting to accept an unknown offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
Task 0 is in state TASK_LOST
{noformat}

Mesos master logs while this is happening:
{noformat}
I0814 11:48:21.302987 1530 master.cpp:2596] Received SUBSCRIBE call for framework 'Test Framework (Python)' at scheduler-6d672749-4414-4266-adfc-2b7ff5694d5b@10.255.52.14:45893
I0814 11:48:21.303450 1530 master.cpp:2672] Subscribing framework Test Framework (Python) with checkpointing enabled and capabilities [ ]
I0814 11:48:21.304566 1529 hierarchical.cpp:275] Added framework af584a07-7b1c-4955-861e-63585af8bb5d-0014
I0814 11:48:21.306139 1530 master.cpp:6517] Sending 2 offers to framework af584a07-7b1c-4955-861e-63585af8bb5d-0014
(Test Framework (Python)) at scheduler-6d672749-4414-4266-adfc-2b7ff5694d5b@10.255.52.14:45893
I0814 11:48:25.076035 1533 http.cpp:391] HTTP POST for /master/maintenance/schedule from 10.255.55.153:37186 with User-Agent='curl/7.47.0'
I0814 11:48:25.077271 1533 registrar.cpp:461] Applied 1 operations in 272915ns; attempting to update the registry
I0814 11:48:25.078277 1533 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 39
I0814 11:48:25.079033 1533 replica.cpp:537] Replica received write request for position 39 from __req_res__(44)@10.255.52.14:5050
I0814 11:48:25.082299 1531 replica.cpp:691] Replica received learned notice for position 39 from @0.0.0.0:0
I0814 11:48:25.085546 1531 registrar.cpp:506] Successfully updated the registry in 8.176128ms
I0814 11:48:25.085726 1535 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 40
I0814 11:48:25.086496 1528 master.cpp:5645] Removing unavailability of agent af584a07-7b1c-4955-861e-63585af8bb5d-S1 at slave(1)@10.255.52.14:5051 (10.255.52.14)
I0814 11:48:25.086550 1530 replica.cpp:537] Replica received write request for position 40 from __req_res__(45)@10.255.52.14:5050
I0814 11:48:25.087936 1530 replica.cpp:691] Replica received learned notice for position 40 from @0.0.0.0:0
I0814 11:48:25.088673 1528 master.cpp:5645] Removing unavailability of agent af584a07-7b1c-4955-861e-63585af8bb5d-S0 at slave(1)@10.255.55.153:5051 (10.255.55.153)
I0814 11:48:25.089725 1528 master.cpp:6517] Sending 1 offers to framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 (Test Framework (Python)) at scheduler-6d672749-4414-4266-adfc-2b7ff5694d5b@10.255.52.14:45893
I0814 11:48:25.090461 1529 master.cpp:6517] Sending 1 offers to framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 (Test Framework (Python)) at scheduler-6d672749-4414-4266-adfc-2b7ff5694d5b@10.255.52.14:45893
W0814 11:49:51.408465 1534 master.cpp:3494] Ignoring accept of offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
since it is no longer valid
W0814 11:49:51.408888 1534 master.cpp:3505] ACCEPT call used invalid offers '[ af584a07-7b1c-4955-861e-63585af8bb5d-O153 ]': Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 is no longer valid
I0814 11:49:51.409276 1534 master.cpp:5772] Sending status update TASK_LOST for task 0 of framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 'Task launched with invalid offers: Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 is no longer valid'
I0814 11:49:51.409920 1534 master.cpp:5772] Sending status update TASK_LOST for task 1 of framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 'Task launched with invalid offers: Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 is no longer valid'
I0814 11:49:51.410332 1534 master.cpp:5772] Sending status update TASK_LOST for task 2 of framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 'Task launched with invalid offers: Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 is no longer valid'
{noformat}

> Mesos master rescinds all the in-flight offers from all the registered agents
> when a new maintenance schedule is posted for a subset of slaves
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MESOS-7882
> URL: https://issues.apache.org/jira/browse/MESOS-7882
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.3.0
> Environment: Ubuntu 14.04 (trusty)
>              Mesos master branch.
>              SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded
> Reporter: Sagar Sadashiv Patwardhan
> Priority: Minor
>
> We are running mesos 1.1.0 in production. We use a custom autoscaler for
> scaling our mesos cluster up and down. While scaling down the cluster, the
> autoscaler makes a POST request to the mesos master /maintenance/schedule
> endpoint with a set of slaves to move to maintenance mode. This forces the
> mesos master to rescind all the in-flight offers from *all the slaves* in the
> cluster.
> If our scheduler accepts one of these offers, then we get a
> TASK_LOST status update back for that task. We also see log lines like these
> (https://gist.github.com/sagar8192/8858e7cb59a23e8e1762a27571824118) in the
> mesos master logs.
> After reading the code (refs:
> https://github.com/apache/mesos/blob/master/src/master/master.cpp#L6772), it
> appears that offers are getting rescinded for all the slaves. I am not sure
> what the expected behavior is here, but it makes more sense if only resources
> from slaves marked for maintenance are reclaimed.
> *Experiment:*
> To verify that it is actually happening, I checked out the master branch (sha:
> a31dd52ab71d2a529b55cd9111ec54acf7550ded) and added some log lines
> (https://gist.github.com/sagar8192/42ca055720549c5ff3067b1e6c7c68b3).
> Built the binary and started a mesos master and 2 agent processes. Used a
> basic python framework that launches docker containers on these slaves.
> Verified that there is no existing schedule for any slaves using `curl
> 10.40.19.239:5050/maintenance/status`. Posted a maintenance schedule for one of
> the slaves
> (https://gist.github.com/sagar8192/fb65170240dd32a53f27e6985c549df0)
> after starting the mesos framework.
> *Logs:*
> mesos-master:
> https://gist.github.com/sagar8192/91888419fdf8284e33ebd58351131203
> mesos-slave1:
> https://gist.github.com/sagar8192/3a83364b1f5ffc63902a80c728647f31
> mesos-slave2:
> https://gist.github.com/sagar8192/1b341ef2271dde11d276974a27109426
> Mesos framework:
> https://gist.github.com/sagar8192/bcd4b37dba03bde0a942b5b972004e8a
> I think mesos should rescind offers and inverse offers only for those slaves
> that are marked for maintenance (draining mode).

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
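
To make the behavior requested in the issue concrete, here is a toy model in Python of rescinding only the offers that belong to agents named in the posted schedule. It is a sketch and not Mesos code: the `Offer` class, the helper functions `scheduled_ips` and `offers_to_rescind`, and the offer IDs are all hypothetical names chosen for illustration. Only the schedule contents and the two agent IPs come from the repro above, and the sketch assumes machines are identified by `ip` as in schedule.json.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    # Hypothetical stand-in for an outstanding offer; not a Mesos type.
    offer_id: str
    agent_ip: str


def scheduled_ips(schedule):
    """Collect the IPs of all machines named in any maintenance window."""
    return {
        machine["ip"]
        for window in schedule.get("windows", [])
        for machine in window.get("machine_ids", [])
        if "ip" in machine
    }


def offers_to_rescind(offers, schedule):
    """Select only the offers that come from agents in the schedule."""
    ips = scheduled_ips(schedule)
    return [offer for offer in offers if offer.agent_ip in ips]


# Same shape as schedule.json in the repro: only S1 (10.255.52.14) is listed.
schedule = {
    "windows": [
        {
            "machine_ids": [{"ip": "10.255.52.14"}],
            "unavailability": {
                "start": {"nanoseconds": 1502734375000000000},
                "duration": {"nanoseconds": 3600000000000},
            },
        }
    ]
}

# Illustrative offer IDs, one per agent in the repro setup.
offers = [
    Offer("offer-on-S0", "10.255.55.153"),
    Offer("offer-on-S1", "10.255.52.14"),
]

rescinded = offers_to_rescind(offers, schedule)
print([offer.offer_id for offer in rescinded])
```

With the schedule from the repro, this prints `['offer-on-S1']`: the offer from S0 stays valid, so tasks launched against it would not come back as TASK_LOST.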