----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/52465/#review151601 -----------------------------------------------------------
Fix it, then Ship it! Thanks for fixing this! Glad that you added a second test for the decreasing case. Can you file a bug for this issue? src/master/master.cpp (lines 5554 - 5573) <https://reviews.apache.org/r/52465/#comment220022> How about moving this up to above the update to the agent's resources? Also I think we can make this a bit more succinct: ``` // NOTE: We must *first* update the agent's resources before we // recover the resources. If we recovered the resources first, // an allocation could trigger between recovering resources and // updating the agent in the allocator. This would lead us to // re-send out the stale oversubscribed resources! ``` src/tests/oversubscription_tests.cpp (line 443) <https://reviews.apache.org/r/52465/#comment220023> This says "if" but we should say something like "In this test the oversubscribed resources are increased, so the master will send out ...". src/tests/oversubscription_tests.cpp (line 515) <https://reviews.apache.org/r/52465/#comment220034> Do you need this WillRepeatedly? src/tests/oversubscription_tests.cpp (line 523) <https://reviews.apache.org/r/52465/#comment220026> Not yours, but can we move this pause to the top? src/tests/oversubscription_tests.cpp (lines 533 - 535) <https://reviews.apache.org/r/52465/#comment220025> Mind lining this up against the parenthesis? ``` EXPECT_EQ(createRevocableResources("cpus", "2"), Resources(offers3.get()[0].resources())); ``` src/tests/oversubscription_tests.cpp (lines 542 - 544) <https://reviews.apache.org/r/52465/#comment220027> ``` // Advance the clock to trigger a batch allocation, this will // allocate the oversubscribed resources that were rescinded. ``` src/tests/oversubscription_tests.cpp (lines 558 - 563) <https://reviews.apache.org/r/52465/#comment220029> How about s/shrinked/decreased/ ? src/tests/oversubscription_tests.cpp (lines 560 - 562) <https://reviews.apache.org/r/52465/#comment220030> Same comment here as above about the use of "if". src/tests/oversubscription_tests.cpp (line 631) <https://reviews.apache.org/r/52465/#comment220035> Do you need this WillRepeatedly? src/tests/oversubscription_tests.cpp (lines 633 - 634) <https://reviews.apache.org/r/52465/#comment220031> s/shrinked/decreased/ src/tests/oversubscription_tests.cpp (line 639) <https://reviews.apache.org/r/52465/#comment220032> Can this be moved up to the beginning of the test? (Eventually we want to force a paused clock for all tests to eliminate implicit reliance on time progressing, leads to too many flaky tests). src/tests/oversubscription_tests.cpp (lines 646 - 648) <https://reviews.apache.org/r/52465/#comment220033> Ditto comment from the test above. - Benjamin Mahler On Oct. 1, 2016, 9:34 a.m., Guangya Liu wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/52465/ > ----------------------------------------------------------- > > (Updated Oct. 1, 2016, 9:34 a.m.) > > > Review request for mesos and Benjamin Mahler. > > > Repository: mesos > > > Description > ------- > > The reason that we need `updateSlave` first and then rescind offer > is because of a race condition case: there may be a batch allocation > triggered between rescind offer and `updateSlave`. In this case, the > order will be rescind offer -> batch allocation -> update slave. This > order will cause some issues when the oversubscribed resources was > shrinked. > > Suppose the oversubscribed resources was shrinked from 2 to 1, then > after rescind offer finished, the batch allocation will allocate the > old 2 oversubscribed resources again, then update slave will update > the total oversubscribed resources to 1. This will cause the agent > host have some time overcommitted due to the tasks can still use 2 > oversubscribed resources but not 1 oversubscribed resources, once > the tasks using the 2 oversubscribed resources finished, everything > goes back. > > If we update slave first then rescind offer, the order will be update > slave -> batch allocation -> rescind offer, this order will have no > problem when shrinking resources. Suppose the oversubscribed resources > was shrinked from 2 to 1, then update slave will update total > oversubscribed resources to 1 directly, then the batch allocation will > not allocate any oversubscribed resources since there are more > allocated than total oversubscribed resources, then rescind offer > will rescind all offers using oversubscribed resources. This will > not lead the agent host to be overcommitted. > > > Diffs > ----- > > src/master/master.cpp c83ee2f9fa05372748ff5056229fbe2bf06bfabb > src/tests/oversubscription_tests.cpp > 3dd34ea78ac795a6b0d342dcae86642c51841eea > > Diff: https://reviews.apache.org/r/52465/diff/ > > > Testing > ------- > > make > make check > > ``` > GLOG_v=1 ./bin/mesos-tests.sh > --gtest_filter="OversubscriptionTest.RescindRevocableOffer*" --verbose > ``` > > > Thanks, > > Guangya Liu > >