> On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote: > > src/master/master.cpp, line 5602 > > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line5602> > > > > shouldn't we send a ShutdownFrameworkMessage in this case?
This is implemented in a later review in this chain, https://reviews.apache.org/r/54232/ > On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote: > > src/master/master.cpp, lines 5578-5580 > > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line5578> > > > > yea, i think this should be > > > > ``` > > if (framework != nullptr && framework->connected()) { > > > > } > > ``` I did this in a separate review, https://reviews.apache.org/r/54380/ > On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote: > > src/master/master.cpp, lines 7132-7137 > > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line7132> > > > > hmm. it's a bit weird that activating a framework also updates it. this > > should be done by the callers before calling this method. Hmmm, why is this weird? To me it seems reasonable that when we activate a recovered framework, we're given the `FrameworkInfo` that was presented by the framework on re-registration; we use that `FrameworkInfo` to update whatever `FrameworkInfo` we previously had for the recovered framework. If we move this outside of `activateRecoveredFramework`, we'd need identical code in both of its call-sites -- that's not too bad, but I'm not sure why it is an improvement. - Neil ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/53897/#review157867 ----------------------------------------------------------- On Dec. 4, 2016, 2:06 a.m., Neil Conway wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/53897/ > ----------------------------------------------------------- > > (Updated Dec. 4, 2016, 2:06 a.m.) > > > Review request for mesos and Vinod Kone. > > > Bugs: MESOS-6419 > https://issues.apache.org/jira/browse/MESOS-6419 > > > Repository: mesos > > > Description > ------- > > After master failover, the new master doesn't know which frameworks were > registered with the previous master (because this information is not > currently stored in the registry). In the period after the master fails > over but before the framework scheduler has re-registered, the master > learns about the frameworks in the cluster when agents re-register (an > agent reports the FrameworkInfo for all of the frameworks it is running > when it re-registers). > > Such frameworks were previously represented separately from the normal > list of frameworks in the master: the master kept a collection of > `FrameworkInfo` for these "recovered" frameworks. > > This commit removes this separate collection of "recovered" frameworks. > Instead, the master now treats recovered frameworks very similarly to > frameworks that are registered but currently disconnected. For example, > recovered frameworks will now have a `Framework` object which tracks the > tasks/executors running under that framework; recovered frameworks will > be reported via the normal "frameworks" key when querying HTTP > endpoints. Similarly, "teardown" operations on recovered frameworks will > now work correctly (MESOS-6419). > > This means there is no longer a concept of "orphan tasks" [1]: if the > master knows about a task, the task will be running under a framework > (albeit the framework might be recovered or disconnected). A new > "recovered" key has been added to various HTTP endpoints/APIs to > determine if a framework hasn't yet re-registered after master failover. > > [1] The exception here is if the cluster contains Mesos agents older > than 1.0, because old Mesos agents don't report `FrameworkInfo`s when > they re-register. > > > Diffs > ----- > > include/mesos/master/master.proto 966105cf14bf92ad15a38653cd5c7f3322821289 > include/mesos/v1/master/master.proto > 0e5a8674a76a46c6f2cc17e11cd735f756d94fd1 > src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f > src/master/master.hpp b444b2352360fb4f7179acd97dffc0cd81cc7afa > src/master/master.cpp b0670d993348d189fafff0f83f9da0c5b18d1c51 > src/tests/api_tests.cpp afae6a7e0809174f48f280f170fad9315e80a906 > src/tests/fault_tolerance_tests.cpp > 59f68f65fac9fca4a6941793b712bfe7bb30c880 > src/tests/master_allocator_tests.cpp > bb94e38d5bb472801366c172cfc036f2eecdcbcb > src/tests/master_authorization_tests.cpp > 06c6e47250003cd467bf37e1daa8bd636ef8fef5 > src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 > src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 > src/tests/teardown_tests.cpp 125e1808e51da53fdf1bb1babc956ff062cbb102 > > Diff: https://reviews.apache.org/r/53897/diff/ > > > Testing > ------- > > `make check` > > > Thanks, > > Neil Conway > >