----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/53897/ -----------------------------------------------------------
(Updated Dec. 7, 2016, 4:14 a.m.) Review request for mesos and Vinod Kone. Changes ------- Add additional CHECKs, add new unit test, tweak a few tests. Bugs: MESOS-6419 https://issues.apache.org/jira/browse/MESOS-6419 Repository: mesos Description (updated) ------- After master failover, the new master doesn't know which frameworks were registered with the previous master (because this information is not currently stored in the registry). In the period after the master fails over but before the framework scheduler has re-registered, the master learns about the frameworks in the cluster when agents re-register (an agent reports the FrameworkInfo for all of the frameworks it is running when it re-registers). Such frameworks were previously represented separately from the normal list of frameworks in the master: the master kept a collection of `FrameworkInfo` for these "recovered" frameworks. This commit removes this separate collection of "recovered" frameworks. Instead, the master now treats recovered frameworks very similarly to frameworks that are registered but currently disconnected. For example, recovered frameworks will now have a `Framework` object which tracks the tasks/executors running under that framework; recovered frameworks will be reported via the normal "frameworks" key when querying HTTP endpoints. Similarly, "teardown" operations on recovered frameworks will now work correctly (MESOS-6419). This means there is no longer a concept of "orphan tasks" [1]: if the master knows about a task, the task will be running under a framework (albeit the framework might be recovered or disconnected). A new "recovered" key has been added to various HTTP endpoints/APIs to determine if a framework hasn't yet re-registered after master failover. Another behavioral change is that, previously, a framework re-registering after master failover was allowed to make arbitrary changes to its `FrameworkInfo`. The master now only applies updates to FrameworkInfo fields that can be safely changed (as was already done when a framework fails over to a master that hasn't failed over). The previous behavior qualifies as a bug. [1] The exception here is if the cluster contains Mesos agents older than 1.0, because old Mesos agents don't report `FrameworkInfo`s when they re-register. Diffs (updated) ----- include/mesos/master/master.proto 966105cf14bf92ad15a38653cd5c7f3322821289 include/mesos/v1/master/master.proto 0e5a8674a76a46c6f2cc17e11cd735f756d94fd1 src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f src/master/master.hpp b444b2352360fb4f7179acd97dffc0cd81cc7afa src/master/master.cpp b0670d993348d189fafff0f83f9da0c5b18d1c51 src/tests/api_tests.cpp 7c70f284a9225ca23abc7049f48f3efba314b641 src/tests/fault_tolerance_tests.cpp 59f68f65fac9fca4a6941793b712bfe7bb30c880 src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 src/tests/teardown_tests.cpp 125e1808e51da53fdf1bb1babc956ff062cbb102 Diff: https://reviews.apache.org/r/53897/diff/ Testing ------- `make check` Thanks, Neil Conway