-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/73131/
-----------------------------------------------------------

Review request for mesos and Benjamin Mahler.


Bugs: MESOS-10209
    https://issues.apache.org/jira/browse/MESOS-10209


Repository: mesos


Description
-------

During master failover, if agent reregistration runs concurrently with marking the agent as unreachable and finishes before the MarkUnreachable operation completes, the assertion in Master::_markUnreachable() that the agent is in the recovered set does not hold. This happens because, after readmitting the agent, the master removes it from the recovered set in Master::__reregisterSlave().

We can fix this by ignoring agent reregistration requests while a marking unreachable operation is in progress, similar to how we do it for marking gone. Once the marking operation is complete, the agent will be able to reregister as usual.


Diffs
-----

  src/master/master.cpp 164720a3ad40773b6de0268e3a7119de04bf297e
  src/tests/master_tests.cpp cd0973ed4cc8fc33de714d59c7680aef05b97b47


Diff: https://reviews.apache.org/r/73131/diff/1/


Testing
-------

Ran `make check`. Verified that the new test crashes without the fix.


Thanks,

Ilya Pronin
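

Illustration
------------

The race and the guard described above can be sketched with a small toy model. This is not the code in src/master/master.cpp: `ToyMaster`, its sets, and its method names are hypothetical and only mirror the bookkeeping the description mentions (a recovered set, an in-progress marking operation, and an assertion when marking completes).

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model only; every name here is hypothetical, not a Mesos API.
struct ToyMaster {
  std::set<std::string> recovered;           // Known from the registry, not yet reregistered.
  std::set<std::string> markingUnreachable;  // Marking operation currently in flight.
  std::set<std::string> registered;
  std::set<std::string> unreachable;

  // Starts the (asynchronous) operation that marks an agent unreachable.
  void beginMarkUnreachable(const std::string& id) {
    markingUnreachable.insert(id);
  }

  // Handles a reregistration request. Returns false if the request is ignored.
  bool reregister(const std::string& id) {
    // The guard: ignore the request while marking is in progress.
    // Without it, the erase below would break the assertion in
    // finishMarkUnreachable().
    if (markingUnreachable.count(id) > 0) {
      return false;
    }

    recovered.erase(id);
    unreachable.erase(id);
    registered.insert(id);
    return true;
  }

  // Runs when the marking operation completes.
  void finishMarkUnreachable(const std::string& id) {
    // The invariant that the race would otherwise violate.
    assert(recovered.count(id) > 0);

    recovered.erase(id);
    markingUnreachable.erase(id);
    unreachable.insert(id);
  }
};
```

In this model, a reregistration attempt that arrives between `beginMarkUnreachable()` and `finishMarkUnreachable()` is dropped, so the agent stays in the recovered set until marking completes. A later attempt then succeeds as usual.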