[ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410286#comment-15410286 ]
Vinod Kone commented on MESOS-4049:
-----------------------------------

commit 5220f77582a14d4cdd0a907ba8af6e9db87d8ab7
Author: Neil Conway <neil.con...@gmail.com>
Date:   Fri Aug 5 16:41:41 2016 -0700

    Future-proofed some slave removal tests.

    These tests relied on the implementation detail that when an agent is
    removed from the list of registered agents, the master sends a
    ShutdownSlaveMessage to the agent. That will change in the future
    (MESOS-4049). To prepare for this future planned behavior, adjust these
    tests to be more robust by instead checking for the invocation of the
    `slaveLost` scheduler callback.

    Review: https://reviews.apache.org/r/50422/

commit 8a0b17a11560f482628e890094e83400fa805a80
Author: Neil Conway <neil.con...@gmail.com>
Date:   Fri Aug 5 16:41:35 2016 -0700

    Cleaned up comments in fault tolerance tests.

    Review: https://reviews.apache.org/r/50418/

commit 5de96fa4b3e603553dbae3f06aff6621b268a7be
Author: Neil Conway <neil.con...@gmail.com>
Date:   Fri Aug 5 16:41:28 2016 -0700

    Improved consistency of test code for partitioning an agent.

    Removed unnecessary `Clock::settle` calls: `Clock::settle` should
    typically only be used when a test case does not have an easy way to
    wait for a _specific_ event to occur. In this case, `Clock::settle` was
    unnecessary because the test code immediately proceeded to `AWAIT_READY`
    for a more specific event.

    Also fixed up some whitespace.

    Review: https://reviews.apache.org/r/50417/

commit 60dbd347b409c788776760a8270965d943b6806e
Author: Neil Conway <neil.con...@gmail.com>
Date:   Fri Aug 5 16:41:18 2016 -0700

    Added more assertions to master code.

    Review: https://reviews.apache.org/r/50416/

commit 29925658291be60bda7af7f83225d743e8d24870
Author: Neil Conway <neil.con...@gmail.com>
Date:   Fri Aug 5 16:41:10 2016 -0700

    Added more expectations to TASK_LOST test cases.

    Check the reason and source of TASK_LOST status updates, replaced
    ASSERT_ with EXPECT_ in various places where EXPECT_ is more
    appropriate.

    Review: https://reviews.apache.org/r/50235/

> Allow user to control behavior of partitioned agents/tasks
> ----------------------------------------------------------
>
>                 Key: MESOS-4049
>                 URL: https://issues.apache.org/jira/browse/MESOS-4049
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master, slave
>            Reporter: Neil Conway
>            Assignee: Neil Conway
>              Labels: mesosphere
>
> At present, if an agent is partitioned away from the master, the master waits
> for a period of time (see MESOS-4048) before deciding that the agent is dead.
> Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the
> tasks running on the agent, and instructs the agent to shutdown.
>
> Although this behavior is desirable for some/many users, it is not ideal for
> everyone. For example:
>
> * Some users might want to aggressively start a new replacement task (e.g.,
>   after one or two ping timeouts are missed); then when the old copy of the
>   task comes back, they might want to make an intelligent decision about how to
>   reconcile this situation (e.g., kill old, kill new, allow both to continue
>   running).
> * Some frameworks might want different behavior from other frameworks, or to
>   treat some tasks differently from other tasks. For example, if a task has a
>   huge amount of state that would need to be regenerated to spin up another
>   instance, the user might want to wait longer before starting a new task to
>   increase the chance that the old task will reappear.
>
> To do this, we'd need to change task state so that a task can go from
> {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from
> that state back to {{RUNNING}} (or perhaps we could keep the current
> "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}}
> could also transition to {{LOST}}). The agent would also keep its old
> {{slaveId}} when it reconnects.
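
The state transitions proposed in the issue description can be sketched as a small predicate. This is only an illustration of the idea, not code from the Mesos tree: the `TaskState` enum, `canTransition`, `transition`, and the `markLostAfterTimeout` flag are all hypothetical names, and `UNKNOWN` is just the placeholder state name the issue suggests. Only the transitions the issue mentions are modeled.

```cpp
#include <cassert>

// Hypothetical task states for illustration; UNKNOWN is the placeholder name
// floated in the issue for a task on a partitioned agent. These identifiers
// are assumptions and are not taken from the Mesos source.
enum class TaskState { RUNNING, UNKNOWN, LOST };

// Returns true if the proposed design would allow moving from `from` to `to`.
// `markLostAfterTimeout` (an assumed option name) models keeping the current
// "mark-lost-after-timeout" behavior, which permits UNKNOWN -> LOST.
constexpr bool canTransition(
    TaskState from, TaskState to, bool markLostAfterTimeout)
{
  return (from == TaskState::RUNNING && to == TaskState::UNKNOWN) ||
         (from == TaskState::UNKNOWN && to == TaskState::RUNNING) ||
         (from == TaskState::UNKNOWN && to == TaskState::LOST &&
          markLostAfterTimeout);
}

// Applies a transition and returns the new state; asserts that the move is
// one of the transitions modeled above.
inline TaskState transition(
    TaskState from, TaskState to, bool markLostAfterTimeout)
{
  assert(canTransition(from, to, markLostAfterTimeout));
  return to;
}
```

With the option disabled, a task on a partitioned agent may only move from `UNKNOWN` back to `RUNNING` when the agent reconnects; with it enabled, `UNKNOWN` may also move to `LOST`.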