> On July 15, 2019, 9:14 a.m., Benjamin Bannier wrote:
> > src/master/master.cpp
> > Lines 6255-6260 (patched)
> > <https://reviews.apache.org/r/71008/diff/4/?file=2154545#file2154545line6255>
> >
> >     It seems we only do this check to make sure we can access the config 
> > below, which introduces quite a bit of coupling. Is there a reason we couldn't 
> > grab the config outside the lambda and capture it instead (i.e., do we want 
> > to support mutable drain configs)? That would allow us to reduce coupling 
> > between `Slave::draining` and `markGone`.
> 
> Joseph Wu wrote:
>     This check is specifically to guard against an interleaving of the 
> `RemoveSlave` and `MarkAgentDrained` registry operations.  There are a 
> variety of ways to trigger the `RemoveSlave`, one of which is shutting down 
> the agent (SIGUSR1).
>     
>     So imagine the following sequence of events:
>     1) Agent sends the master a `UnregisterSlaveMessage`.
>     2) Master starts the `RemoveSlave` operation.
>     3) Final terminal ACK arrives at the master, which causes the master to 
> call `checkAndTransitionDrainingAgent` and `MarkAgentDrained`.
>     4) `RemoveSlave` completes.  Master clears memory of that agent.
>     5) `MarkAgentDrained` completes.  Master no longer knows about that agent 
> and hits this LOG line.
> 
> Benjamin Bannier wrote:
>     That chain of events seems pretty clear, but I was after something else: 
> right now we seem to perform this check here just so we can access the 
> config; `markGone` asserts that `slaves.markingGone.contains(slaveId)` while 
> here we ensure `slaves.draining.contains(slaveId)`. That seems like 
> unnecessary and complicated coupling to me, which I'd prefer we didn't 
> introduce.
>     
>     In order to remove the need for checking `slaves.draining` we could 
> capture the drain config by value into the closure (which would effectively 
> require that drain configs are immutable) and would then invoke `markGone` 
> regardless of whether an agent is in `slaves.draining`. For your point (5) we 
> should instead perform a precondition check with something more closely 
> related, e.g., check whether the agent is present in `slaves.markingGone`.

The `Master::markGone()` function is currently called by the 
`Master::_markAgentGone()` handler, which ensures that the agent exists, then 
checkpoints to the registry, then unconditionally calls into 
`Master::markGone()`. It looks like the only requirement of 
`Master::markGone()` is that the agent is present in 
`master->slaves.markingGone`. So, I think as long as we add the agent to 
`slaves.markingGone` before persisting to the registry, it's fine to capture 
the `markGone` bit in the lambda and call into `Master::markGone()` whether or 
not the agent has been removed in the interim. And actually, if we add the 
agent to `slaves.markingGone`, then the `UnregisterSlaveMessage` code path 
would not succeed in removing that agent, since it checks the contents of 
`slaves.markingGone`.
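
To make the ordering concrete, here is a rough, self-contained sketch of the idea. It is a toy model and not the actual master code: `ToyMaster`, `DrainConfig`, `unregister()`, `markAgentDrained()`, and `runPending()` are hypothetical stand-ins, and the queue of callbacks only imitates the asynchronous registry operations. It assumes the drain config is copied when the operation starts and that the agent is added to `markingGone` before the registry write is issued.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the drain config; only the bit discussed
// in this thread is modeled.
struct DrainConfig {
  bool markGone;
};

// Toy model of the master's bookkeeping (assumption: plain sets of
// agent IDs instead of the real `Slaves` struct).
struct ToyMaster {
  std::set<std::string> registered;
  std::set<std::string> markingGone;
  std::set<std::string> gone;

  // Continuations of "registry operations", run later by runPending()
  // so that other events can interleave before they complete.
  std::vector<std::function<void()>> pending;

  // Models the `UnregisterSlaveMessage` path: an agent that is being
  // marked gone is not removed here.
  bool unregister(const std::string& id) {
    if (markingGone.count(id) > 0) {
      return false;
    }
    return registered.erase(id) > 0;
  }

  // Models `markGone`: the only precondition is membership in
  // `markingGone`, not in any draining-related set.
  void markGone(const std::string& id) {
    assert(markingGone.count(id) > 0);
    markingGone.erase(id);
    registered.erase(id);
    gone.insert(id);
  }

  // Models the DRAINING -> DRAINED transition.
  void markAgentDrained(const std::string& id, const DrainConfig& config) {
    // Copy the bit now; the closure does not look the config up later.
    const bool shouldMarkGone = config.markGone;

    // Add to `markingGone` BEFORE the registry write is issued.
    if (shouldMarkGone) {
      markingGone.insert(id);
    }

    pending.push_back([this, id, shouldMarkGone]() {
      if (shouldMarkGone) {
        markGone(id);
      }
    });
  }

  // Completes all queued "registry operations".
  void runPending() {
    std::vector<std::function<void()>> ops;
    ops.swap(pending);
    for (auto& op : ops) {
      op();
    }
  }
};
```

In this model, an `unregister()` that arrives between `markAgentDrained()` and `runPending()` is rejected, so the continuation can call `markGone()` without first checking whether the agent is still in a draining set.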


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71008/#review216605
-----------------------------------------------------------


On July 15, 2019, 6:19 p.m., Joseph Wu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71008/
> -----------------------------------------------------------
> 
> (Updated July 15, 2019, 6:19 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Benjamin Mahler, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-9814
>     https://issues.apache.org/jira/browse/MESOS-9814
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This adds logic in the master to detect when a DRAINING agent can
> be transitioned into a DRAINED state.  When this happens, the new
> state is checkpointed into the registry and, if the agent is to be
> marked "gone", the master will remove the agent.
> 
> 
> Diffs
> -----
> 
>   src/master/http.cpp cd0f40cb7b966d6620e3fb49d4c08807185c9101 
>   src/master/master.hpp e8def83fe9bcee19772df9a9764852bc694c5247 
>   src/master/master.cpp 5247377c2e7e92b9843dd4c9d28f92ba679ad742 
> 
> 
> Diff: https://reviews.apache.org/r/71008/diff/5/
> 
> 
> Testing
> -------
> 
> See: https://reviews.apache.org/r/71069/
> 
> 
> Thanks,
> 
> Joseph Wu
> 
>
