> On April 14, 2016, 4:49 p.m., Vinod Kone wrote:
> > src/launcher/http_command_executor.cpp, line 749
> > <https://reviews.apache.org/r/46187/diff/1/?file=1343828#file1343828line749>
> >
> >     Looking at slave::statusUpdate() code there are several scenarios where 
> > the slave ignores a status update sent by the executor; this means this 
> > executor could end up not terminating forever.
> >     
> >     Can you do the following:
> >     
> >     --> Enque a message in the queue to self terminate after some timeout 
> > (you can use the delay() function) to be safe.
> >     
> >     --> Add a TODO that we do this to be safe and also because slave 
> > sometimes doesn't ACK a status update. Link to a ticket that fixes the 
> > slave status update semantics to always ACK a status update sent by an 
> > executor.
> >     
> >     sounds good?
> 
> Vinod Kone wrote:
>     @Qian, any update on this? If this particular review is going to take 
> some time, I think it is still useful two commit the other 2 reviews in this 
> chain. AFAICT, they are independent of this review?
> 
> Qian Zhang wrote:
>     @Vinod, sorry for the late. I have filed a ticket 
> (https://issues.apache.org/jira/browse/MESOS-5262) for enhancing 
> `slave::statusUpdate()` to always ACK the status update sent by executor.
>     
>     And can you please elaborate about the specific scenarios this executor 
> could not terminate forever. Originially I thought the scenario should be: 
> executor sends a terminal status upate to slave when the corresponding 
> framework is in `TERMINATING` state (e.g., operator tears down the 
> framework), then in `Slave::statusUpdate()`, this status update will be 
> ignored, so the executor will not get the ACK. But after testing, I found in 
> this case the executor can still terminate, because the container 
> corresponded to this executor will be destroyed by 
> `Slave::shutdownExecutorTimeout()` -> `MesosContainerizer::destroy()`, so 
> after `--executor_shutdown_grace_period`, the executor can still terminate. 
> So I am not in which case the executor will never terminate.
>     
>     And yes, the other 2 patches are independent of this one, I will make 
> them not depending on this one in the review board, thanks!
> 
> Qian Zhang wrote:
>     After more thinking, I see one scenario the executor could never 
> terminate is: agent is down right after it sends SHUTDOWN event to executor. 
> In this case, when handling the SHUTDOWN event, executor will kill the task 
> and send TASK_KILLED status update to agent, but it will not get ACK since 
> agent is already down, so the executor will still run. But I think once agent 
> is started again, executor will receive the ACK and then terminate. I am not 
> sure if this behavior is OK, or we want executor to terminate once it 
> receives the SHUTDOWN event rather than wait for agent is started again?
> 
> Vinod Kone wrote:
>     I think it is the right thing for the executor to stay up until the agent 
> comes back up and sends an ACK. Otherwise, which is currently the case, the 
> agent has no idea about the exit status of the executor.
>     
>     So the cases where I see no ACKs from the agent are:
>     
>     1) `update` doesn't have UUID.
>       --> Is this possible with the HTTP executor? If not, we should probably 
> shutdown the executor instead of just ignoring.
>     
>     2) Framework is unknown.
>       --> Don't think we can do much here because we don't have access to 
> Executor object?
>     
>     3) Framework is terminating.
>       --> Per your observation the agent is guaranteed to destroy the 
> container. We need to add a comment here explainining this.
>     
>     4) Executor is unknown. While we forward the status update to the 
> framework, the ACK is dropped by `_statusUpdate` and `___statusUpdate()`
>        --> If the update is generated by the agent (`killTask()`, 
> `_runTask()`) we should be fine because there is no executor.
>        --> If the update is sent by an executor for a task belonging to 
> another executor, that another executor is hopefully already dead. So we 
> should be fine.
>        --> Is it possible for the update to be sent by an executor that is 
> not known to the agent? Don't think we can do much here since we don't have 
> access to the Executor object.
> 
> Qian Zhang wrote:
>     > I think it is the right thing for the executor to stay up until the 
> agent comes back up and sends an ACK. Otherwise, which is currently the case, 
> the agent has no idea about the exit status of the executor.
>     
>     But that seems conflict with your original comment. My understanding to 
> your original comment is, to avoid executor not terminating forever, executor 
> needs to enque a message to self terminate after some timeout (I think we can 
> do it via delay() in `reaped()` after terminal status update is sent), if so, 
> then when executor handles SHUTDOWN event, it will self terminate anyway 
> during agent down, so it is not possible to make executor stay up until agent 
> comes back.
>     
>     With this patch, I think the behavior of HTTP commmand executor and 
> driver based command executor will be different, let's see this scenario: a 
> checkpoint-enabled frameworks launches a "sleep 100" task, after the task is 
> running, agent is down, and during agent down, the task finishes, and then 
> agent is started again. In this case, after agent is started again, for 
> driver-based command executor, framework will receive a TASK_FAILED since the 
> executor has terminated during agent is down, but for HTTP command executor 
> (with this patch applied), framework will receive a TASK_FINISHED since the 
> executer will NOT terminate until it receives the ACK for the TASK_FINISHED.
>     
>     I think we need to make them have consistent behavior. Any suggestions? 
> :-)
>     
>     > 1) update doesn't have UUID.
>       --> Is this possible with the HTTP executor? If not, we should probably 
> shutdown the executor instead of just ignoring.
>     
>     This is not possible with HTTP command executor, because in 
> `HttpCommandExecutor::update()` we always set UUID in the status upate before 
> sending it to agent.
>     
>     > 2) Framework is unknown.
>       --> Don't think we can do much here because we don't have access to 
> Executor object?
>     
>     I do not think this will happen in our case, because framework will only 
> be removed (`Slave::removeFramework()`) when it has no executor 
> (`framework->executors.empty()`), but in our case, it must have one executor 
> (the HTTP command executor).
>     
>     
>     > 4) Executor is unknown. While we forward the status update to the 
> framework, the ACK is dropped by _statusUpdate and ___statusUpdate()
>     
>     >   --> If the update is generated by the agent (killTask(), _runTask()) 
> we should be fine because there is no executor.
>     
>     >   --> If the update is sent by an executor for a task belonging to 
> another executor, that another executor is hopefully already dead. So we 
> should be fine.
>     
>     >   --> Is it possible for the update to be sent by an executor that is 
> not known to the agent? Don't think we can do much here since we don't have 
> access to the Executor object.
>     
>     I do not think it is possible that an executor sends a update to agent 
> but agent does not know this executor.

There might be no cases today where executor doesn't get an ACK, but I would 
still prefer for us to be safe here incase there is a bug in the slave now or 
in the future. So lets add a timeout to self terminate with a comment saying 
that we do that to be safe. The executor can exit with non-zero status so that 
we can catch such situations.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46187/#review128916
-----------------------------------------------------------


On April 27, 2016, 1:23 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46187/
> -----------------------------------------------------------
> 
> (Updated April 27, 2016, 1:23 a.m.)
> 
> 
> Review request for mesos, Anand Mazumdar and Vinod Kone.
> 
> 
> Bugs: MESOS-4855
>     https://issues.apache.org/jira/browse/MESOS-4855
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Terminate when receiving the ACK of terminal status update.
> 
> 
> Diffs
> -----
> 
>   src/launcher/http_command_executor.cpp 
> 0b4136c040ec9cf585c0d38f8471495a855cd640 
> 
> Diff: https://reviews.apache.org/r/46187/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>

Reply via email to