> On April 14, 2016, 4:49 p.m., Vinod Kone wrote: > > src/launcher/http_command_executor.cpp, line 749 > > <https://reviews.apache.org/r/46187/diff/1/?file=1343828#file1343828line749> > > > > Looking at slave::statusUpdate() code there are several scenarios where > > the slave ignores a status update sent by the executor; this means this > > executor could end up not terminating forever. > > > > Can you do the following: > > > > --> Enque a message in the queue to self terminate after some timeout > > (you can use the delay() function) to be safe. > > > > --> Add a TODO that we do this to be safe and also because slave > > sometimes doesn't ACK a status update. Link to a ticket that fixes the > > slave status update semantics to always ACK a status update sent by an > > executor. > > > > sounds good? > > Vinod Kone wrote: > @Qian, any update on this? If this particular review is going to take > some time, I think it is still useful two commit the other 2 reviews in this > chain. AFAICT, they are independent of this review? > > Qian Zhang wrote: > @Vinod, sorry for the late. I have filed a ticket > (https://issues.apache.org/jira/browse/MESOS-5262) for enhancing > `slave::statusUpdate()` to always ACK the status update sent by executor. > > And can you please elaborate about the specific scenarios this executor > could not terminate forever. Originially I thought the scenario should be: > executor sends a terminal status upate to slave when the corresponding > framework is in `TERMINATING` state (e.g., operator tears down the > framework), then in `Slave::statusUpdate()`, this status update will be > ignored, so the executor will not get the ACK. But after testing, I found in > this case the executor can still terminate, because the container > corresponded to this executor will be destroyed by > `Slave::shutdownExecutorTimeout()` -> `MesosContainerizer::destroy()`, so > after `--executor_shutdown_grace_period`, the executor can still terminate. > So I am not in which case the executor will never terminate. > > And yes, the other 2 patches are independent of this one, I will make > them not depending on this one in the review board, thanks! > > Qian Zhang wrote: > After more thinking, I see one scenario the executor could never > terminate is: agent is down right after it sends SHUTDOWN event to executor. > In this case, when handling the SHUTDOWN event, executor will kill the task > and send TASK_KILLED status update to agent, but it will not get ACK since > agent is already down, so the executor will still run. But I think once agent > is started again, executor will receive the ACK and then terminate. I am not > sure if this behavior is OK, or we want executor to terminate once it > receives the SHUTDOWN event rather than wait for agent is started again? > > Vinod Kone wrote: > I think it is the right thing for the executor to stay up until the agent > comes back up and sends an ACK. Otherwise, which is currently the case, the > agent has no idea about the exit status of the executor. > > So the cases where I see no ACKs from the agent are: > > 1) `update` doesn't have UUID. > --> Is this possible with the HTTP executor? If not, we should probably > shutdown the executor instead of just ignoring. > > 2) Framework is unknown. > --> Don't think we can do much here because we don't have access to > Executor object? > > 3) Framework is terminating. > --> Per your observation the agent is guaranteed to destroy the > container. We need to add a comment here explainining this. > > 4) Executor is unknown. While we forward the status update to the > framework, the ACK is dropped by `_statusUpdate` and `___statusUpdate()` > --> If the update is generated by the agent (`killTask()`, > `_runTask()`) we should be fine because there is no executor. > --> If the update is sent by an executor for a task belonging to > another executor, that another executor is hopefully already dead. So we > should be fine. > --> Is it possible for the update to be sent by an executor that is > not known to the agent? Don't think we can do much here since we don't have > access to the Executor object. > > Qian Zhang wrote: > > I think it is the right thing for the executor to stay up until the > agent comes back up and sends an ACK. Otherwise, which is currently the case, > the agent has no idea about the exit status of the executor. > > But that seems conflict with your original comment. My understanding to > your original comment is, to avoid executor not terminating forever, executor > needs to enque a message to self terminate after some timeout (I think we can > do it via delay() in `reaped()` after terminal status update is sent), if so, > then when executor handles SHUTDOWN event, it will self terminate anyway > during agent down, so it is not possible to make executor stay up until agent > comes back. > > With this patch, I think the behavior of HTTP commmand executor and > driver based command executor will be different, let's see this scenario: a > checkpoint-enabled frameworks launches a "sleep 100" task, after the task is > running, agent is down, and during agent down, the task finishes, and then > agent is started again. In this case, after agent is started again, for > driver-based command executor, framework will receive a TASK_FAILED since the > executor has terminated during agent is down, but for HTTP command executor > (with this patch applied), framework will receive a TASK_FINISHED since the > executer will NOT terminate until it receives the ACK for the TASK_FINISHED. > > I think we need to make them have consistent behavior. Any suggestions? > :-) > > > 1) update doesn't have UUID. > --> Is this possible with the HTTP executor? If not, we should probably > shutdown the executor instead of just ignoring. > > This is not possible with HTTP command executor, because in > `HttpCommandExecutor::update()` we always set UUID in the status upate before > sending it to agent. > > > 2) Framework is unknown. > --> Don't think we can do much here because we don't have access to > Executor object? > > I do not think this will happen in our case, because framework will only > be removed (`Slave::removeFramework()`) when it has no executor > (`framework->executors.empty()`), but in our case, it must have one executor > (the HTTP command executor). > > > > 4) Executor is unknown. While we forward the status update to the > framework, the ACK is dropped by _statusUpdate and ___statusUpdate() > > > --> If the update is generated by the agent (killTask(), _runTask()) > we should be fine because there is no executor. > > > --> If the update is sent by an executor for a task belonging to > another executor, that another executor is hopefully already dead. So we > should be fine. > > > --> Is it possible for the update to be sent by an executor that is > not known to the agent? Don't think we can do much here since we don't have > access to the Executor object. > > I do not think it is possible that an executor sends a update to agent > but agent does not know this executor.
There might be no cases today where executor doesn't get an ACK, but I would still prefer for us to be safe here incase there is a bug in the slave now or in the future. So lets add a timeout to self terminate with a comment saying that we do that to be safe. The executor can exit with non-zero status so that we can catch such situations. - Vinod ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/46187/#review128916 ----------------------------------------------------------- On April 27, 2016, 1:23 a.m., Qian Zhang wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/46187/ > ----------------------------------------------------------- > > (Updated April 27, 2016, 1:23 a.m.) > > > Review request for mesos, Anand Mazumdar and Vinod Kone. > > > Bugs: MESOS-4855 > https://issues.apache.org/jira/browse/MESOS-4855 > > > Repository: mesos > > > Description > ------- > > Terminate when receiving the ACK of terminal status update. > > > Diffs > ----- > > src/launcher/http_command_executor.cpp > 0b4136c040ec9cf585c0d38f8471495a855cd640 > > Diff: https://reviews.apache.org/r/46187/diff/ > > > Testing > ------- > > make check > > > Thanks, > > Qian Zhang > >