[ https://issues.apache.org/jira/browse/MESOS-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990141#comment-13990141 ]
Till Toenshoff edited comment on MESOS-1243 at 5/6/14 12:38 AM:
----------------------------------------------------------------

Recovery:

Right now {{recover}} is not container or executor specific, hence it shouldn't fail just because a single container wasn't recoverable for whatever reason.

Let me sketch this from the ExternalContainerizer's (EC) point of view in a failure scenario:

The slave invokes {{launch}} and the EC tries to pass this on to the ECP. Now assume the slave dies before the ECP is actually able to launch anything. After a {{recover}}, the slave assumes that the ECP will be able to {{wait}} on that container. The ECP, however, never launched that container, hence it is unable to {{wait}} and thus unable to return a {{Termination}}.

So the problem has to be seen with one specific point in mind: the ECP and the slave may disagree about a container's state.

The quick way out is to allow that {{Termination}} to be optional. Alternatively, could we make sure that the container is only checkpointed after a fully completed launch?
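To make the "optional {{Termination}}" idea concrete, here is a minimal sketch of what it could look like. Everything in it is an assumption for illustration: {{ToyContainerizer}}, {{recordTermination}} and {{unknownAfterRecover}} are hypothetical names, {{std::optional}} merely stands in for an {{Option}} type, and the real {{wait}} is asynchronous, which this toy version ignores.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not the real Mesos types.
using ContainerID = std::string;

struct Termination {
  bool killed;
  std::string message;
};

// Toy containerizer that only remembers containers it actually handled.
class ToyContainerizer {
public:
  // Records the outcome of a container that was really launched.
  void recordTermination(const ContainerID& id, Termination t) {
    terminations_[id] = std::move(t);
  }

  // Returns std::nullopt when the container is unknown, e.g. because the
  // slave died before the launch ever reached the containerizer.
  std::optional<Termination> wait(const ContainerID& id) const {
    auto it = terminations_.find(id);
    if (it == terminations_.end()) {
      return std::nullopt;
    }
    return it->second;
  }

private:
  std::map<ContainerID, Termination> terminations_;
};

// Slave-side view after a recover: instead of failing the whole recovery,
// collect the checkpointed containers the containerizer knows nothing about.
std::vector<ContainerID> unknownAfterRecover(
    const ToyContainerizer& containerizer,
    const std::vector<ContainerID>& checkpointed) {
  std::vector<ContainerID> unknown;
  for (const ContainerID& id : checkpointed) {
    if (!containerizer.wait(id).has_value()) {
      unknown.push_back(id);
    }
  }
  return unknown;
}
```

With this shape, a slave that checkpointed containers "c1" and "c2" but whose containerizer only ever saw "c1" would get an empty result for "c2" and could clean it up on its own side, rather than treating the missing {{Termination}} as a recovery failure.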
> Containerizer::wait return type should be Option<Termination>
> -------------------------------------------------------------
>
>                 Key: MESOS-1243
>                 URL: https://issues.apache.org/jira/browse/MESOS-1243
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Till Toenshoff
>            Priority: Minor
>              Labels: containerizer, external-containerizer, isolation, mesos, mesos-containerizer
>
> The containerizer {{wait}} should return an {{Option<Termination>}} to
> distinguish the case when it doesn't know about a {{ContainerID}}.

--
This message was sent by Atlassian JIRA
(v6.2#6252)