> On April 21, 2015, 11:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > <https://reviews.apache.org/r/33249/diff/3/?file=938221#file938221line3065>
> >
> > Instead of doing it your way, can we just try to make sure
> > `containerizer->wait` here will return a failure (or a Termination with
> > some reason) when `containerizer->launch` fails. In that way, the
> > `executorTerminated` will properly send status updates to the slave
> > (TASK_LOST/TASK_FAILED).
> >
> > Or am I missing something?
>
> Jie Yu wrote:
> OK, I think I got confused by the ticket. There are actually two problems
> here. The problem I am referring to is the fact that we don't send a status
> update to the scheduler if containerizer launch fails until the executor
> reregistration timeout happens. Since for the docker containerizer, someone
> might use a very large timeout value, ideally, the slave should send a status
> update to the scheduler right after containerizer launch fails.
>
> After chatting with Jay, the problem you guys are referring to is the fact
> that the scheduler cannot distinguish between the case where the task has
> failed vs. the case where the configuration of a task is not correct, because
> in both cases, the scheduler will receive a TASK_FAILED/TASK_LOST.
To address the first problem, I think the simplest way is to add a
containerizer->destroy(..) in executorLaunched when containerizer->launch
fails. In that way, it's going to trigger containerizer->wait and thus send a
status update to the scheduler.

- Jie

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review81090
-----------------------------------------------------------

On April 21, 2015, 5:14 p.m., Jay Buffington wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> -----------------------------------------------------------
>
> (Updated April 21, 2015, 5:14 p.m.)
>
>
> Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
>
>
> Bugs: MESOS-2020
>     https://issues.apache.org/jira/browse/MESOS-2020
>
>
> Repository: mesos
>
>
> Description
> -------
>
> When Mesos is unable to launch the containerizer, the scheduler should
> get a TASK_FAILED with a status message that includes the error the
> containerizer encountered when trying to launch.
>
> Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED
>
> Fixes MESOS-2020
>
>
> Diffs
> -----
>
>   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537
>   src/slave/slave.cpp 8ec80ed26f338690e0a1e712065750ab77a724cd
>   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188
>
> Diff: https://reviews.apache.org/r/33249/diff/
>
>
> Testing
> -------
>
> I added a test case to slave_tests.cpp. I also tried this with Aurora,
> supplied a bogus docker image URL, and saw the "docker pull" failure stderr
> message in Aurora's web UI.
>
>
> Thanks,
>
> Jay Buffington
>
>
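
For reference, the suggestion at the top of this reply (call destroy when launch fails, so that the pending wait completes and a status update goes out right away) can be sketched roughly as follows. This is only an illustration: `FakeContainerizer`, `runTask`, the simplified signatures, and the synchronous callbacks are all hypothetical stand-ins, not the real Mesos containerizer or slave code.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a containerizer. The method names mirror the
// ones discussed in the review, but the signatures are simplified and are
// NOT the real Mesos API.
class FakeContainerizer {
public:
  using Callback = std::function<void(const std::string&)>;

  // Registers a callback that runs once the container terminates.
  void wait(const std::string& id, Callback callback) {
    assert(callback != nullptr);
    waiters_[id].push_back(std::move(callback));
  }

  // Always fails in this sketch, modelling e.g. a bogus image URL.
  bool launch(const std::string& id) {
    errors_[id] = "docker pull failed";
    return false;
  }

  // Terminates the container and completes every pending wait on it,
  // passing along the recorded launch error (if any).
  void destroy(const std::string& id) {
    auto it = waiters_.find(id);
    if (it == waiters_.end()) {
      return;
    }
    std::vector<Callback> callbacks = std::move(it->second);
    waiters_.erase(it);
    for (const Callback& callback : callbacks) {
      callback(errors_[id]);
    }
  }

private:
  std::map<std::string, std::vector<Callback>> waiters_;
  std::map<std::string, std::string> errors_;
};

// Hypothetical slave-side flow: returns the status updates the scheduler
// would see for one task.
std::vector<std::string> runTask(FakeContainerizer& containerizer,
                                 const std::string& id) {
  std::vector<std::string> updates;

  // Analogous to the slave waiting on the container and turning its
  // termination into a status update.
  containerizer.wait(id, [&updates](const std::string& message) {
    updates.push_back("TASK_FAILED: " + message);
  });

  // The proposed change: destroy on launch failure so the wait above
  // completes immediately rather than at the registration timeout.
  if (!containerizer.launch(id)) {
    containerizer.destroy(id);
  }

  return updates;
}
```

Calling `runTask(containerizer, "task-1")` on a fresh `FakeContainerizer` yields a single update carrying the launch error. Without the `destroy` call the vector would stay empty, which corresponds to the scheduler hearing nothing until the timeout fires.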