[ https://issues.apache.org/jira/browse/MESOS-10007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957348#comment-16957348 ]
Charles commented on MESOS-10007: --------------------------------- [~bmahler] Thanks, the patch works :). > random "Failed to get exit status for Command" for short-lived commands > ----------------------------------------------------------------------- > > Key: MESOS-10007 > URL: https://issues.apache.org/jira/browse/MESOS-10007 > Project: Mesos > Issue Type: Bug > Components: executor, libprocess > Reporter: Charles > Priority: Major > Labels: foundations > Attachments: > 0001-Avoid-double-reaping-race-in-the-command-executor.patch, > executor_race_reprod.diff, test_scheduler.py > > > Hi, > While testing Mesos to see if we could use it at work, I encountered a random > bug which I believe happens when a command exits really quickly, when run via > the command executor. > See the attached test case, but basically all it does is constantly start > "exit 0" tasks. > At some point, a task randomly fails with the error "Failed to get exit > status for Command": > > {noformat} > 'state': 'TASK_FAILED', 'message': 'Failed to get exit status for Command', > 'source': 'SOURCE_EXECUTOR',{noformat} > > I've had a look at the code, and I found something which could potentially > explain it - it's the first time I look at the code so apologies if I'm > missing something. > We can see the error originates from `reaped`: > [https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L1017] > {noformat} > } else if (status_->isNone()) { > taskState = TASK_FAILED; > message = "Failed to get exit status for Command"; > } else {{noformat} > > Looking at the code, we can see that the `status_` future can be set to > `None` in `ReaperProcess::reap`: > [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L69] > > > {noformat} > Future<Option<int>> ReaperProcess::reap(pid_t pid) > { > // Check to see if this pid exists. > if (os::exists(pid)) { > Owned<Promise<Option<int>>> promise(new Promise<Option<int>>()); > promises.put(pid, promise); > return promise->future(); > } else { > return None(); > } > }{noformat} > > > So we could have this if the process has already been reaped (`kill -0` will > fail). > > Now, looking at the code path which spawns the process: > `launchTaskSubprocess` > [https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L724] > > calls `subprocess`: > [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L315] > > If we look at the bottom of the function we can see the following: > [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462] > > > {noformat} > // We need to bind a copy of this Subprocess into the onAny callback > // below to ensure that we don't close the file descriptors before > // the subprocess has terminated (i.e., because the caller doesn't > // keep a copy of this Subprocess around themselves). > process::reap(process.data->pid) > .onAny(lambda::bind(internal::cleanup, lambda::_1, promise, process)); > return process;{noformat} > > > So at this point we've already called `process::reap`. > > And after that, the executor also calls `process::reap`: > [https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801] > > > {noformat} > // Monitor this process. > process::reap(pid.get()) > .onAny(defer(self(), &Self::reaped, pid.get(), lambda::_1));{noformat} > > > But if we look at the implementation of `process::reap`: > [https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L152] > > > {noformat} > Future<Option<int>> reap(pid_t pid) > { > // The reaper process is instantiated in `process::initialize`. > process::initialize(); return dispatch( > internal::reaper, > &internal::ReaperProcess::reap, > pid); > }{noformat} > We can see that `ReaperProcess::reap` is going to get called asynchronously. > > Doesn't this mean that it's possible that the first call to `reap` set up by > `subprocess` > ([https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462)|https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/subprocess.cpp#L462] > will get executed first, and if the task has already exited by that time, the > child will get reaped before the call to `reap` set up by the executor > ([https://github.com/apache/mesos/blob/master/src/launcher/executor.cpp#L801]) > gets a chance to run? > > In that case, when it runs > > {noformat} > if (os::exists(pid)) {{noformat} > would return false, `reap` would set the future to None which would result in > this error. > -- This message was sent by Atlassian Jira (v8.3.4#803005)