[ https://issues.apache.org/jira/browse/MESOS-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler reassigned MESOS-5195:
--------------------------------------

    Assignee: Benjamin Mahler

This looks to be a duplicate of MESOS-4279. I'll take this on since we don't currently have a responsive maintainer for the Docker support.

> Docker executor: task logs lost on shutdown
> -------------------------------------------
>
>                 Key: MESOS-5195
>                 URL: https://issues.apache.org/jira/browse/MESOS-5195
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker
>    Affects Versions: 0.27.2
>         Environment: Linux 4.4.2 "Ubuntu 14.04.2 LTS"
>            Reporter: Steven Schlansker
>            Assignee: Benjamin Mahler
>             Fix For: 1.0.0
>
>
> When you try to kill a task running in the Docker executor (in our case via
> Singularity), the task shuts down cleanly but the last logs to standard out /
> standard error are lost in teardown.
> For example, we run dumb-init. With debugging on, you can see it should
> write:
> {noformat}
> DEBUG("Forwarded signal %d to children.\n", signum);
> {noformat}
> If you attach strace to the process, you can see it clearly writes the text
> to stderr. But that message is lost and is never written to the sandbox
> 'stderr' file.
> We believe the issue starts here, in Docker executor.cpp:
> {code}
>   void shutdown(ExecutorDriver* driver)
>   {
>     cout << "Shutting down" << endl;
>     if (run.isSome() && !killed) {
>       // The docker daemon might still be in progress starting the
>       // container, therefore we kill both the docker run process
>       // and also ask the daemon to stop the container.
>       // Making a mutable copy of the future so we can call discard.
>       Future<Nothing>(run.get()).discard();
>       stop = docker->stop(containerName, stopTimeout);
>       killed = true;
>     }
>   }
> {code}
> Notice how the "run" future is discarded *before* the Docker daemon is told
> to stop -- now what will discarding it do?
> {code}
> void commandDiscarded(const Subprocess& s, const string& cmd)
> {
>   VLOG(1) << "'" << cmd << "' is being discarded";
>   os::killtree(s.pid(), SIGKILL);
> }
> {code}
> Oops, just sent SIGKILL to the entire process tree...
> You can see another (harmless?) side effect in the Docker daemon logs: it
> never gets a chance to kill the task:
> {noformat}
> ERROR Handler for DELETE
> /v1.22/containers/mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233
> returned error: No such container:
> mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233
> {noformat}
> I suspect that the fix is to wait for 'docker->stop()' to complete before
> discarding the 'run' future.
> Happy to provide more information if necessary.
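
The ordering problem described in the report can be illustrated with a toy model. This is only a sketch under stated assumptions, not Mesos code and not the actual fix: `FakeContainer`, `killRelay`, `shutdownDiscardFirst` and `shutdownStopFirst` are hypothetical names, the calls are synchronous (the real `docker->stop()` returns a future that would have to be awaited), and the model assumes the killed process tree is what relays the container's stderr to the sandbox.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in (hypothetical) for a container whose stderr is relayed
// to the sandbox 'stderr' file by a separate process.
struct FakeContainer {
  std::vector<std::string> sandboxStderr;  // lines that reached the sandbox
  bool relayAlive = true;                  // relay process still running?
  bool running = true;                     // container still running?

  // Models the effect of SIGKILL on the relaying process tree:
  // nothing written afterwards can reach the sandbox.
  void killRelay() { relayAlive = false; }

  // Models a graceful stop: the container handles the signal and
  // writes one last line to stderr before exiting.
  void stop() {
    assert(running);  // the toy model supports a single stop only
    const std::string lastLine = "Forwarded signal 15 to children.";
    if (relayAlive) {
      sandboxStderr.push_back(lastLine);  // relay forwards the line
    }
    running = false;
  }
};

// Order from the report: kill first, then ask for a stop.
void shutdownDiscardFirst(FakeContainer& c) {
  c.killRelay();
  c.stop();
}

// Suggested order: let the stop finish, then clean up the relay.
void shutdownStopFirst(FakeContainer& c) {
  c.stop();
  c.killRelay();
}
```

In this model, `shutdownDiscardFirst` leaves `sandboxStderr` empty, while `shutdownStopFirst` keeps the final line. In the real executor the second ordering would mean chaining the discard onto completion of the stop future rather than calling the two back to back.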