Till Toenshoff created MESOS-1431: ------------------------------------- Summary: io::splice usage needs special care - especially in connection with process::subprocess Key: MESOS-1431 URL: https://issues.apache.org/jira/browse/MESOS-1431 Project: Mesos Issue Type: Bug Affects Versions: 0.19.0 Reporter: Till Toenshoff
When using {{process::subproces}} in connection with {{io::splice}}, make sure you work with extra care. Consider this piece of code (ripped from EC), which looks rather unspectacular - very straightforward use of subprocess and splice. {noformat} // Fork exec of external process. Try<Subprocess> external = process::subprocess( execute, environment); if (external.isError()) { return Error("Failed to execute external containerizer: " + external.error()); } // Set stderr into non-blocking mode. Try<Nothing> nonblock = os::nonblock(external.get().err()); if (nonblock.isError()) { return Error("Failed to accept nonblock: " + nonblock.error()); } // Redirect output (stderr) from the subprocess to a log // file. Try<int> err = os::open( sandbox.isSome() ? path::join(sandbox.get().directory, "stderr") : "/dev/null", O_WRONLY | O_CREAT | O_APPEND | O_NONBLOCK, S_IRUSR | S_IWUSR | S_IRGRP | S_IRWXO); if (err.isError()) { return Error( "Failed to redirect stderr: Failed to open: " + err.error()); } Try<Nothing> cloexec = os::cloexec(err.get()); if (cloexec.isError()) { os::close(err.get()); return Error( "Failed to redirect stderr: Failed to set close-on-exec: " + cloexec.error()); } io::splice(external.get().err(), err.get()) .onAny(&os::close, err.get())); {noformat} The above code is buggy as subprocess will close {{external.get().err()}} once the child got reaped. So the FD we are reading (splicing) from potentially gets closed before the splicer was able to get the {{EOF}}. That in turn will cause libev to continue polling that FD (remember, it never reached a final state) and that is where havoc breaks lose as we will now see data getting lost that is sent from reused FDs. If you do not {{dup}} the subprocess returned FDs before using an {{io::splice}} on it, thus not taking full ownership of the FD, you will end up with fancy problems. So a fix would be replacing the last two lines by this: {noformat} // Duplicate 'from' so that we're in control of it's lifetime, // exceeding the lifetime of the reaped subprocess. It is needed // to make sure the splicer had a chance to read and process the // EOF. int fd = dup(external.get().err()); if (fd == -1) { os::close(err.get()); return ErrnoError("Failed to duplicate stderr file descriptor"); } io::splice(fd, err.get()) .onAny(bind(&_invoke, fd, err.get())); {noformat} And also adding an onAny triggered function (_invoke in this case) that is closing both, the dup'ed FD and the log-file FD. The bad news are that the EC is not the only implementation that contains such buggy code. There is at least one more such bug found within mesos_containerizer.cpp:680 and we should now all check for more such use-cases and fix them before releasing. -- This message was sent by Atlassian JIRA (v6.2#6252)