[ https://issues.apache.org/jira/browse/MESOS-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Toenshoff updated MESOS-1431: ---------------------------------- Assignee: Till Toenshoff > io::splice usage needs special care - especially in connection with > process::subprocess > --------------------------------------------------------------------------------------- > > Key: MESOS-1431 > URL: https://issues.apache.org/jira/browse/MESOS-1431 > Project: Mesos > Issue Type: Bug > Affects Versions: 0.19.0 > Reporter: Till Toenshoff > Assignee: Till Toenshoff > Labels: bug, splice, subprocess > > When using {{process::subproces}} in connection with {{io::splice}}, make > sure you work with extra care. > Consider this piece of code (ripped from EC), which looks rather > unspectacular - very straightforward use of subprocess and splice. > {noformat} > // Fork exec of external process. > Try<Subprocess> external = process::subprocess( > execute, > environment); > if (external.isError()) { > return Error("Failed to execute external containerizer: " + > external.error()); > } > // Set stderr into non-blocking mode. > Try<Nothing> nonblock = os::nonblock(external.get().err()); > if (nonblock.isError()) { > return Error("Failed to accept nonblock: " + nonblock.error()); > } > // Redirect output (stderr) from the subprocess to a log > // file. > Try<int> err = os::open( > sandbox.isSome() ? path::join(sandbox.get().directory, "stderr") > : "/dev/null", > O_WRONLY | O_CREAT | O_APPEND | O_NONBLOCK, > S_IRUSR | S_IWUSR | S_IRGRP | S_IRWXO); > if (err.isError()) { > return Error( > "Failed to redirect stderr: Failed to open: " + > err.error()); > } > Try<Nothing> cloexec = os::cloexec(err.get()); > if (cloexec.isError()) { > os::close(err.get()); > return Error( > "Failed to redirect stderr: Failed to set close-on-exec: " + > cloexec.error()); > } > io::splice(external.get().err(), err.get()) > .onAny(&os::close, err.get())); > {noformat} > The above code is buggy as subprocess will close {{external.get().err()}} > once the child got reaped. So the FD we are reading (splicing) from > potentially gets closed before the splicer was able to get the {{EOF}}. That > in turn will cause libev to continue polling that FD (remember, it never > reached a final state) and that is where havoc breaks lose as we will now see > data getting lost that is sent from reused FDs. > > If you do not {{dup}} the subprocess returned FDs before using an > {{io::splice}} on it, thus not taking full ownership of the FD, you will end > up with fancy problems. > So a fix would be replacing the last two lines by this: > {noformat} > // Duplicate 'from' so that we're in control of it's lifetime, > // exceeding the lifetime of the reaped subprocess. It is needed > // to make sure the splicer had a chance to read and process the > // EOF. > int fd = dup(external.get().err()); > if (fd == -1) { > os::close(err.get()); > return ErrnoError("Failed to duplicate stderr file descriptor"); > } > io::splice(fd, err.get()) > .onAny(bind(&_invoke, fd, err.get())); > {noformat} > And also adding an onAny triggered function (_invoke in this case) that is > closing both, the dup'ed FD and the log-file FD. > The bad news are that the EC is not the only implementation that contains > such buggy code. There is at least one more such bug found within > mesos_containerizer.cpp:680 and we should now all check for more such > use-cases and fix them before releasing. -- This message was sent by Atlassian JIRA (v6.2#6252)