Till Toenshoff created MESOS-1431:
-------------------------------------

             Summary: io::splice usage needs special care - especially in 
connection with process::subprocess
                 Key: MESOS-1431
                 URL: https://issues.apache.org/jira/browse/MESOS-1431
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 0.19.0
            Reporter: Till Toenshoff


When using {{process::subproces}} in connection with {{io::splice}}, make sure 
you work with extra care. 

Consider this piece of code (ripped from EC), which looks rather unspectacular 
- very straightforward use of subprocess and splice.

{noformat}
  // Fork exec of external process.
  Try<Subprocess> external = process::subprocess(
      execute,
      environment);

  if (external.isError()) {
    return Error("Failed to execute external containerizer: " +
                 external.error());
  }

  // Set stderr into non-blocking mode.
  Try<Nothing> nonblock = os::nonblock(external.get().err());
  if (nonblock.isError()) {
    return Error("Failed to accept nonblock: " + nonblock.error());
  }

  // Redirect output (stderr) from the subprocess to a log
  // file.
  Try<int> err = os::open(
      sandbox.isSome() ? path::join(sandbox.get().directory, "stderr")
                       : "/dev/null",
      O_WRONLY | O_CREAT | O_APPEND | O_NONBLOCK,
      S_IRUSR | S_IWUSR | S_IRGRP | S_IRWXO);
  if (err.isError()) {
    return Error(
        "Failed to redirect stderr: Failed to open: " +
        err.error());
  }

  Try<Nothing> cloexec = os::cloexec(err.get());
  if (cloexec.isError()) {
    os::close(err.get());
    return Error(
        "Failed to redirect stderr: Failed to set close-on-exec: " +
        cloexec.error());
  }

  io::splice(external.get().err(), err.get())
    .onAny(&os::close, err.get()));
{noformat}

The above code is buggy as subprocess will close {{external.get().err()}} once 
the child got reaped. So the FD we are reading (splicing) from potentially gets 
closed before the splicer was able to get the {{EOF}}. That in turn will cause 
libev to continue polling that FD (remember, it never reached a final state) 
and that is where havoc breaks lose as we will now see data getting lost that 
is sent from reused FDs.
 
If you do not {{dup}} the subprocess returned FDs before using an 
{{io::splice}} on it, thus not taking full ownership of the FD, you will end up 
with fancy problems.

So a fix would be replacing the last two lines by this:

{noformat}
  // Duplicate 'from' so that we're in control of it's lifetime,
  // exceeding the lifetime of the reaped subprocess. It is needed
  // to make sure the splicer had a chance to read and process the
  // EOF.
  int fd = dup(external.get().err());
  if (fd == -1) {
    os::close(err.get());
    return ErrnoError("Failed to duplicate stderr file descriptor");
  }

  io::splice(fd, err.get())
    .onAny(bind(&_invoke, fd, err.get()));
{noformat}

And also adding an onAny triggered function (_invoke in this case) that is 
closing both, the dup'ed FD and the log-file FD.

The bad news are that the EC is not the only implementation that contains such 
buggy code. There is at least one more such bug found within 
mesos_containerizer.cpp:680 and we should now all check for more such use-cases 
and fix them before releasing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to