[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077121#comment-14077121
 ] 

Ian Downes commented on MESOS-1199:
-----------------------------------

I committed os::exists() in stout/os/exists.hpp which uses kill(pid, 0), making 
the process liveness check much more efficient (previously it was constructing 
a full Process object from /proc).

I'm suggesting the polling interval be reduced from 1 second to 50 or 100 ms as 
an easy fix to get an order of magnitude reduction in wait latency. I'd like to 
see us to do this before embarking on a complicated path which could only be 
used when the slave is the actual parent, i.e., we'd have to fall back to 
polling anyway after a slave restart.

> Subprocess is "slow" -> gated by process::reap poll interval
> ------------------------------------------------------------
>
>                 Key: MESOS-1199
>                 URL: https://issues.apache.org/jira/browse/MESOS-1199
>             Project: Mesos
>          Issue Type: Improvement
>    Affects Versions: 0.18.0
>            Reporter: Ian Downes
>            Assignee: Craig Hansen-Sturm
>
> Subprocess uses process::reap to wait on the subprocess pid and set the exit 
> status. However, process::reap polls with a one second interval resulting in 
> a delay up to the interval duration before the status future is set.
> This means if you need to wait for the subprocess to complete you get hit 
> with E(delay) = 0.5 seconds, independent of the execution time. For example, 
> the MesosContainerizer uses mesos-fetcher in a Subprocess to fetch the 
> executor during launch. At Twitter we fetch a local file, i.e., a very fast 
> operation, but the launch is blocked until the mesos-fetcher pid is reaped -> 
> adding 0 to 1 seconds for every launch!
> The problem is even worse with a chain of short Subprocesses because after 
> the first Subprocess completes you'll be synchronized with the reap interval 
> and you'll see nearly the full interval before notification, i.e., 10 
> Subprocesses each of << 1 second duration with take ~10 seconds!
> This has become particularly apparent in some new tests I'm working on where 
> test durations are now greatly extended with each taking several seconds.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to