> On Feb. 28, 2018, 10:04 a.m., Greg Mann wrote:
> > src/docker/executor.cpp
> > Lines 242-247 (patched)
> > <https://reviews.apache.org/r/65759/diff/3/?file=1965994#file1965994line242>
> >
> >     Doesn't this render the `onFailed` callback registered on L357 useless? 
> > i.e., the `inspect` future will never transition to the failed state?
> >     
> >     If so, we should probably either remove the `onFailed` callback, or do 
> > this instead:
> >     ```
> >     if (!future.hasDiscard()) {
> >       return Break(future);
> >     }
> >     ```
> >     
> >     I guess the question is: if the inspect call actually fails, rather 
> > than hanging, do we want to retry? It looks like there are several cases in 
> > `Docker::inspect` which will result in a failure (failed to create 
> > subprocess, failed to read stdout, etc..), and it looks to me like we could 
> > probably just retry in those cases. WDYT?
> 
> Andrei Budnik wrote:
>     Good point! If a docker daemon returns non-zero, the docker libray will 
> retry `inspect`, then we'll get a message kile: 
>     `I0228 17:28:13.275115  3248 docker.cpp:1369] Retrying inspect with 
> non-zero status code. cmd: 'docker -H unix:///var/run/docker.sock inspect 
> mesos-210b988c-c808-47e5-af65-75f40269755b', interval: 500ms`
>     
>     But if the docker library returns a failure itself due to some severe bug 
> (failed to create subprocess, failed to read stdout, etc...), then IMO we 
> should stop retrying `inspect`:
>     
>     ```
>            [](const Future<Docker::Container>& future)
>               -> Future<ControlFlow<Docker::Container>> {
>               if (future.isReady()) {
>                 return Break(future.get());
>               }
>               if (future.isFailed()) {
>                 return Failure(future.failure());
>               }
>               return Continue();
>             });
>     ```

SGTM!


- Greg


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65759/#review198382
-----------------------------------------------------------


On Feb. 22, 2018, 8:32 p.m., Andrei Budnik wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65759/
> -----------------------------------------------------------
> 
> (Updated Feb. 22, 2018, 8:32 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Michael Park.
> 
> 
> Bugs: MESOS-8574
>     https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This patch adds retries for `inspect` command to workaround docker
> daemon hangs. We assume that the docker daemon can be temporarily
> unresponsive. If it's unresponsive, then any started docker cli
> command hangs. To address the issue, we retry `inspect` in the loop.
> 
> 
> Diffs
> -----
> 
>   src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 
> 
> 
> Diff: https://reviews.apache.org/r/65759/diff/4/
> 
> 
> Testing
> -------
> 
> internal CI
> 
> Manually, described in /r/65713
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>

Reply via email to