> On Feb. 28, 2018, 10:04 a.m., Greg Mann wrote: > > src/docker/executor.cpp > > Lines 242-247 (patched) > > <https://reviews.apache.org/r/65759/diff/3/?file=1965994#file1965994line242> > > > > Doesn't this render the `onFailed` callback registered on L357 useless? > > i.e., the `inspect` future will never transition to the failed state? > > > > If so, we should probably either remove the `onFailed` callback, or do > > this instead: > > ``` > > if (!future.hasDiscard()) { > > return Break(future); > > } > > ``` > > > > I guess the question is: if the inspect call actually fails, rather > > than hanging, do we want to retry? It looks like there are several cases in > > `Docker::inspect` which will result in a failure (failed to create > > subprocess, failed to read stdout, etc..), and it looks to me like we could > > probably just retry in those cases. WDYT? > > Andrei Budnik wrote: > Good point! If a docker daemon returns non-zero, the docker libray will > retry `inspect`, then we'll get a message kile: > `I0228 17:28:13.275115 3248 docker.cpp:1369] Retrying inspect with > non-zero status code. cmd: 'docker -H unix:///var/run/docker.sock inspect > mesos-210b988c-c808-47e5-af65-75f40269755b', interval: 500ms` > > But if the docker library returns a failure itself due to some severe bug > (failed to create subprocess, failed to read stdout, etc...), then IMO we > should stop retrying `inspect`: > > ``` > [](const Future<Docker::Container>& future) > -> Future<ControlFlow<Docker::Container>> { > if (future.isReady()) { > return Break(future.get()); > } > if (future.isFailed()) { > return Failure(future.failure()); > } > return Continue(); > }); > ```
SGTM! - Greg ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/65759/#review198382 ----------------------------------------------------------- On Feb. 22, 2018, 8:32 p.m., Andrei Budnik wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/65759/ > ----------------------------------------------------------- > > (Updated Feb. 22, 2018, 8:32 p.m.) > > > Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and > Michael Park. > > > Bugs: MESOS-8574 > https://issues.apache.org/jira/browse/MESOS-8574 > > > Repository: mesos > > > Description > ------- > > This patch adds retries for `inspect` command to workaround docker > daemon hangs. We assume that the docker daemon can be temporarily > unresponsive. If it's unresponsive, then any started docker cli > command hangs. To address the issue, we retry `inspect` in the loop. > > > Diffs > ----- > > src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 > > > Diff: https://reviews.apache.org/r/65759/diff/4/ > > > Testing > ------- > > internal CI > > Manually, described in /r/65713 > > > Thanks, > > Andrei Budnik > >