Hi Hegner. I think Jojy means we change Docker::inspect from

```
s.get().status()
  .onAny([=]() { __inspect(cmd, promise, retryInterval, output, s.get()); });
```

to

```
s.get().status()
  .after(timeoutSeconds, []() {...})
  .onAny([=]() { __inspect(cmd, promise, retryInterval, output, s.get()); });
```
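The `{...}` body above is left open on purpose; a minimal sketch of what it might contain, assuming libprocess's `Future::after` overload that passes in the timed-out future (a real fix would likely also need to kill the still-running `docker inspect` subprocess, which this sketch omits):

```
s.get().status()
  .after(timeoutSeconds,
         [=](Future<Option<int>> status) -> Future<Option<int>> {
    // The status future did not complete in time: discard the in-flight
    // attempt and fail the future so the `.onAny` continuation below
    // fires and can schedule a retry.
    status.discard();
    return Failure("docker inspect timed out");
  })
  .onAny([=]() { __inspect(cmd, promise, retryInterval, output, s.get()); });
```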
If 500ms doesn't work in a different environment, we would still run into the same problem. I think adding a timeout limit in inspect would be better.

On Wed, Feb 3, 2016 at 10:05 PM, Hegner, Travis <theg...@trilliumit.com> wrote:

> Thanks Jojy,
>
> I agree that my current patch is not the most ideal solution, as it was intended as a proof of concept to ensure that it was in fact the timing that was causing the problems I've been experiencing. Perhaps I didn't make that clear enough in the JIRA. That is also the reason I haven't attempted to submit that patch back.
>
> I was hoping to open a dialog to discuss the most appropriate fix for that error, as I am relatively new to the future/promise paradigm, and I haven't done much in C++ in 10 years.
>
> I'm assuming that when you say "using 'after' on the future" you are referring to the initial inspect call's future, correct? In that case, I'm not convinced that is the right solution either. If you dig into the inspect call, you'll see that it has a fairly sophisticated retry system itself. However, that retry system is designed around whether the actual inspect command completes its execution. If the inspect command runs forever, then that thread blocks waiting for it forever.
>
> I am leaning towards a combination of both. I believe that it is appropriate to wait a small amount of time before the initial inspect call, to give docker a chance to create and start the container. I see your point that there is still the possibility of a race condition if the run command is delayed for the right amount of time itself; I understand that the run thread could be executed any time after the future is created. When running vanilla mesos (without my current patch), remember that the containers that launch successfully actually do so because the initial inspect fails (returns non-zero, as the container doesn't exist yet), and the inspect function's built-in retry system delays 500ms and runs again, this time successfully. (See these logs to see what I mean: https://gist.github.com/travishegner/dcb5797f5a0919e45618)
>
> With that in mind, we wouldn't actually know whether the inspect is just delayed in its own retry pattern, or is actually hung, unless we wait a large enough amount of time. Given how frequently we encounter this issue in our environment (more than half of the attempted container launches), waiting a large amount of time to determine whether the container is running would be undesirable.
>
> How do you feel about an initial delay (how long? 500ms?), in concert with an ".after()" (how long? 3000ms?) on the inspect future that will discard the original inspect and try again? The initial delay would give docker time to create and start our container, and in my environment would drastically reduce the frequency with which we are met with this condition. Ironically, the variable name of the retry delay (DOCKER_INSPECT_DELAY) seems to indicate that the author expected a delay to be inserted before the inspect call anyway, but in reading through the code, that variable is only used as the retry interval, and the initial docker inspect is executed immediately. A timeout/retry on the inspect call helps to guard against this type of issue, should the run and inspect commands be launched simultaneously for whatever reason.
>
> A final thought: are we looking at this from the wrong perspective?
> Should the inspect call itself be modified to handle the case when the inspect command never returns?
>
> Thanks for your thoughts,
>
> Travis
>
> -----Original Message-----
> From: Jojy Varghese [mailto:j...@mesosphere.io]
> Sent: Tuesday, February 2, 2016 10:34 PM
> To: dev@mesos.apache.org
> Subject: Re: MESOS-4581 Docker containers getting stuck in staging.
>
> Hi Travis,
> Thanks for narrowing down the issue. I had a brief look at your patch, and it looks like it relies on adding a delay before inspect is called. Although that might mostly work, I am wondering if that is the right solution. It would be better if we can have a timeout (using ‘after’ on the future) and retry inspect after the timeout. We will have to discard the inspect future that's in flight.
>
> -Jojy
>
>
> > On Feb 2, 2016, at 1:12 PM, Hegner, Travis <theg...@trilliumit.com> wrote:
> >
> > I'd like to initiate a discussion on the following issue:
> >
> > https://issues.apache.org/jira/browse/MESOS-4581
> >
> > I've included a lot of detail in the JIRA, and would rather not reiterate _all_ of it here on the list, but in short:
> >
> > We are experiencing an issue when launching docker containers from marathon on mesos, where the container actually starts on the slave node to which it's assigned, but mesos/marathon get stuck in staging/staged respectively, until the task launch times out and the system tries again to launch it elsewhere. This issue is random in nature, successfully starting tasks about 40-50% of the time, while getting stuck the rest of the time.
> >
> > We've been able to narrow this down to a possible race condition, likely in docker itself, but triggered by the mesos-docker-executor. I have written and tested a patch in our environment which seems to have eliminated the issue; however, I feel that the patch could be made more robust, and it is currently just a work-around.
> >
> > Thanks for your time and consideration of the issue.
> >
> > Travis

--
Best Regards,
Haosdent Huang
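For concreteness, the combination Travis proposes above (a short initial delay plus a per-attempt timeout that discards and retries) might look roughly like the following. This is only a sketch, not the actual Mesos patch: the constants, the `retryInspect` helper, and the surrounding variables (`docker`, `containerName`) are illustrative assumptions.

```
#include <process/after.hpp>   // process::after
#include <process/future.hpp>  // Future, discard

// Hypothetical constants: a grace period before the first inspect, and a
// bound on how long any single inspect attempt may run before we give up.
const Duration DOCKER_INSPECT_DELAY = Milliseconds(500);
const Duration DOCKER_INSPECT_TIMEOUT = Seconds(3);

// Wait briefly so docker has a chance to create and start the container,
// then bound the inspect attempt so a hung `docker inspect` is discarded
// and retried, instead of blocking its thread forever.
Future<Docker::Container> container =
  process::after(DOCKER_INSPECT_DELAY)
    .then([=](const Nothing&) {
      return docker->inspect(containerName, DOCKER_INSPECT_DELAY)
        .after(DOCKER_INSPECT_TIMEOUT,
               [=](Future<Docker::Container> attempt)
                   -> Future<Docker::Container> {
          attempt.discard();     // drop the in-flight, possibly hung attempt
          return retryInspect(); // hypothetical helper: start a fresh attempt
        });
    });
```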