Thanks Jojy,

I agree that my current patch is not an ideal solution; it was intended as a 
proof of concept to confirm that the timing was in fact what was causing the 
problems I've been experiencing. Perhaps I didn't make that clear enough in the 
JIRA. That is also why I haven't attempted to submit that patch back.

I was hoping to open a dialog to discuss the most appropriate fix for this 
error, as I am relatively new to the future/promise paradigm and haven't done 
much C++ in 10 years.

I'm assuming that when you say "using 'after' on the future" you are referring 
to the initial inspect call's future, correct? In that case, I'm not convinced 
that is the right solution either. If you dig into the inspect call, you'll see 
that it has a fairly sophisticated retry system of its own. However, that retry 
system is designed around whether the actual inspect command completes its 
execution. If the inspect command runs forever, that thread blocks waiting for 
it forever.

I am leaning towards a combination of both. I believe it is appropriate to wait 
a small amount of time before making the initial inspect call, to give docker a 
chance to create and start the container. I do see your point that a race 
condition is still possible if the run command itself is delayed by just the 
right amount of time; I understand that the run thread could be executed any 
time after the future is created. Remember that when running vanilla mesos 
(without my current patch), the containers that launch successfully do so 
because the initial inspect fails (returns non-zero, since the container 
doesn't exist yet), and the inspect function's built-in retry system delays 
500ms and runs again, this time successfully. (See these logs for what I mean: 
https://gist.github.com/travishegner/dcb5797f5a0919e45618)
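
For anyone following along, here is a paraphrased sketch of how I read that 
built-in retry. This is not the actual Mesos source: runInspectSubprocess(), 
SubprocessResult, and parseContainer() are hypothetical helpers, and only 
Future, then(), and process::after() are meant to be real libprocess calls. 
The point is that the retry only fires when the `docker inspect` subprocess 
actually exits non-zero; if the subprocess hangs, the returned future simply 
never completes.

// Paraphrased sketch only -- not the real Docker::inspect.
// runInspectSubprocess() and parseContainer() are hypothetical helpers,
// and DOCKER_INSPECT_DELAY stands in for the 500ms retry interval.
process::Future<Docker::Container> inspect(const std::string& containerName)
{
  return runInspectSubprocess(containerName) // runs `docker inspect <name>`
    .then([=](const SubprocessResult& result)
        -> process::Future<Docker::Container> {
      if (result.exitCode != 0) {
        // The container doesn't exist yet: wait 500ms and try again.
        // Note that we only get here if the subprocess *exits*; if it
        // hangs, this future never completes and no retry ever happens.
        return process::after(DOCKER_INSPECT_DELAY)
          .then([=]() { return inspect(containerName); });
      }
      return parseContainer(result.output);  // hypothetical parse helper
    });
}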

With that in mind, we wouldn't actually know whether the inspect is just 
delayed in its own retry pattern or is actually hung, unless we wait long 
enough. Given how frequently we encounter this issue in our environment (more 
than half of the attempted container launches), waiting a long time before we 
can report whether the container is running would be undesirable.

How do you feel about an initial delay (how long? 500ms?), in concert with an 
".after()" (how long? 3000ms?) on the inspect future that discards the original 
inspect and tries again? The initial delay would give docker time to create and 
start our container, and in my environment it drastically reduces the frequency 
with which we hit this condition. Ironically, the name of the retry delay 
variable (DOCKER_INSPECT_DELAY) seems to indicate that the author expected a 
delay to be inserted before the inspect call anyway, but reading through the 
code, that variable is only used as the retry interval, and the initial docker 
inspect is executed immediately. A timeout/retry on the inspect call would help 
guard against this type of issue, should the run and inspect commands be 
launched simultaneously for whatever reason.
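
To make that concrete, here is a rough sketch of what I have in mind. It is 
not a patch: inspectWithTimeout(), the simplified inspect() signature, how the 
Docker object is passed around, and the specific durations are all placeholders 
for discussion, while Future::after() and Future::discard() are the real 
libprocess calls being suggested.

// Rough sketch only, not a patch. inspectWithTimeout(), the simplified
// inspect() signature, and the durations are placeholders for discussion.
process::Future<Docker::Container> inspectWithTimeout(
    const process::Shared<Docker>& docker,
    const std::string& containerName,
    const Duration& timeout)
{
  return docker->inspect(containerName)
    .after(timeout, [=](process::Future<Docker::Container> inspection) {
      // Assume the in-flight inspect has hung: discard it and start over.
      // (We may want to cap the number of retries instead of looping
      // indefinitely.)
      inspection.discard();
      return inspectWithTimeout(docker, containerName, timeout);
    });
}

// Usage idea: give docker a moment to create and start the container,
// then inspect with a timeout-driven retry.
//
//   process::after(Milliseconds(500))
//     .then([=]() {
//       return inspectWithTimeout(docker, containerName, Seconds(3));
//     });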

A final thought: are we looking at this from the wrong perspective? Should the 
inspect call itself be modified to handle the case when the inspect command 
never returns?

Thanks for your thoughts,

Travis

-----Original Message-----
From: Jojy Varghese [mailto:j...@mesosphere.io] 
Sent: Tuesday, February 2, 2016 10:34 PM
To: dev@mesos.apache.org
Subject: Re: MESOS-4581 Docker containers getting stuck in staging.

Hi Travis
 Thanks for narrowing down the issue. I had a brief look at your patch and it 
looks like it relies on adding delay before inspect is called. Although that 
might work mostly, I am wondering if that is the right solution. It would be 
better if we can have a timeout (using ‘after’ on the future) and retry inspect 
after timeout. We will have to discard the inspect future that's in flight.

-Jojy


> On Feb 2, 2016, at 1:12 PM, Hegner, Travis <theg...@trilliumit.com> wrote:
> 
> I'd like to initiate a discussion on the following issue:
> 
> https://issues.apache.org/jira/browse/MESOS-4581
> 
> I've included a lot of detail in the JIRA, and would rather not reiterate 
> _all_ of it here on the list, but in short:
> 
> We are experiencing an issue when launching docker containers from marathon 
> on mesos, where the container actually starts on the slave node to which it's 
> assigned, but mesos/marathon get stuck in staging/staged respectively until 
> the task launch times out and the system tries to launch it again elsewhere. 
> The issue is random in nature: tasks start successfully about 40-50% of the 
> time, while the rest of the time they get stuck.
> 
> We've been able to narrow this down to a possible race condition likely in 
> docker itself, but being triggered by the mesos-docker-executor. I have 
> written and tested a patch in our environment which seems to have eliminated 
> the issue, however I feel that the patch could be made more robust, and is 
> currently just a work-around.
> 
> Thanks for your time and consideration of the issue.
> 
> Travis
