[ https://issues.apache.org/jira/browse/MESOS-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jord Sonneveld updated MESOS-3706:
----------------------------------

    Description:

I have a docker image which starts fine on all my slaves except for one. On that one, it is stuck in STAGING for a long time and never starts. The INFO log is full of messages like this:

I1012 16:02:09.210306 34905 slave.cpp:1768] Asked to kill task kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72 of framework 20150109-172016-504433162-5050-19367-0002
E1012 16:02:09.211272 34907 socket.hpp:174] Shutdown failed on fd=12: Transport endpoint is not connected [107]

kwe-vinland-work is the task that is stuck in STAGING. It is launched by Marathon. I have launched 161 instances successfully on my cluster, but it refuses to launch on this specific slave.

These machines are all managed via Ansible, so their configurations are (or should be) identical. I have re-run my Ansible scripts and rebooted the machines to no avail.

It has been in this state for almost 30 minutes. You can see the mesos-docker-executor is still running:

jord@dalstgmesos03:~$ date
Mon Oct 12 16:13:55 UTC 2015
jord@dalstgmesos03:~$ ps auwx | grep kwe-vinland
root 35360 0.0 0.0 1070576 21476 ? Ssl 15:46 0:00 mesos-docker-executor --container=mesos-20151012-082619-4145023498-5050-22623-S0.0695c9e0-0adf-4dfb-bc2a-6060245dcabe --docker=docker --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/data/mesos/mesos/work/slaves/20151012-082619-4145023498-5050-22623-S0/frameworks/20150109-172016-504433162-5050-19367-0002/executors/kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72/runs/0695c9e0-0adf-4dfb-bc2a-6060245dcabe --stop_timeout=0ns
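(Not part of the original report: the repeated "Asked to kill task" entries above suggest one way to spot agents with this symptom. Below is a rough diagnostic sketch, assuming only the glog-style line format shown in the log excerpt; the function name and threshold are made up for illustration. A task whose kill requests keep accumulating without a terminal status update is a candidate for this stuck-in-STAGING state.)

```python
import re
from collections import Counter

# Matches glog lines like:
#   I1012 16:02:09.210306 34905 slave.cpp:1768] Asked to kill task <task> of framework <fw>
# Format assumed from the excerpt above; adjust if your log differs.
KILL_RE = re.compile(r"Asked to kill task (\S+) of framework (\S+)")

def kill_request_counts(log_lines):
    """Count 'Asked to kill task' entries per (task_id, framework_id)."""
    counts = Counter()
    for line in log_lines:
        m = KILL_RE.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts

if __name__ == "__main__":
    # Sample lines taken verbatim from the report's log excerpt.
    sample = [
        "I1012 16:02:09.210306 34905 slave.cpp:1768] Asked to kill task "
        "kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72 of framework "
        "20150109-172016-504433162-5050-19367-0002",
        "E1012 16:02:09.211272 34907 socket.hpp:174] Shutdown failed on fd=12: "
        "Transport endpoint is not connected [107]",
    ]
    for (task, fw), n in kill_request_counts(sample).items():
        print(f"{task} (framework {fw}): {n} kill request(s)")
```

Running it against the attached mesos-slave.INFO (e.g. `kill_request_counts(open("mesos-slave.INFO"))`) would show whether the kill requests for this task keep repeating.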
> Tasks stuck in staging.
> -----------------------
>
>                 Key: MESOS-3706
>                 URL: https://issues.apache.org/jira/browse/MESOS-3706
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.23.0, 0.24.1
>            Reporter: Jord Sonneveld
>         Attachments: Screen Shot 2015-10-12 at 9.08.30 AM.png, mesos-slave.INFO
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)