[ https://issues.apache.org/jira/browse/MESOS-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jord Sonneveld updated MESOS-3706:
----------------------------------
    Attachment: stderr
                stdout

I have attached the stdout/stderr files. They are not the specific ones you wanted (I ended up purging the Mesos work dir to see if that would help at all), but they are representative of these files in the past. As you can see, there is not very much in them.

> Tasks stuck in staging.
> -----------------------
>
>                 Key: MESOS-3706
>                 URL: https://issues.apache.org/jira/browse/MESOS-3706
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker, slave
>    Affects Versions: 0.23.0, 0.24.1
>            Reporter: Jord Sonneveld
>         Attachments: Screen Shot 2015-10-12 at 9.08.30 AM.png, Screen Shot 2015-10-12 at 9.24.32 AM.png, mesos-slave.INFO, mesos-slave.INFO.2, mesos-slave.INFO.3, stderr, stdout
>
> I have a Docker image which starts fine on all my slaves except for one. On that one, it is stuck in STAGING for a long time and never starts. The INFO log is full of messages like this:
>
> I1012 16:02:09.210306 34905 slave.cpp:1768] Asked to kill task kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72 of framework 20150109-172016-504433162-5050-19367-0002
> E1012 16:02:09.211272 34907 socket.hpp:174] Shutdown failed on fd=12: Transport endpoint is not connected [107]
>
> kwe-vinland-work is the task that is stuck in staging. It is launched by Marathon. I have launched 161 instances successfully on my cluster, but it refuses to launch on this specific slave.
> These machines are all managed via Ansible, so their configurations are / should be identical. I have re-run my Ansible scripts and rebooted the machines to no avail.
> It's been in this state for almost 30 minutes. You can see the mesos-docker-executor is still running:
>
> jord@dalstgmesos03:~$ date
> Mon Oct 12 16:13:55 UTC 2015
> jord@dalstgmesos03:~$ ps auwx | grep kwe-vinland
> root 35360 0.0 0.0 1070576 21476 ? Ssl 15:46 0:00 mesos-docker-executor --container=mesos-20151012-082619-4145023498-5050-22623-S0.0695c9e0-0adf-4dfb-bc2a-6060245dcabe --docker=docker --help=false --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/data/mesos/mesos/work/slaves/20151012-082619-4145023498-5050-22623-S0/frameworks/20150109-172016-504433162-5050-19367-0002/executors/kwe-vinland-work.6c939697-70f8-11e5-845c-0242e054dd72/runs/0695c9e0-0adf-4dfb-bc2a-6060245dcabe --stop_timeout=0ns
>
> According to docker ps -a, nothing was ever even launched:
>
> jord@dalstgmesos03:/data/mesos$ sudo docker ps -a
> CONTAINER ID   IMAGE                                              COMMAND                  CREATED          STATUS          PORTS                                            NAMES
> 5c858b90b0a0   registry.roger.dal.moz.com:5000/moz-statsd-v0.22   "/bin/sh -c ./start.s"   39 minutes ago   Up 39 minutes   0.0.0.0:9125->8125/udp, 0.0.0.0:9126->8126/tcp   statsd-fe-influxdb
> d765ba3829fd   registry.roger.dal.moz.com:5000/moz-statsd-v0.22   "/bin/sh -c ./start.s"   41 minutes ago   Up 41 minutes   0.0.0.0:8125->8125/udp, 0.0.0.0:8126->8126/tcp   statsd-repeater
>
> Those are the only two entries. Nothing about the kwe-vinland job.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)