[ https://issues.apache.org/jira/browse/MESOS-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Chen updated MESOS-1915:
--------------------------------
    Target Version/s: 0.21.0

> Docker containers that fail to launch are not killed
> ----------------------------------------------------
>
>                 Key: MESOS-1915
>                 URL: https://issues.apache.org/jira/browse/MESOS-1915
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.20.1
>         Environment: Mesos 0.20.1 using the docker executor with a private
> docker repository. Images often take up to 5 minutes to launch.
> /etc/mesos-slave/executor_registration_timeout is set to '10mins'
>            Reporter: Daniel Hall
>            Assignee: Timothy Chen
>
> When we launch docker containers on our Mesos cluster using Marathon, we have
> noticed that we end up with several docker containers running, with only one
> of them actually being tracked by Mesos. When inspected, the containers both
> have the same start time.
> This seems to be because Mesos gives up on trying to start the container
> after 1min, but fails to clean up the docker container because it is not
> yet running. Eventually the container starts alongside all the other attempts
> Mesos has made, and we end up with several containers running with only one
> being tracked by Mesos.
> I've pasted some logs from the slave below, filtered for that particular task,
> but it is pretty easy to replicate in our environment, so I'm happy to provide
> further logs, details and analysis as required. This is becoming a bit of a
> problem for us, so we are happy to help as much as possible.
> {noformat}
> Oct 13 04:47:42 mesosslave-1 mesos-slave[16647]: I1013 04:47:42.776945 16661
> docker.cpp:743] Starting container 'dd113461-4d18-4170-8e3f-9527e6d7f598' for
> task 'docker-test.11588a48-5294-11e4-adea-42010af0f51e' (and executor
> 'docker-test.11588a48-5294-11e4-adea-42010af0f51e') of framework
> '20140918-022627-519434250-5050-6171-0000'
> Oct 13 04:48:42 mesosslave-1 mesos-slave[16647]: E1013 04:48:42.819563 16664
> slave.cpp:2205] Failed to update resources for container
> dd113461-4d18-4170-8e3f-9527e6d7f598 of executor
> docker-test.11588a48-5294-11e4-adea-42010af0f51e running task
> docker-test.11588a48-5294-11e4-adea-42010af0f51e on status update for
> terminal task, destroying container: No container found
> Oct 13 04:49:29 mesosslave-1 mesos-slave[16647]: I1013 04:49:29.916460 16665
> slave.cpp:2538] Monitoring executor
> 'docker-test.11588a48-5294-11e4-adea-42010af0f51e' of framework
> '20140918-022627-519434250-5050-6171-0000' in container
> 'dd113461-4d18-4170-8e3f-9527e6d7f598'
> Oct 13 04:49:31 mesosslave-1 mesos-slave[16647]: I1013 04:49:31.103175 16663
> docker.cpp:1286] Updated 'cpu.shares' to 102 at
> /cgroup/cpu/docker/6a581f5c2174dc76bcfb2e5b89fd9a4310732c384d93901a8b37da8aeb700468
> for container dd113461-4d18-4170-8e3f-9527e6d7f598
> Oct 13 04:49:31 mesosslave-1 mesos-slave[16647]: I1013 04:49:31.105036 16663
> docker.cpp:1321] Updated 'memory.soft_limit_in_bytes' to 32MB for container
> dd113461-4d18-4170-8e3f-9527e6d7f598
> {noformat}
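
As a rough check of the 1min figure in the description, the sketch below parses the glog timestamps from three of the slave log entries above and prints the time elapsed since the "Starting container" entry. The helper name `glog_time` and the year 2014 are assumptions made for illustration (the glog prefix carries no year), and the message text of each entry is shortened.

```python
import re
from datetime import datetime

# Three entries excerpted from the slave log above (message text shortened).
LOG_LINES = [
    "Oct 13 04:47:42 mesosslave-1 mesos-slave[16647]: I1013 04:47:42.776945 16661 "
    "docker.cpp:743] Starting container 'dd113461-4d18-4170-8e3f-9527e6d7f598'",
    "Oct 13 04:48:42 mesosslave-1 mesos-slave[16647]: E1013 04:48:42.819563 16664 "
    "slave.cpp:2205] Failed to update resources for container "
    "dd113461-4d18-4170-8e3f-9527e6d7f598",
    "Oct 13 04:49:29 mesosslave-1 mesos-slave[16647]: I1013 04:49:29.916460 16665 "
    "slave.cpp:2538] Monitoring executor",
]

# glog prefix: severity letter, MMDD, then HH:MM:SS.microseconds.
GLOG_TS = re.compile(r"[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})")


def glog_time(line, year=2014):
    """Return the glog timestamp of a log line (hypothetical helper).

    The year is an assumption; it does not affect the computed gaps.
    """
    match = GLOG_TS.search(line)
    if match is None:
        raise ValueError("no glog timestamp found in line")
    stamp = f"{year}{match.group(1)} {match.group(2)}"
    return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")


start = glog_time(LOG_LINES[0])
for line in LOG_LINES:
    elapsed = (glog_time(line) - start).total_seconds()
    # Print the elapsed seconds next to the source location of the entry.
    print(f"{elapsed:7.1f}s  {line.split('] ', 1)[0].rsplit(' ', 1)[-1]}")
```

Under these assumptions the "No container found" error is logged about 60 seconds after the start entry, well short of the configured '10mins' executor_registration_timeout, and the "Monitoring executor" entry follows about 107 seconds after the start. This matches the report that Mesos gives up after roughly one minute while the container is still coming up.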