[ https://issues.apache.org/jira/browse/MESOS-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981249#comment-14981249 ]
Chris Fortier commented on MESOS-3808:
--------------------------------------

Gilbert, I think I have it narrowed down to this block of code:
https://github.com/cfortier2/mesos/blob/master/src/slave/containerizer/docker.cpp#L1492

It seems that the `else` block is being called but it is trying to stop the
container with only the id. Any advice on how to fix this?

> slave/containerizer/docker leaves orphan containers on restart of mesos-slave
> -----------------------------------------------------------------------------
>
>                Key: MESOS-3808
>                URL: https://issues.apache.org/jira/browse/MESOS-3808
>            Project: Mesos
>         Issue Type: Bug
>         Components: containerization, docker, slave
>   Affects Versions: 0.25.0
>        Environment: CoreOS. Running mesos-slave in a container.
>           Reporter: Chris Fortier
>           Assignee: Gilbert Song
>  Original Estimate: 4h
> Remaining Estimate: 4h
>
> We attempted to upgrade from Mesos 0.23 to 0.25 but noticed that Docker
> containers launched by Mesos were being orphaned and not destroyed when the
> Mesos agent was restarted.
> Relevant log output:
> {noformat}
> I1027 20:36:22.343880 23004 docker.cpp:535] Recovering Docker containers
> I1027 20:36:22.517032 23008 docker.cpp:639] Recovering container
> 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' for executor
> 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.517467 23008 docker.cpp:639] Recovering container
> '77b1748e-f295-4eb5-9966-d7a3bba2fc31' for executor
> 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.517817 23007 slave.cpp:4051] Sending reconnect request to
> executor ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:40596
> I1027 20:36:22.518033 23007 slave.cpp:4051] Sending reconnect request to
> executor ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:57469
> I1027 20:36:22.518038 23008 docker.cpp:1592] Executor for container
> 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' has exited
> E1027 20:36:22.518070 23010 socket.hpp:174] Shutdown failed on fd=13:
> Transport endpoint is not connected [107]
> I1027 20:36:22.518084 23008 docker.cpp:1390] Destroying container
> 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
> I1027 20:36:22.518282 23008 docker.cpp:1592] Executor for container
> '77b1748e-f295-4eb5-9966-d7a3bba2fc31' has exited
> I1027 20:36:22.518324 23008 docker.cpp:1390] Destroying container
> '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
> E1027 20:36:22.518357 23010 socket.hpp:174] Shutdown failed on fd=13:
> Transport endpoint is not connected [107]
> I1027 20:36:22.518360 23008 docker.cpp:1494] Running docker stop on container
> 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
> I1027 20:36:22.518489 23008 docker.cpp:1494] Running docker stop on container
> '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
> I1027 20:36:22.518592 23005 slave.cpp:3433] Executor
> 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework
> 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
> I1027 20:36:22.519127 23005 slave.cpp:2717] Handling status update TASK_LOST
> (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task
> ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
> I1027 20:36:22.519263 23005 slave.cpp:3433] Executor
> 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework
> 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
> I1027 20:36:22.519300 23005 slave.cpp:2717] Handling status update TASK_LOST
> (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task
> ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
> W1027 20:36:22.519498 23003 docker.cpp:1002] Ignoring updating unknown
> container: a2308dfc-ec2f-4687-ae92-f045dd2d3614
> W1027 20:36:22.519611 23003 docker.cpp:1002] Ignoring updating unknown
> container: 77b1748e-f295-4eb5-9966-d7a3bba2fc31
> I1027 20:36:22.519691 23003 status_update_manager.cpp:322] Received status
> update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task
> ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.519755 23003 status_update_manager.cpp:826] Checkpointing
> UPDATE for status update TASK_LOST (UUID:
> b07be363-433f-4a11-8c81-1f5787debc76) for task
> ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.525867 23003 status_update_manager.cpp:322] Received status
> update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task
> ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.525907 23003 status_update_manager.cpp:826] Checkpointing
> UPDATE for status update TASK_LOST (UUID:
> 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task
> ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000
> W1027 20:36:22.526645 23009 slave.cpp:2968] Dropping status update TASK_LOST
> (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task
> ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000 sent by status update manager because
> the slave is in RECOVERING state
> W1027 20:36:22.529747 23007 slave.cpp:2968] Dropping status update TASK_LOST
> (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task
> ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework
> 20151016-161150-1902412554-5050-1-0000 sent by status update manager because
> the slave is in RECOVERING state
> I1027 20:36:24.518846 23004 slave.cpp:2666] Cleaning up un-reregistered
> executors
> I1027 20:36:24.519011 23004 slave.cpp:4110] Finished recovery
> {noformat}
> Docker output:
> {noformat}
> CONTAINER ID   IMAGE             COMMAND                CREATED              STATUS              PORTS   NAMES
> 8d0d69fe34d7   libmesos/ubuntu   "/bin/sh -c 'while s   About a minute ago   Up About a minute           mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a1492e45-2fce-4ca4-bd16-edcef439ca31
> e4344cfbcc6d   libmesos/ubuntu   "/bin/sh -c 'while s   About a minute ago   Up About a minute           mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.c3624e67-7a27-4309-8aa4-365d3fd1bfe2
> 3ce690f3b872   libmesos/ubuntu   "/bin/sh -c 'while s   4 minutes ago        Up 4 minutes                mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a2308dfc-ec2f-4687-ae92-f045dd2d3614
> 5b4546d3087a   libmesos/ubuntu   "/bin/sh -c 'while s   4 minutes ago        Up 4 minutes                mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.77b1748e-f295-4eb5-9966-d7a3bba2fc31
> {noformat}
> After digging into the issue, it seems the comment below might be the
> problem.
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L97
> It appears that the recovery command is still only sending the containerId
> and not the frameworkId + containerId.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)