[ 
https://issues.apache.org/jira/browse/MESOS-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981249#comment-14981249
 ] 

Chris Fortier commented on MESOS-3808:
--------------------------------------

Gilbert,

I think I have it narrowed down to this block of code: 
https://github.com/cfortier2/mesos/blob/master/src/slave/containerizer/docker.cpp#L1492

It seems that the `else` block is being executed, but it tries to stop the 
container using only the container ID. Any advice on how to fix this?
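
To illustrate the mismatch I suspect, here is a minimal sketch in Python (not 
Mesos code). It assumes, based only on the `docker ps` output in the issue 
description, that container names have the form 
"mesos-<prefix>.<containerId>". The constants and the helper names 
`container_id_from_name` and `find_docker_name` are hypothetical and exist 
only for this example.

```python
# Assumptions taken from the NAMES column of the `docker ps` output in the
# issue description; these are illustrative values, not Mesos source code.
NAME_PREFIX = "mesos-"
NAME_SEPARATOR = "."


def container_id_from_name(name):
    """Return the trailing container ID of a Mesos-style Docker name, or None."""
    if not name.startswith(NAME_PREFIX):
        return None
    # Split on the last separator; the part after it is the container ID.
    _, sep, container_id = name.rpartition(NAME_SEPARATOR)
    return container_id if sep else None


def find_docker_name(names, container_id):
    """Return the full Docker name that ends with the given container ID."""
    for name in names:
        if container_id_from_name(name) == container_id:
            return name
    return None


# The two orphaned containers from the `docker ps` output.
names = [
    "mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14."
    "a2308dfc-ec2f-4687-ae92-f045dd2d3614",
    "mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14."
    "77b1748e-f295-4eb5-9966-d7a3bba2fc31",
]
container_id = "a2308dfc-ec2f-4687-ae92-f045dd2d3614"

# The bare container ID from the log is not itself a Docker container name.
print(container_id in names)
# Matching on the suffix recovers the full name Docker knows about.
print(find_docker_name(names, container_id))
```

If the stop call really does use the bare ID, Docker would have no container 
with that name, which would be consistent with the containers staying up 
after the restart.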

> slave/containerizer/docker leaves orphan containers on restart of mesos-slave
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-3808
>                 URL: https://issues.apache.org/jira/browse/MESOS-3808
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker, slave
>    Affects Versions: 0.25.0
>         Environment: CoreOS. Running mesos-slave in a container.
>            Reporter: Chris Fortier
>            Assignee: Gilbert Song
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> We attempted to upgrade from Mesos 0.23 to 0.25 but noticed that Docker 
> containers launched by Mesos were being orphaned and not destroyed when the 
> Mesos agent was restarted.
> Relevant log output:
> {noformat}
> I1027 20:36:22.343880 23004 docker.cpp:535] Recovering Docker containers
> I1027 20:36:22.517032 23008 docker.cpp:639] Recovering container 
> 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' for executor 
> 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework 
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.517467 23008 docker.cpp:639] Recovering container 
> '77b1748e-f295-4eb5-9966-d7a3bba2fc31' for executor 
> 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework 
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.517817 23007 slave.cpp:4051] Sending reconnect request to 
> executor ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 
> 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:40596
> I1027 20:36:22.518033 23007 slave.cpp:4051] Sending reconnect request to 
> executor ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 
> 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:57469
> I1027 20:36:22.518038 23008 docker.cpp:1592] Executor for container 
> 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' has exited
> E1027 20:36:22.518070 23010 socket.hpp:174] Shutdown failed on fd=13: 
> Transport endpoint is not connected [107]
> I1027 20:36:22.518084 23008 docker.cpp:1390] Destroying container 
> 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
> I1027 20:36:22.518282 23008 docker.cpp:1592] Executor for container 
> '77b1748e-f295-4eb5-9966-d7a3bba2fc31' has exited
> I1027 20:36:22.518324 23008 docker.cpp:1390] Destroying container 
> '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
> E1027 20:36:22.518357 23010 socket.hpp:174] Shutdown failed on fd=13: 
> Transport endpoint is not connected [107]
> I1027 20:36:22.518360 23008 docker.cpp:1494] Running docker stop on container 
> 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
> I1027 20:36:22.518489 23008 docker.cpp:1494] Running docker stop on container 
> '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
> I1027 20:36:22.518592 23005 slave.cpp:3433] Executor 
> 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework 
> 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
> I1027 20:36:22.519127 23005 slave.cpp:2717] Handling status update TASK_LOST 
> (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task 
> ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 
> 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
> I1027 20:36:22.519263 23005 slave.cpp:3433] Executor 
> 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework 
> 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
> I1027 20:36:22.519300 23005 slave.cpp:2717] Handling status update TASK_LOST 
> (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task 
> ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 
> 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
> W1027 20:36:22.519498 23003 docker.cpp:1002] Ignoring updating unknown 
> container: a2308dfc-ec2f-4687-ae92-f045dd2d3614
> W1027 20:36:22.519611 23003 docker.cpp:1002] Ignoring updating unknown 
> container: 77b1748e-f295-4eb5-9966-d7a3bba2fc31
> I1027 20:36:22.519691 23003 status_update_manager.cpp:322] Received status 
> update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task 
> ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.519755 23003 status_update_manager.cpp:826] Checkpointing 
> UPDATE for status update TASK_LOST (UUID: 
> b07be363-433f-4a11-8c81-1f5787debc76) for task 
> ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.525867 23003 status_update_manager.cpp:322] Received status 
> update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task 
> ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 
> 20151016-161150-1902412554-5050-1-0000
> I1027 20:36:22.525907 23003 status_update_manager.cpp:826] Checkpointing 
> UPDATE for status update TASK_LOST (UUID: 
> 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task 
> ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 
> 20151016-161150-1902412554-5050-1-0000
> W1027 20:36:22.526645 23009 slave.cpp:2968] Dropping status update TASK_LOST 
> (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task 
> ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 
> 20151016-161150-1902412554-5050-1-0000 sent by status update manager because 
> the slave is in RECOVERING state
> W1027 20:36:22.529747 23007 slave.cpp:2968] Dropping status update TASK_LOST 
> (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task 
> ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 
> 20151016-161150-1902412554-5050-1-0000 sent by status update manager because 
> the slave is in RECOVERING state
> I1027 20:36:24.518846 23004 slave.cpp:2666] Cleaning up un-reregistered 
> executors
> I1027 20:36:24.519011 23004 slave.cpp:4110] Finished recovery
> {noformat}
> Docker output:
> {noformat}
> CONTAINER ID        IMAGE                             COMMAND                CREATED              STATUS              PORTS               NAMES
> 8d0d69fe34d7        libmesos/ubuntu                   "/bin/sh -c 'while s   About a minute ago   Up About a minute                       mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a1492e45-2fce-4ca4-bd16-edcef439ca31
> e4344cfbcc6d        libmesos/ubuntu                   "/bin/sh -c 'while s   About a minute ago   Up About a minute                       mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.c3624e67-7a27-4309-8aa4-365d3fd1bfe2
> 3ce690f3b872        libmesos/ubuntu                   "/bin/sh -c 'while s   4 minutes ago        Up 4 minutes                            mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a2308dfc-ec2f-4687-ae92-f045dd2d3614
> 5b4546d3087a        libmesos/ubuntu                   "/bin/sh -c 'while s   4 minutes ago        Up 4 minutes                            mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.77b1748e-f295-4eb5-9966-d7a3bba2fc31
> {noformat}
> After digging into the issue, it seems the comment at the link below might 
> point to the problem:
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L97
> It appears that the recovery command is still sending only the containerId 
> and not the frameworkId + containerId.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
