[ https://issues.apache.org/jira/browse/MESOS-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210909#comment-15210909 ]
Anand Mazumdar commented on MESOS-3573:
---------------------------------------

[~bobrik] This shouldn't be a problem. The transient error you are linking to happens for the following reason: when the agent is recovering, it tries to send a {{ReconnectExecutorMessage}} to reconnect with the executor. If that fails, as in your logs (most likely because the executor process is hung or has already exited), the agent itself kills the container the executor is running in after 2 seconds ({{EXECUTOR_REREGISTER_TIMEOUT}}):

https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L4700

Of course, if the docker daemon is still stuck and the agent is not able to invoke {{docker->stop}} on the container, the kill would fail. We cannot do anything about that, as noted in point 1 of my earlier comment. Let me know if you have any further queries.

> Mesos does not kill orphaned docker containers
> ----------------------------------------------
>
>                 Key: MESOS-3573
>                 URL: https://issues.apache.org/jira/browse/MESOS-3573
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker, slave
>            Reporter: Ian Babrou
>            Assignee: Anand Mazumdar
>              Labels: mesosphere
>
> After the upgrade to 0.24.0 we noticed hanging containers appearing. It looks like there were changes between 0.23.0 and 0.24.0 that broke cleanup.
> Here's how to trigger this bug:
> 1. Deploy an app in a docker container.
> 2. Kill the corresponding mesos-docker-executor process.
> 3. Observe the hanging container.
> Here are the logs after the kill:
> {noformat}
> slave_1 | I1002 12:12:59.362002 7791 docker.cpp:1576] Executor for container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' has exited
> slave_1 | I1002 12:12:59.362284 7791 docker.cpp:1374] Destroying container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1 | I1002 12:12:59.363404 7791 docker.cpp:1478] Running docker stop on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1 | I1002 12:12:59.363876 7791 slave.cpp:3399] Executor 'sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c' of framework 20150923-122130-2153451692-5050-1-0000 terminated with signal Terminated
> slave_1 | I1002 12:12:59.367570 7791 slave.cpp:2696] Handling status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 from @0.0.0.0:0
> slave_1 | I1002 12:12:59.367842 7791 slave.cpp:5094] Terminating task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c
> slave_1 | W1002 12:12:59.368484 7791 docker.cpp:986] Ignoring updating unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8
> slave_1 | I1002 12:12:59.368671 7791 status_update_manager.cpp:322] Received status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
> slave_1 | I1002 12:12:59.368741 7791 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
> slave_1 | I1002 12:12:59.370636 7791 status_update_manager.cpp:376] Forwarding update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 to the slave
> slave_1 | I1002 12:12:59.371335 7791 slave.cpp:2975] Forwarding the update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 to master@172.16.91.128:5050
> slave_1 | I1002 12:12:59.371908 7791 slave.cpp:2899] Status update manager successfully handled status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
> master_1 | I1002 12:12:59.372047 11 master.cpp:4069] Status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 from slave 20151002-120829-2153451692-5050-1-S0 at slave(1)@172.16.91.128:5051 (172.16.91.128)
> master_1 | I1002 12:12:59.372534 11 master.cpp:4108] Forwarding status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
> master_1 | I1002 12:12:59.373018 11 master.cpp:5576] Updating the latest state of task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 to TASK_FAILED
> master_1 | I1002 12:12:59.373447 11 hierarchical.hpp:814] Recovered cpus(*):0.1; mem(*):16; ports(*):[31685-31685] (total: cpus(*):4; mem(*):1001; disk(*):52869; ports(*):[31000-32000], allocated: cpus(*):8.32667e-17) on slave 20151002-120829-2153451692-5050-1-S0 from framework 20150923-122130-2153451692-5050-1-0000
> {noformat}
> Another issue: if you restart mesos-slave on a host with orphaned docker containers, they do not get killed. A restart used to kill them, and I had hoped to use that as a trick to clean up the hanging containers, but it doesn't work anymore.
> Marking this as critical because it hoards cluster resources and blocks scheduling.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)