[ https://issues.apache.org/jira/browse/MESOS-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210909#comment-15210909 ]

Anand Mazumdar commented on MESOS-3573:
---------------------------------------

[~bobrik] This shouldn't be a problem. The transient error you are linking to 
happens for the following reason:

When the agent is recovering, it sends a {{ReconnectExecutorMessage}} to the 
executor to re-establish the connection. If that fails, as in your logs, most 
likely because the executor process is hung or has already exited, the agent 
itself kills the container the executor is running in after 2 seconds 
({{EXECUTOR_REREGISTER_TIMEOUT}}): 
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L4700
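
For illustration only, here is a minimal, self-contained sketch of that 
control flow. It is not the actual agent code (which is asynchronous and 
libprocess-based, see the slave.cpp link above); the {{Executor}} struct and 
the {{sendReconnectExecutorMessage()}} / {{destroyContainer()}} helpers are 
hypothetical stand-ins. Only the 2-second {{EXECUTOR_REREGISTER_TIMEOUT}} and 
the reconnect-then-kill sequence mirror what is described here:

{code:cpp}
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <string>

using namespace std::chrono_literals;

// The 2-second re-registration window mentioned above.
constexpr auto EXECUTOR_REREGISTER_TIMEOUT = 2s;

// Hypothetical, simplified view of an executor the agent is recovering.
struct Executor {
  std::string containerId;
  bool reregistered = false;
  std::mutex mutex;
  std::condition_variable cv;
};

// Stand-in for sending ReconnectExecutorMessage to the executor.
void sendReconnectExecutorMessage(Executor& executor) {
  std::cout << "ReconnectExecutorMessage -> " << executor.containerId << "\n";
}

// Stand-in for the containerizer destroy path (which ends in `docker stop`).
void destroyContainer(const std::string& containerId) {
  std::cout << "destroying container " << containerId << "\n";
}

// Recovery path: ask the executor to reconnect; if it does not re-register
// within EXECUTOR_REREGISTER_TIMEOUT (e.g. it is hung or already gone),
// destroy the container it was running in.
void recoverExecutor(Executor& executor) {
  sendReconnectExecutorMessage(executor);

  std::unique_lock<std::mutex> lock(executor.mutex);
  const bool reregistered = executor.cv.wait_for(
      lock, EXECUTOR_REREGISTER_TIMEOUT, [&] { return executor.reregistered; });

  if (!reregistered) {
    destroyContainer(executor.containerId);
  }
}

int main() {
  Executor executor;
  executor.containerId = "f083aaa2-d5c3-43c1-b6ba-342de8829fa8";

  // Nothing ever re-registers this executor, so after ~2 seconds the
  // container is destroyed.
  recoverExecutor(executor);
  return 0;
}
{code}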

Of course, if the docker daemon is still stuck and the agent is not able to 
invoke {{docker->stop}} on the container, the cleanup would fail. We cannot do 
anything about that, as remarked in point 1 of my earlier comment. Let me know 
if you have any further queries.
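
To make that failure mode concrete, here is a small sketch (the container 
name, the {{dockerStop}} helper, and the 10-second bound are made up for the 
example) of what the cleanup amounts to from the agent's point of view: asking 
the daemon to run {{docker stop}}. If the daemon is wedged, that call simply 
hangs or times out and the container survives; a timeout only bounds how long 
we wait, it cannot unstick the daemon.

{code:cpp}
#include <cstdlib>
#include <iostream>
#include <string>

// Ask the docker daemon to stop a container, giving up after `timeoutSecs`.
// Returns true only if `docker stop` actually succeeded.
bool dockerStop(const std::string& container, int timeoutSecs) {
  const std::string cmd =
      "timeout " + std::to_string(timeoutSecs) + " docker stop " + container;
  return std::system(cmd.c_str()) == 0;
}

int main() {
  // Hypothetical container name, used only for this example.
  if (!dockerStop("mesos-f083aaa2-d5c3-43c1-b6ba-342de8829fa8", 10)) {
    // Nothing more the agent can do here: until the docker daemon recovers
    // (or is restarted), the container stays orphaned.
    std::cerr << "docker stop failed or timed out" << std::endl;
    return 1;
  }
  return 0;
}
{code}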

> Mesos does not kill orphaned docker containers
> ----------------------------------------------
>
>                 Key: MESOS-3573
>                 URL: https://issues.apache.org/jira/browse/MESOS-3573
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker, slave
>            Reporter: Ian Babrou
>            Assignee: Anand Mazumdar
>              Labels: mesosphere
>
> After upgrading to 0.24.0 we noticed hanging containers appearing. It looks 
> like changes between 0.23.0 and 0.24.0 broke the cleanup.
> Here's how to trigger this bug:
> 1. Deploy an app in a docker container.
> 2. Kill the corresponding mesos-docker-executor process.
> 3. Observe the hanging container.
> Here are the logs after kill:
> {noformat}
> slave_1    | I1002 12:12:59.362002  7791 docker.cpp:1576] Executor for container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8' has exited
> slave_1    | I1002 12:12:59.362284  7791 docker.cpp:1374] Destroying container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1    | I1002 12:12:59.363404  7791 docker.cpp:1478] Running docker stop on container 'f083aaa2-d5c3-43c1-b6ba-342de8829fa8'
> slave_1    | I1002 12:12:59.363876  7791 slave.cpp:3399] Executor 'sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c' of framework 20150923-122130-2153451692-5050-1-0000 terminated with signal Terminated
> slave_1    | I1002 12:12:59.367570  7791 slave.cpp:2696] Handling status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 from @0.0.0.0:0
> slave_1    | I1002 12:12:59.367842  7791 slave.cpp:5094] Terminating task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c
> slave_1    | W1002 12:12:59.368484  7791 docker.cpp:986] Ignoring updating unknown container: f083aaa2-d5c3-43c1-b6ba-342de8829fa8
> slave_1    | I1002 12:12:59.368671  7791 status_update_manager.cpp:322] Received status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
> slave_1    | I1002 12:12:59.368741  7791 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
> slave_1    | I1002 12:12:59.370636  7791 status_update_manager.cpp:376] Forwarding update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 to the slave
> slave_1    | I1002 12:12:59.371335  7791 slave.cpp:2975] Forwarding the update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 to master@172.16.91.128:5050
> slave_1    | I1002 12:12:59.371908  7791 slave.cpp:2899] Status update manager successfully handled status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
> master_1   | I1002 12:12:59.372047    11 master.cpp:4069] Status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 from slave 20151002-120829-2153451692-5050-1-S0 at slave(1)@172.16.91.128:5051 (172.16.91.128)
> master_1   | I1002 12:12:59.372534    11 master.cpp:4108] Forwarding status update TASK_FAILED (UUID: 4a1b2387-a469-4f01-bfcb-0d1cccbde550) for task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000
> master_1   | I1002 12:12:59.373018    11 master.cpp:5576] Updating the latest state of task sleepy.87eb6191-68fe-11e5-9444-8eb895523b9c of framework 20150923-122130-2153451692-5050-1-0000 to TASK_FAILED
> master_1   | I1002 12:12:59.373447    11 hierarchical.hpp:814] Recovered cpus(*):0.1; mem(*):16; ports(*):[31685-31685] (total: cpus(*):4; mem(*):1001; disk(*):52869; ports(*):[31000-32000], allocated: cpus(*):8.32667e-17) on slave 20151002-120829-2153451692-5050-1-S0 from framework 20150923-122130-2153451692-5050-1-0000
> {noformat}
> Another issue: if you restart mesos-slave on a host with orphaned docker 
> containers, they are not killed. This was the case before, and I had hoped 
> this trick would kill the hanging containers, but it doesn't work now.
> Marking this as critical because it hoards cluster resources and blocks 
> scheduling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
