[ https://issues.apache.org/jira/browse/MESOS-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15238353#comment-15238353 ]
Liqiang Lin commented on MESOS-5188: ------------------------------------ If that's the truth, the executor shall remove the stopped docker container in shutting down of executor, rather than try to stop the docker container. > docker executor thinks task is failed when docker container was stopped > ----------------------------------------------------------------------- > > Key: MESOS-5188 > URL: https://issues.apache.org/jira/browse/MESOS-5188 > Project: Mesos > Issue Type: Bug > Components: docker > Affects Versions: 0.28.0 > Reporter: Liqiang Lin > Fix For: 0.29.0 > > > Test cases: > 1. Launch a task with Swarm (on Mesos). > {code} > # docker -H 192.168.56.110:54375 run -d --cpu-shares 1 ubuntu sleep 300 > {code} > 2. Then stop the docker container. > {code} > # docker -H 192.168.56.110:54375 ps > CONTAINER ID IMAGE COMMAND CREATED > STATUS PORTS NAMES > b4813ba3ed4d ubuntu "sleep 300" 9 seconds ago > Up 8 seconds > mesos1/mesos-2cd5576e-6260-4262-a62c-b0dc45c86c45-S1.1595e79b-aef2-44b6-a313-ad4ff8626958 > # docker -H 192.168.56.110:54375 stop b4813ba3ed4d > b4813ba3ed4d > {code} > 3. Found the task is failed. See Mesos slave log, > {code} > I0407 09:10:57.606552 32307 slave.cpp:1508] Got assigned task 99ee7dc74861 > for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 > I0407 09:10:57.608230 32307 slave.cpp:1627] Launching task 99ee7dc74861 for > framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 > I0407 09:10:57.609979 32307 paths.cpp:528] Trying to chown > '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9' > to user 'root' > I0407 09:10:57.615881 32307 slave.cpp:5586] Launching executor 99ee7dc74861 > of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 with resources > cpus(*):0.1; mem(*):32 in work directory > '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9' > I0407 09:12:18.458449 32307 slave.cpp:1845] Queuing task '99ee7dc74861' for > executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 > I0407 09:12:18.459092 32307 slave.cpp:3711] No pings from master received > within 75secs > I0407 09:12:18.460212 32307 slave.cpp:4593] Current disk usage 56.53%. Max > allowed age: 2.342613645432778days > I0407 09:12:18.463484 32307 slave.cpp:928] Re-detecting master > I0407 09:12:18.463969 32307 slave.cpp:975] Detecting new master > I0407 09:12:18.464501 32307 slave.cpp:939] New master detected at > master@192.168.56.110:5050 > I0407 09:12:18.464848 32307 slave.cpp:964] No credentials provided. > Attempting to register without authentication > I0407 09:12:18.465237 32307 slave.cpp:975] Detecting new master > I0407 09:12:18.463611 32312 status_update_manager.cpp:174] Pausing sending > status updates > I0407 09:12:18.465744 32312 status_update_manager.cpp:174] Pausing sending > status updates > I0407 09:12:18.472323 32313 docker.cpp:1011] Starting container > '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' for task '99ee7dc74861' (and executor > '99ee7dc74861') of framework '5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000' > I0407 09:12:18.588739 32313 slave.cpp:1218] Re-registered with master > master@192.168.56.110:5050 > I0407 09:12:18.588927 32313 slave.cpp:1254] Forwarding total oversubscribed > resources > I0407 09:12:18.589320 32313 slave.cpp:2395] Updating framework > 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 pid to > scheduler(1)@192.168.56.110:53375 > I0407 09:12:18.592079 32308 status_update_manager.cpp:181] Resuming sending > status updates > I0407 09:12:18.592842 32313 slave.cpp:2534] Updated checkpointed resources > from to > I0407 09:12:18.592793 32308 status_update_manager.cpp:181] Resuming sending > status updates > I0407 09:12:20.582041 32307 slave.cpp:2836] Got registration for executor > '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 from > executor(1)@192.168.56.110:40725 > I0407 09:12:20.584446 32307 docker.cpp:1308] Ignoring updating container > '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' with resources passed to update is > identical to existing resources > I0407 09:12:20.585093 32307 slave.cpp:2010] Sending queued task > '99ee7dc74861' to executor '99ee7dc74861' of framework > 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 at executor(1)@192.168.56.110:40725 > I0407 09:12:21.307077 32312 slave.cpp:3195] Handling status update > TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task > 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 from > executor(1)@192.168.56.110:40725 > I0407 09:12:21.308820 32308 status_update_manager.cpp:320] Received status > update TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task > 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 > I0407 09:12:21.310058 32308 slave.cpp:3593] Forwarding the update > TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task > 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 to > master@192.168.56.110:5050 > I0407 09:12:21.310744 32308 slave.cpp:3503] Sending acknowledgement for > status update TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for > task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 to > executor(1)@192.168.56.110:40725 > I0407 09:12:21.315765 32307 status_update_manager.cpp:392] Received status > update acknowledgement (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task > 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 > I0407 09:16:13.943831 32306 slave.cpp:3195] Handling status update > TASK_FAILED (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task > 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 from > executor(1)@192.168.56.110:40725 > I0407 09:16:14.063880 32307 status_update_manager.cpp:320] Received status > update TASK_FAILED (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task > 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 > I0407 09:16:14.064471 32312 slave.cpp:3593] Forwarding the update TASK_FAILED > (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task 99ee7dc74861 of > framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 to > master@192.168.56.110:5050 > I0407 09:16:14.065014 32312 slave.cpp:3503] Sending acknowledgement for > status update TASK_FAILED (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for > task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 to > executor(1)@192.168.56.110:40725 > I0407 09:16:14.071398 32307 status_update_manager.cpp:392] Received status > update acknowledgement (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task > 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 > I0407 09:16:14.950019 32311 slave.cpp:3721] executor(1)@192.168.56.110:40725 > exited > I0407 09:16:14.993948 32308 docker.cpp:1997] Executor for container > '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' has exited > I0407 09:16:14.994104 32308 docker.cpp:1717] Destroying container > '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' > I0407 09:16:14.994875 32308 docker.cpp:1845] Running docker stop on container > '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' > I0407 09:16:14.996547 32308 slave.cpp:4097] Executor '99ee7dc74861' of > framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 exited with status 0 > I0407 09:16:14.996695 32308 slave.cpp:4201] Cleaning up executor > '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 at > executor(1)@192.168.56.110:40725 > I0407 09:16:14.997028 32312 gc.cpp:55] Scheduling > '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9' > for gc 6.99998846192days in the future > I0407 09:16:14.997289 32308 slave.cpp:4289] Cleaning up framework > 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 > I0407 09:16:14.997691 32307 status_update_manager.cpp:282] Closing status > update streams for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 > I0407 09:16:14.997697 32312 gc.cpp:55] Scheduling > '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000/executors/99ee7dc74861' > for gc 6.99998845958222days in the future > I0407 09:16:14.998423 32312 gc.cpp:55] Scheduling > '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000' > for gc 6.99998845141333days in the future > I0407 09:16:18.470573 32308 slave.cpp:4593] Current disk usage 56.56%. Max > allowed age: 2.341048334320162days > {code} > Why docker executor thinks the task is failed when stop the docker container? > The docker container still exists. Should it be relay on state that docker > container is removed? > Docker executor code > {code} > void reaped( > ExecutorDriver* _driver, > const Future<Nothing>& run) > { > // Wait for docker->stop to finish, and best effort wait for the > // inspect future to complete with a timeout. > stop.onAny(defer(self(), [=](const Future<Nothing>&) { > inspect > .after(DOCKER_INSPECT_TIMEOUT, [=](const Future<Nothing>&) { > inspect.discard(); > return inspect; > }) > .onAny(defer(self(), [=](const Future<Nothing>&) { > CHECK_SOME(driver); > TaskState state; > string message; > if (!stop.isReady()) { > state = TASK_FAILED; > message = "Unable to stop docker container, error: " + > (stop.isFailed() ? stop.failure() : "future discarded"); > } else if (killed) { > state = TASK_KILLED; > } else if (!run.isReady()) { > state = TASK_FAILED; > message = "Docker container run error: " + > (run.isFailed() ? > run.failure() : "future discarded"); > {code} > Another problem is, Mesos thinks the task is failed, and do some clean up for > executor. But the orphan docker container is not removed. I still can start > it manually, Mesos can not trace the resource usage of this orphan container. > {code} > # docker -H 192.168.56.110:54375 start b4813ba3ed4d > b4813ba3ed4d > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)