[ https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619415#comment-16619415 ]
Alexander Rukletsov edited comment on MESOS-9131 at 9/18/18 5:41 PM:
---------------------------------------------------------------------

*{{master}} aka {{1.8-dev}}*:
{noformat}
commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0
Author:     Andrei Budnik <abud...@mesosphere.com>
AuthorDate: Tue Sep 18 19:09:31 2018 +0200
Commit:     Alexander Rukletsov <al...@apache.org>
CommitDate: Tue Sep 18 19:09:31 2018 +0200

    Fixed IOSwitchboard waiting EOF from attach container input request.

    Previously, when a corresponding nested container terminated, while the
    user was attached to the container's stdin via `ATTACH_CONTAINER_INPUT`
    IOSwitchboard didn't terminate immediately. IOSwitchboard was waiting
    for EOF message from the input HTTP connection. Since the IOSwitchboard
    was stuck, the corresponding nested container was also stuck in
    `DESTROYING` state.

    This patch fixes the aforementioned issue by sending 200 `OK` response
    for `ATTACH_CONTAINER_INPUT` call in the case when io redirect is
    finished while reading from the HTTP input connection is not.

    Review: https://reviews.apache.org/r/68232/
{noformat}
{noformat}
commit e941d206f651bde861675a6517a89e44d1f61a34
Author:     Andrei Budnik <abud...@mesosphere.com>
AuthorDate: Tue Sep 18 19:10:01 2018 +0200
Commit:     Alexander Rukletsov <al...@apache.org>
CommitDate: Tue Sep 18 19:10:01 2018 +0200

    Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test.

    This test verifies that IOSwitchboard, which holds an open HTTP input
    connection, terminates once IO redirects finish for the corresponding
    nested container.

    Review: https://reviews.apache.org/r/68230/
{noformat}
{noformat}
commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4
Author:     Andrei Budnik <abud...@mesosphere.com>
AuthorDate: Tue Sep 18 19:10:07 2018 +0200
Commit:     Alexander Rukletsov <al...@apache.org>
CommitDate: Tue Sep 18 19:10:07 2018 +0200

    Added `AgentAPITest.AttachContainerInputRepeat` test.

    This test verifies that we can call `ATTACH_CONTAINER_INPUT` more than
    once. We send a short message first then we send a long message in
    chunks.

    Review: https://reviews.apache.org/r/68231/
{noformat}

*{{master}} aka {{1.7.1}}*:
{noformat}
commit e9605a6243db41c1bbc85ec9ade112f2ef806c15
Author:     Andrei Budnik <abud...@mesosphere.com>
AuthorDate: Tue Sep 18 19:09:31 2018 +0200
Commit:     Alexander Rukletsov <al...@apache.org>
CommitDate: Tue Sep 18 19:27:17 2018 +0200

    Fixed IOSwitchboard waiting EOF from attach container input request.

    Previously, when a corresponding nested container terminated, while the
    user was attached to the container's stdin via `ATTACH_CONTAINER_INPUT`
    IOSwitchboard didn't terminate immediately. IOSwitchboard was waiting
    for EOF message from the input HTTP connection. Since the IOSwitchboard
    was stuck, the corresponding nested container was also stuck in
    `DESTROYING` state.

    This patch fixes the aforementioned issue by sending 200 `OK` response
    for `ATTACH_CONTAINER_INPUT` call in the case when io redirect is
    finished while reading from the HTTP input connection is not.

    Review: https://reviews.apache.org/r/68232/
    (cherry picked from commit 2fdc8f3cffc5eac91e5f2b0c6aef2254acfc2bd0)
{noformat}
{noformat}
commit f672afef601c71d69a9eb4db3c191bacfe167d3e
Author:     Andrei Budnik <abud...@mesosphere.com>
AuthorDate: Tue Sep 18 19:10:01 2018 +0200
Commit:     Alexander Rukletsov <al...@apache.org>
CommitDate: Tue Sep 18 19:27:17 2018 +0200

    Added `AgentAPITest.LaunchNestedContainerSessionKillTask` test.

    This test verifies that IOSwitchboard, which holds an open HTTP input
    connection, terminates once IO redirects finish for the corresponding
    nested container.
    Review: https://reviews.apache.org/r/68230/
    (cherry picked from commit e941d206f651bde861675a6517a89e44d1f61a34)
{noformat}
{noformat}
commit 4a1b3186a2fa64bf7d94787f3546dd584e2f1186
Author:     Andrei Budnik <abud...@mesosphere.com>
AuthorDate: Tue Sep 18 19:10:07 2018 +0200
Commit:     Alexander Rukletsov <al...@apache.org>
CommitDate: Tue Sep 18 19:27:17 2018 +0200

    Added `AgentAPITest.AttachContainerInputRepeat` test.

    This test verifies that we can call `ATTACH_CONTAINER_INPUT` more than
    once. We send a short message first then we send a long message in
    chunks.

    Review: https://reviews.apache.org/r/68231/
    (cherry picked from commit 7ad390b3aa261f4a39ff7f2c0842f2aae39005f4)
{noformat}

> Health checks launching nested containers while a container is being destroyed lead to unkillable tasks
> -------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9131
>                 URL: https://issues.apache.org/jira/browse/MESOS-9131
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization
>    Affects Versions: 1.5.1
>            Reporter: Jan Schlicht
>            Assignee: Andrei Budnik
>            Priority: Blocker
>              Labels: container-stuck
>             Fix For: 1.5.2, 1.6.2, 1.7.1, 1.8.0
>
>
> A container might get stuck in {{DESTROYING}} state if there's a command
> health check that starts new nested containers while its parent container is
> getting destroyed.
> Here are some logs with unrelated lines removed. The
> `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` sequence keeps
> looping afterwards.
> {noformat}
> 2018-04-16 12:37:54: I0416 12:37:54.235877 3863 containerizer.cpp:2807] Container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has exited
> 2018-04-16 12:37:54: I0416 12:37:54.235914 3863 containerizer.cpp:2354] Destroying container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in RUNNING state
> 2018-04-16 12:37:54: I0416 12:37:54.235932 3863 containerizer.cpp:2968] Transitioning the state of container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 from RUNNING to DESTROYING
> 2018-04-16 12:37:54: I0416 12:37:54.236100 3852 linux_launcher.cpp:514] Asked to destroy container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.237671 3852 linux_launcher.cpp:560] Using freezer to destroy cgroup mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.240327 3852 cgroups.cpp:3060] Freezing cgroup /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.244179 3852 cgroups.cpp:1415] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 after 3.814144ms
> 2018-04-16 12:37:54: I0416 12:37:54.250550 3853 cgroups.cpp:3078] Thawing cgroup /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.256599 3853 cgroups.cpp:1444] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 after 5.977856ms
> ...
> 2018-04-16 12:37:54: I0416 12:37:54.371117 3837 http.cpp:3502] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd'
> 2018-04-16 12:37:54: W0416 12:37:54.371692 3842 http.cpp:2758] Failed to launch container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd: Parent container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is in 'DESTROYING' state
> 2018-04-16 12:37:54: W0416 12:37:54.371826 3840 containerizer.cpp:2337] Attempted to destroy unknown container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.504456 3856 http.cpp:3078] Processing REMOVE_NESTED_CONTAINER call for container 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-f3a1238c-7f0f-4db3-bda4-c0ea951d46b6'
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.556367 3857 http.cpp:3502] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211'
> ...
> 2018-04-16 12:37:55: W0416 12:37:55.582137 3850 http.cpp:2758] Failed to launch container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211: Parent container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is in 'DESTROYING' state
> ...
> 2018-04-16 12:37:55: W0416 12:37:55.583330 3844 containerizer.cpp:2337] Attempted to destroy unknown container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211
> ...
> {noformat}
> The loop stops when the framework reconciles and instructs Mesos to kill the
> task, which also results in the following log line:
> {noformat}
> 2018-04-16 13:06:04: I0416 13:06:04.161623 3843 http.cpp:2966] Processing KILL_NESTED_CONTAINER call for container 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133'
> {noformat}
> Nothing else related to this container is logged following this line.
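
For readers who want a feel for the fix described in the commit message (answer the `ATTACH_CONTAINER_INPUT` call with 200 `OK` once the IO redirect has finished, even if the input connection has not reached EOF), here is a rough Python sketch. It is not the Mesos implementation, which is written in C++. The names `handle_attach_input`, `read_chunk`, `redirect_finished`, and `silent_client` are hypothetical and exist only for this illustration.

```python
import asyncio


async def handle_attach_input(read_chunk, redirect_finished):
    """Hypothetical handler, not Mesos code.

    read_chunk: coroutine function returning bytes, or b"" on EOF.
    redirect_finished: asyncio.Event set once the container's IO redirect ends.
    Returns a (status, reason) tuple.
    """
    redirect_task = asyncio.ensure_future(redirect_finished.wait())
    try:
        while True:
            read_task = asyncio.ensure_future(read_chunk())
            done, _ = await asyncio.wait(
                {read_task, redirect_task},
                return_when=asyncio.FIRST_COMPLETED,
            )
            if redirect_task in done:
                # The container is gone: stop waiting for the client's EOF.
                read_task.cancel()
                return "200 OK", "io redirect finished"
            if read_task.result() == b"":
                # Normal termination: the client closed its input stream.
                return "200 OK", "input EOF"
    finally:
        redirect_task.cancel()


async def demo():
    redirect_finished = asyncio.Event()

    async def silent_client():
        # Simulates a client that keeps the connection open without sending EOF.
        await asyncio.sleep(3600)
        return b""

    # Simulates the nested container terminating shortly after the attach.
    asyncio.get_running_loop().call_later(0.05, redirect_finished.set)
    return await asyncio.wait_for(
        handle_attach_input(silent_client, redirect_finished), timeout=1.0
    )


result = asyncio.run(demo())
print(result)
```

The point of the sketch is the race between two events: before the fix only the EOF branch could complete the call, so a client that stayed attached kept the handler (and, per the commit message, the nested container) stuck.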