[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604950#comment-16604950 ]
Qian Zhang commented on MESOS-8568:
-----------------------------------

commit ba370822c94c8e9881eff3f63a02b38e18335ae4
Author: Qian Zhang
Date: Thu Aug 23 17:44:53 2018 +0800

Made command check always waits before removing the nested container.

Review: [https://reviews.apache.org/r/68495]

commit b5c43f40b41b44ccae05d61e4aba8d004678cde1
Author: Qian Zhang
Date: Wed Aug 29 11:22:41 2018 +0800

Made checker library retry to remove the previous check container.

Previously when checker library fails to remove the previous check
container, it will discard the promise and launch a new check container
which will cause two problems:
1. The discarded promise is used to launch the new check container, that
   means even the new check container is launched successfully, we still
   have no chance to process its check result since the promise has
   already been discarded.
2. The previous check container will never get a chance to be removed
   which is leak, i.e., its runtime directory and sandbox directory will
   not be removed.

Now in this patch, when checker library fails to remove the previous check
container, we make it remove the previous check container again.

Review: https://reviews.apache.org/r/68555

> Command checks should always call `WAIT_NESTED_CONTAINER` before
> `REMOVE_NESTED_CONTAINER`
> ------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8568
>                 URL: https://issues.apache.org/jira/browse/MESOS-8568
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.6.0, 1.6.1
>            Reporter: Andrei Budnik
>            Assignee: Qian Zhang
>            Priority: Blocker
>              Labels: default-executor, health-check, mesosphere
>
> After successful launch of a nested container via
> `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls
> [waitNestedContainer
> |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
> for the container.
> The checker library
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
> `REMOVE_NESTED_CONTAINER` to remove a previous nested container before
> launching a nested container for a subsequent check. Hence, the
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that
> the nested container has been terminated and can be removed/cleaned up.
> In case of a launch failure, the library [doesn't
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
> `WAIT_NESTED_CONTAINER`. Despite the failure, the container might still have
> been launched, and the following attempt to remove the container without
> calling `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal
> Server Error' (Nested container has not terminated yet) while removing the
> nested container
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
> used for the COMMAND check for task
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before
> `REMOVE_NESTED_CONTAINER`.
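
For illustration only, the ordering described in this issue (wait, then remove, and retry the removal instead of launching a new check) could be sketched as below. This is not the code from the patches above: `FakeAgent`, `cleanupPrevious`, `transientRemoveFailures`, and the bounded `maxAttempts` are hypothetical names and assumptions made for the sketch, and the agent is simulated in memory rather than reached through the real agent API.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical in-memory stand-in for the agent; not Mesos code.
enum class State { RUNNING, TERMINATED };

struct FakeAgent {
  std::map<std::string, State> containers;
  int transientRemoveFailures = 0;  // Upcoming remove() calls that will fail.

  void launch(const std::string& id) { containers[id] = State::RUNNING; }

  // Models WAIT_NESTED_CONTAINER: the real call blocks until the container
  // exits; the simulation simply marks it terminated.
  bool wait(const std::string& id) {
    auto it = containers.find(id);
    if (it == containers.end()) return false;  // Unknown container.
    it->second = State::TERMINATED;
    return true;
  }

  // Models REMOVE_NESTED_CONTAINER: a container that is still running is
  // rejected ("Nested container has not terminated yet").
  bool remove(const std::string& id) {
    auto it = containers.find(id);
    if (it == containers.end()) return true;  // Nothing left to remove.
    if (it->second != State::TERMINATED) return false;
    if (transientRemoveFailures > 0) {
      --transientRemoveFailures;  // Simulated transient failure.
      return false;
    }
    containers.erase(it);
    return true;
  }
};

// Cleans up the previous check container. Returns true only once the
// container is gone; the caller should launch the next check only then.
// The attempt limit is an assumption of this sketch.
bool cleanupPrevious(FakeAgent& agent, const std::string& id, int maxAttempts) {
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    agent.wait(id);  // Always wait first, even if the launch reported failure.
    if (agent.remove(id)) return true;
  }
  return false;  // Still present: keep retrying later, do not launch.
}
```

In this sketch, calling `remove` on a freshly launched container fails, while `cleanupPrevious` succeeds because it waits first; when removal keeps failing, the function reports failure so that the caller does not start a new check container on top of a leaked one.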