[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604950#comment-16604950
 ] 

Qian Zhang commented on MESOS-8568:
-----------------------------------

commit ba370822c94c8e9881eff3f63a02b38e18335ae4
Author: Qian Zhang 
Date: Thu Aug 23 17:44:53 2018 +0800

Made command check always wait before removing the nested container.
 
 Review: https://reviews.apache.org/r/68495

 

commit b5c43f40b41b44ccae05d61e4aba8d004678cde1
Author: Qian Zhang 
Date: Wed Aug 29 11:22:41 2018 +0800

Made checker library retry to remove the previous check container.
 
 Previously, when the checker library failed to remove the previous
 check container, it discarded the promise and launched a new check
 container, which caused two problems:
 1. The discarded promise was used to launch the new check container,
 so even if the new check container launched successfully, its
 check result could never be processed because the promise had
 already been discarded.
 2. The previous check container was never removed and therefore
 leaked, i.e., its runtime directory and sandbox directory were
 never cleaned up.
 
 With this patch, when the checker library fails to remove the previous
 check container, it retries the removal.
 
 Review: https://reviews.apache.org/r/68555
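
The retry behaviour described in the second commit can be sketched roughly as
follows. This is a simplified Python model, not the Mesos C++ code: `FakeAgent`,
`start_next_check`, and the `max_attempts` bound are hypothetical names
introduced for illustration, and the real patch may schedule its retries
differently.

```python
from typing import Optional


class FakeAgent:
    """Hypothetical stand-in for the agent API (not a Mesos class).

    Removal fails a configurable number of times before succeeding.
    """

    def __init__(self, failures_before_success: int) -> None:
        self.failures_left = failures_before_success
        self.containers = {"check-1"}  # the previous check container
        self.remove_calls = 0

    def remove_nested_container(self, container_id: str) -> bool:
        self.remove_calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            return False  # removal failed; the container still exists
        self.containers.discard(container_id)
        return True

    def launch_nested_container(self, container_id: str) -> None:
        self.containers.add(container_id)


def start_next_check(
    agent: FakeAgent,
    previous_id: Optional[str],
    next_id: str,
    max_attempts: int = 5,  # assumed bound, for illustration only
) -> bool:
    """Retry removing the previous check container before launching a new one."""
    if previous_id is not None:
        for _ in range(max_attempts):
            if agent.remove_nested_container(previous_id):
                break
        else:
            # Every attempt failed: do not launch, so nothing is leaked
            # and no result is tied to an abandoned promise.
            return False
    agent.launch_nested_container(next_id)
    return True
```

In this model the new check container is launched only once the previous one
is gone, which avoids both problems listed in the commit message.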

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> ------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8568
>                 URL: https://issues.apache.org/jira/browse/MESOS-8568
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.4.0, 1.4.1, 1.4.2, 1.5.0, 1.5.1, 1.6.0, 1.6.1
>            Reporter: Andrei Budnik
>            Assignee: Qian Zhang
>            Priority: Blocker
>              Labels: default-executor, health-check, mesosphere
>
> After successfully launching a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION`, the checker library calls 
> [waitNestedContainer 
> |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. The checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove the previous nested container before 
> launching a nested container for a subsequent check. Hence, the 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has terminated and can be removed/cleaned up.
> If the launch fails, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might have been 
> launched, and the subsequent attempt to remove the container without first 
> calling `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.
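
The ordering requirement from the issue description can be illustrated with a
small Python model. It is only a sketch under stated assumptions: `FakeAgent`,
`NestedContainerError`, and `cleanup_check_container` are hypothetical names,
and the fake `wait` simply marks the container as terminated, whereas a real
`WAIT_NESTED_CONTAINER` call returns once the container has exited.

```python
class NestedContainerError(Exception):
    """Hypothetical error type modelling the '500 Internal Server Error' reply."""


class FakeAgent:
    """Hypothetical in-memory model of nested container state (not a Mesos class)."""

    def __init__(self) -> None:
        self.state = {}  # container id -> "running" or "terminated"

    def launch(self, container_id: str) -> None:
        self.state[container_id] = "running"

    def wait(self, container_id: str) -> None:
        # Models WAIT_NESTED_CONTAINER: once this returns, the container
        # is known to have terminated.
        if container_id in self.state:
            self.state[container_id] = "terminated"

    def remove(self, container_id: str) -> None:
        # Models REMOVE_NESTED_CONTAINER: refuses to remove a running container.
        if self.state.get(container_id) == "running":
            raise NestedContainerError("Nested container has not terminated yet")
        self.state.pop(container_id, None)


def cleanup_check_container(agent: FakeAgent, container_id: str) -> None:
    """Always wait for termination before asking for removal."""
    agent.wait(container_id)
    agent.remove(container_id)
```

Calling `remove` directly on a running container raises the modelled error,
while `cleanup_check_container` succeeds because the wait always comes first.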



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
