[ 
https://issues.apache.org/jira/browse/YARN-8259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496492#comment-16496492
 ] 

Shane Kumpf edited comment on YARN-8259 at 5/31/18 12:51 PM:
-------------------------------------------------------------

I've been doing additional testing here and could use input from the community 
as all of the solutions have cons. Here is what I've tested and been 
considering.
----
1) */proc/pid check as yarn*

Pros:
 * No c-e changes
 * Works with Docker live restore

Cons:
 * Breaks down when using hidepid
 * Portability
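
For illustration only, the /proc check in option 1 could look roughly like the sketch below. The function name and the {{proc_root}} parameter are hypothetical and not taken from the patch. It also shows where the cons come from: with hidepid=2 on the /proc mount, entries for other users' processes are hidden from the yarn user, so a live process looks dead, and /proc itself is Linux-specific.

```python
import os


def proc_entry_exists(pid: int, proc_root: str = "/proc") -> bool:
    """Return True if <proc_root>/<pid> is visible to the calling user.

    Hypothetical helper: a hidden entry (e.g. under hidepid=2) and a
    dead process are indistinguishable here; both return False.
    """
    return os.path.isdir(os.path.join(proc_root, str(pid)))


if __name__ == "__main__":
    # The current process always has a visible /proc entry on Linux.
    print(proc_entry_exists(os.getpid()))
```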

----
2) */proc/pid or kill -0 as privileged user*

Pros:
 * Works with Docker live restore

Cons:
 * Circumvents hidepid: the yarn user can check the existence of any pid 
because the check runs with elevated privileges.
 * Portability (/proc method)
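
As a rough sketch of what the null signal (kill -0) check reports, and assuming a POSIX system: the helper below is hypothetical and is not the container-executor code. The EPERM branch is the case that matters for privileged containers, where the process exists but is owned by a different user than the caller. Running the check as a privileged user avoids that branch, which is also why it sidesteps hidepid.

```python
import os


def null_signal_check(pid: int) -> str:
    """Send signal 0 to pid and classify the outcome.

    Signal 0 performs the existence and permission checks without
    actually delivering a signal.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # ESRCH: no such process.
        return "dead"
    except PermissionError:
        # EPERM: the process exists but belongs to another user.
        return "alive-not-signalable"
    return "alive"


if __name__ == "__main__":
    print(null_signal_check(os.getpid()))
```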

----
3) *docker inspect*

Pros:
 * No c-e changes
 * Uses the Docker API

Cons:
 * Requires retry handling to support Docker live restore.
 ** In the case of a Docker daemon upgrade, this means the upgrade must 
complete before the retries are exhausted, which could mean hundreds of retries.
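
To make the retry concern concrete, here is a minimal sketch of a docker inspect based check. The function names, retry count, and delay are assumptions chosen for illustration, not values from the patch. It treats any failed inspect as "unknown" and retries; a real implementation would also need to tell "daemon unavailable" apart from "no such container", which this sketch does not attempt.

```python
import subprocess
import time
from typing import Callable, Optional


def docker_inspect_running(container_id: str) -> Optional[bool]:
    """Return True/False if the daemon answers, None if the command fails."""
    try:
        result = subprocess.run(
            ["docker", "inspect", "--format", "{{.State.Running}}", container_id],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        # Simplification: daemon down and unknown container look the same.
        return None
    return result.stdout.strip() == "true"


def is_alive_with_retry(
    container_id: str,
    inspect: Callable[[str], Optional[bool]] = docker_inspect_running,
    max_retries: int = 5,          # hypothetical default
    delay_seconds: float = 1.0,    # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry while the state is unknown; give up after max_retries retries."""
    for attempt in range(max_retries + 1):
        state = inspect(container_id)
        if state is not None:
            return state
        if attempt < max_retries:
            sleep(delay_seconds)
    # Retries exhausted: the container is reported as not alive, even if
    # it is still running under live restore.
    return False
```

If the daemon stays down longer than max_retries * delay_seconds, a container that live restore kept running would be reported as dead, which is the con noted above.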

----
4) *Hybrid* (Keep existing kill -0 for non-privileged, docker inspect for 
privileged)

Pros:
 * No c-e changes
 * Limits the live restore impact to privileged containers

Cons:
 * Requires retry handling to support Docker live restore.
 * Different handling based on container type.
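
The hybrid dispatch could be sketched as below. All names are hypothetical, and the two checks are passed in as callables so the sketch stays independent of how each one is implemented.

```python
from typing import Callable


def hybrid_is_alive(
    pid: int,
    container_id: str,
    privileged: bool,
    signal_check: Callable[[int], bool],
    inspect_check: Callable[[str], bool],
) -> bool:
    """Pick the liveness check based on the container type."""
    if privileged:
        # Privileged containers may run as a different user, so ask Docker.
        return inspect_check(container_id)
    # Non-privileged containers keep the existing null signal check.
    return signal_check(pid)
```

Only the privileged path depends on the Docker daemon being reachable, so the retry handling is confined to that branch.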

----
I believe #2 is a non-starter as it silently bypasses the hidepid option. I'm 
leaning towards striking #3 from the list as well, as we really need the 
recovery logic to be solid, so I don't want to unnecessarily impact 
non-privileged containers, which appear to be working well.

At this point, I'm leaning towards #4 or #1 (with docs indicating that the NM 
user must be whitelisted if hidepid is enabled).


> Revisit liveliness checks for Docker containers
> -----------------------------------------------
>
>                 Key: YARN-8259
>                 URL: https://issues.apache.org/jira/browse/YARN-8259
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 3.0.2, 3.2.0, 3.1.1
>            Reporter: Shane Kumpf
>            Assignee: Shane Kumpf
>            Priority: Blocker
>              Labels: Docker
>         Attachments: YARN-8259.001.patch
>
>
> As privileged containers may execute as a user that does not match the YARN 
> run as user, sending the null signal for liveliness checks could fail. We 
> need to reconsider how liveliness checks are handled in the Docker case.


