[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279214#comment-15279214 ] Kirk Leon Guerrero commented on YARN-3678: -- I added 2.7.2 to the affected versions, my NM receives SIGKILL often enough its a problem, and most often SIGTERM. (non-secure mode) > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0, 2.7.2 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082201#comment-15082201 ] gu-chi commented on YARN-3678: -- same issue as confirmed with [~hex108] > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960087#comment-14960087 ] Xu Chen commented on YARN-3678: --- LGTM +1 > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562180#comment-14562180 ] gu-chi commented on YARN-3678: -- I made this https://github.com/apache/hadoop/pull/20/ > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560891#comment-14560891 ] Hadoop QA commented on YARN-3678: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | patch | 0m 0s | The patch command could not apply the patch during dryrun. | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12735592/YARN-3678.patch | | Optional Tests | | | git revision | trunk / bb18163 | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8099/console | This message was automatically generated. > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > Attachments: YARN-3678.patch > > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560748#comment-14560748 ] Hong Zhiguo commented on YARN-3678: --- the event sequence: call "SEND SIGTERM" -> pid recycle -> call "SEND SIGKILL" -> check process live time(based on current time) The time between [call "SEND SIGTERM"] and [call "SEND SIGKILL"] is 250ms The time between [pid recycle] and [check process live time] may be shorter or longer than 250ms. When it's longer than 250ms, there's chance we make false positive judgement. > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560735#comment-14560735 ] Varun Saxena commented on YARN-3678: Yeah that's why I said if we can increase value of {{pid_max}} on a 64-bit machine to highest value it can take i.e. 2^22, that should mitigate the risk of this happening. But anyways, as I mentioned above, we can fix this though. > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560676#comment-14560676 ] Varun Vasudev commented on YARN-3678: - [~zhiguohong] - sorry my question was - after applying your fix, the problem should have gone away. However, you said - "With this fix, the "accident" rate is reduced from several times per day to nearly zero." Do you know why it still happened? > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560663#comment-14560663 ] Hong Zhiguo commented on YARN-3678: --- First, "stop container" happens frequently. Second, the pid recycle doesn't need to have a whole round in 250ms. Only need to have one or more rounds during the container lifetime. If we have 100 times of "stop container" happen on one node per day, we have 100/32768, about 0.3% chance for one node one day. That's not very low, especially when we have 5000 nodes. > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560651#comment-14560651 ] Naganarasimha G R commented on YARN-3678: - Hi [~vvasudev] & [~zhiguohong], For us it happened in secure setup and one key point is both the NM user and user of the container is same . But irrespective of this it could have killed any other process[container] for same/another app running in the same node, submitted by the same user. One suggestion(crude fix not sure how to get it working for other OS) is can we grep for the containerID and confirm its the same process we are targetting and then kill it ? > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560643#comment-14560643 ] Varun Saxena commented on YARN-3678: As [~zhiguohong] mentioned, even in our case same user is used for NM and app-submitter. > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560630#comment-14560630 ] Varun Saxena commented on YARN-3678: Secure. > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560615#comment-14560615 ] Varun Saxena commented on YARN-3678: I think if we increase the value of {{pid_max}}, issue is unlikely to occur. > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560607#comment-14560607 ] Varun Vasudev commented on YARN-3678: - [~zhiguohong] thanks for the detailed explanation! When you say your fix reduced the rate to nearly zero, do you know why the accidental kill continued to happen? > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560578#comment-14560578 ] Hong Zhiguo commented on YARN-3678: --- We met same issue on our production cluster last year. The same user is used for NM and some app-submitter. I hacked the kernel __send_signal function via kprobe (https://github.com/honkiko/signal-monitor) and confirmed the happening: - container-executor sends SIGTERM to a container (say, pid = X) - The container exits quickly (in 250ms) - pid X is recycled and taken by a new spawned thread of NM - after 250ms, container-executor sends SIGKILL to pid X - NM is killed I added checking of living time before container-executor sends SIGKILL. If the process has living time shorter than 250ms, it's not the target process that we send SIGTERM to, and just skip it. With this fix, the "accident" rate is reduced from several times per day to nearly zero. If you think such fix is acceptable, I'll post it here. > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560485#comment-14560485 ] Varun Vasudev commented on YARN-3678: - Is this in secure or non-secure mode? > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559760#comment-14559760 ] Varun Saxena commented on YARN-3678: [~vinodkv], I suspect its a single user for everything. [~gu chi] can confirm though. > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559649#comment-14559649 ] Vinod Kumar Vavilapalli commented on YARN-3678: --- Tx for the update [~varun_saxena]. You mentioned LCE. But like I said before, LCE kills containers as the app-submitter. So, in your case, what is the user running the containers? > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551854#comment-14551854 ] Varun Saxena commented on YARN-3678: [~vinodkv], as this issue happened in our customer deployment, I will explain the issue. We got an issue wherein NM was being randomly killed at one of the places where Hadoop distribution is deployed. In logs, we could see NM being killed immediately after {{signalContainer}} is called. What happens is as under : # LCE sends a SIGTERM to the container and waits for 250 ms # Probably within this 250 ms period, container processes the signal and exits gracefully. # Now it is possible the pid assigned to container is taken up by some other process or thread(which run as light weight processes in Linux). # When LCE again tries to send a SIGKILL to the same pid, it might actually be sending it to another process or thread. # As we could not find any other reason for NM going randomly down, we suspect it may have gone down because some new thread of NM took up this pid and SIGKILL may have been sent to it, which may have crashed NM. This is more based on suspicion though rather than fool proof analysis. Not sure how to verify if this indeed happened. Pls note that {{pid_max}} in the deployment was {{32768}}. I am not sure about which user was the process owner though. Probably [~gu chi] can shed some light on that. An additional check can be done IMHO. > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551756#comment-14551756 ] gu-chi commented on YARN-3678: -- The PID number may be not use as a process, also can be a thread, linux treat process and thread the same, kill one thread in process may also kill the process too, for thread, 250ms is possible to start, rt? > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551681#comment-14551681 ] gu-chi commented on YARN-3678: -- I see the possibility is low, but with heavy task load, it occurs frequently. I would suggest to add a check before kill, check if the process ID belongs to the container. > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551666#comment-14551666 ] Vinod Kumar Vavilapalli commented on YARN-3678: --- The default delay is 250 milliseconds. So it is very hard to hit this condition. At least when LinuxContainerExecutor is used, the kill is done as the user itself, so it's unlikely it will affect other users' processes. Other than also doing a user-check to ensure its the same user's container, I am not sure what else can be done. > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
[ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550390#comment-14550390 ] gu-chi commented on YARN-3678: -- I think if decrease the max_pid setting in OS can enlarge the possibility of reproducing, working on > DelayedProcessKiller may kill other process other than container > > > Key: YARN-3678 > URL: https://issues.apache.org/jira/browse/YARN-3678 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.6.0 >Reporter: gu-chi >Priority: Critical > > Suppose one container finished, then it will do clean up, the PID file still > exist and will trigger once singalContainer, this will kill the process with > the pid in PID file, but as container already finished, so this PID may be > occupied by other process, this may cause serious issue. > As I know, my NM was killed unexpectedly, what I described can be the cause. > Even rarely occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332)