[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2016-05-10 Thread Kirk Leon Guerrero (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279214#comment-15279214
 ] 

Kirk Leon Guerrero commented on YARN-3678:
--

I added 2.7.2 to the affected versions, my NM receives SIGKILL often enough its 
a problem, and most often SIGTERM. (non-secure mode)

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0, 2.7.2
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2016-01-04 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082201#comment-15082201
 ] 

gu-chi commented on YARN-3678:
--

same issue as confirmed with [~hex108]

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-10-15 Thread Xu Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960087#comment-14960087
 ] 

Xu Chen commented on YARN-3678:
---

LGTM +1

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562180#comment-14562180
 ] 

gu-chi commented on YARN-3678:
--

I made this https://github.com/apache/hadoop/pull/20/

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560891#comment-14560891
 ] 

Hadoop QA commented on YARN-3678:
-

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply 
the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12735592/YARN-3678.patch |
| Optional Tests |  |
| git revision | trunk / bb18163 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/8099/console |


This message was automatically generated.

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
> Attachments: YARN-3678.patch
>
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread Hong Zhiguo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560748#comment-14560748
 ] 

Hong Zhiguo commented on YARN-3678:
---

the event sequence:
call "SEND SIGTERM"  ->  pid recycle   ->  call "SEND SIGKILL"  -> check 
process live time(based on current time)

The time between [call "SEND SIGTERM"] and [call "SEND SIGKILL"] is 250ms
The time between [pid recycle] and [check process live time] may be shorter or 
longer than 250ms. When it's longer than 250ms, there's chance we make false 
positive judgement.

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560735#comment-14560735
 ] 

Varun Saxena commented on YARN-3678:


Yeah that's why I said if we can increase value of {{pid_max}} on a 64-bit 
machine to highest value it can take i.e. 2^22, that should mitigate the risk 
of this happening. But anyways, as I mentioned above, we can fix this though.

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560676#comment-14560676
 ] 

Varun Vasudev commented on YARN-3678:
-

[~zhiguohong] - sorry my question was - after applying your fix, the problem 
should have gone away. However, you said - "With this fix, the "accident" rate 
is reduced from several times per day to nearly zero." Do you know why it still 
happened?

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread Hong Zhiguo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560663#comment-14560663
 ] 

Hong Zhiguo commented on YARN-3678:
---

First, "stop container" happens frequently.
Second, the pid recycle doesn't need to have a whole round in 250ms.  Only need 
to have one or more rounds during the container lifetime.

If we have 100 times of "stop container" happen on one node per day, we have 
100/32768, about 0.3% chance for one node one day. That's not very low, 
especially when we have 5000 nodes.


> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560651#comment-14560651
 ] 

Naganarasimha G R commented on YARN-3678:
-

Hi [~vvasudev] & [~zhiguohong], For us it happened in secure setup and one key 
point is both the NM user and user of the container is same . But irrespective 
of this it could have killed any other process[container] for same/another app 
running in the same node, submitted by the same user. One suggestion(crude fix 
not sure how to get it working for other OS) is can we grep for the containerID 
and confirm its the same process we are targetting and then kill it  ? 

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560643#comment-14560643
 ] 

Varun Saxena commented on YARN-3678:


As [~zhiguohong] mentioned, even in our case same user is used for NM and 
app-submitter.

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560630#comment-14560630
 ] 

Varun Saxena commented on YARN-3678:


Secure.

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560615#comment-14560615
 ] 

Varun Saxena commented on YARN-3678:


I think if we increase the value of {{pid_max}}, issue is unlikely to occur.

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560607#comment-14560607
 ] 

Varun Vasudev commented on YARN-3678:
-

[~zhiguohong] thanks for the detailed explanation! When you say your fix 
reduced the rate to nearly zero, do you know why the accidental kill continued 
to happen?

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-27 Thread Hong Zhiguo (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560578#comment-14560578
 ] 

Hong Zhiguo commented on YARN-3678:
---

We met same issue on our production cluster last year.  The same user  is used 
for NM and some app-submitter.
I hacked the kernel __send_signal function via kprobe 
(https://github.com/honkiko/signal-monitor) and confirmed the happening:
 - container-executor sends SIGTERM to a container (say, pid = X)
 - The container exits quickly (in 250ms)
 - pid X is recycled and taken by a new spawned thread of NM
 - after 250ms, container-executor sends SIGKILL to pid X
 - NM is killed

I added checking of living time before container-executor sends SIGKILL. If the 
process has living time shorter than 250ms,  it's not the target process that 
we send SIGTERM to, and just skip it.

With this fix, the "accident" rate is reduced from several times per day to 
nearly zero.
If you think such fix is acceptable, I'll post it here.

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-26 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560485#comment-14560485
 ] 

Varun Vasudev commented on YARN-3678:
-

Is this in secure or non-secure mode?

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-26 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559760#comment-14559760
 ] 

Varun Saxena commented on YARN-3678:


[~vinodkv], I suspect its a single user for everything. [~gu chi] can  confirm 
though.

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-26 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559649#comment-14559649
 ] 

Vinod Kumar Vavilapalli commented on YARN-3678:
---

Tx for the update [~varun_saxena]. You mentioned LCE. But like I said before, 
LCE kills containers as the app-submitter.  So, in your case, what is the user 
running the containers?

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-19 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551854#comment-14551854
 ] 

Varun Saxena commented on YARN-3678:


[~vinodkv], as this issue happened in our customer deployment, I will explain 
the issue. We got an issue wherein NM was being randomly killed at one of the 
places where Hadoop distribution is deployed. In logs, we could see NM being 
killed immediately after {{signalContainer}} is called. What happens is as 
under : 
# LCE sends a SIGTERM to the container and waits for 250 ms
# Probably within this 250 ms period, container processes the signal and exits 
gracefully.
# Now it is possible the pid assigned to container is taken up by some other 
process or thread(which run as light weight processes in Linux).
# When LCE again tries to send a SIGKILL to the same pid, it might actually be 
sending it to another process or thread.
# As we could not find any other reason for NM going randomly down, we suspect 
it may have gone down because some new thread of NM took up this pid and 
SIGKILL may have been sent to it, which may have crashed NM. This is more based 
on suspicion though rather than fool proof analysis. Not sure how to verify if 
this indeed happened.

Pls note that {{pid_max}} in the deployment was {{32768}}.
I am not sure about which user was the process owner though. Probably [~gu chi] 
can shed some light on that.
An additional check can be done IMHO.

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-19 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551756#comment-14551756
 ] 

gu-chi commented on YARN-3678:
--

The PID number may be not use as a process, also can be a thread, linux treat 
process and thread the same, kill one thread in process may also kill the 
process too, for thread, 250ms is possible to start, rt?

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-19 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551681#comment-14551681
 ] 

gu-chi commented on YARN-3678:
--

I see the possibility is low, but with heavy task load, it occurs frequently. I 
would suggest to add a check before kill, check if the process ID belongs to 
the container.

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-19 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551666#comment-14551666
 ] 

Vinod Kumar Vavilapalli commented on YARN-3678:
---

The default delay is 250 milliseconds. So it is very hard to hit this condition.

At least when LinuxContainerExecutor is used, the kill is done as the user 
itself, so it's unlikely it will affect other users' processes.

Other than also doing a user-check to ensure its the same user's container, I 
am not sure what else can be done.

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container

2015-05-19 Thread gu-chi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550390#comment-14550390
 ] 

gu-chi commented on YARN-3678:
--

I think if decrease the max_pid setting in OS can enlarge the possibility of 
reproducing, working on

> DelayedProcessKiller may kill other process other than container
> 
>
> Key: YARN-3678
> URL: https://issues.apache.org/jira/browse/YARN-3678
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.6.0
>Reporter: gu-chi
>Priority: Critical
>
> Suppose one container finished, then it will do clean up, the PID file still 
> exist and will trigger once singalContainer, this will kill the process with 
> the pid in PID file, but as container already finished, so this PID may be 
> occupied by other process, this may cause serious issue.
> As I know, my NM was killed unexpectedly, what I described can be the cause. 
> Even rarely occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)