[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2016-05-25 Thread Eric Payne (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300266#comment-15300266
 ] 

Eric Payne commented on YARN-4459:
--

Thanks [~hex108] for the fix and [~jlowe] for the update and reveiw.
Patch LGTM.
+1

> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch, 
> YARN-4459.03.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(the 
> wrongly killed process might just be a newly started thread and was NM's 
> child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2016-05-24 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298361#comment-15298361
 ] 

Jun Gong commented on YARN-4459:


Thanks [~jlowe] for review and updating the patch!

{quote}
This could be improved upon by adding a just-before-kill check of some sort 
and/or proactive cancelling of the timer when we see the child process exit 
before the SIGKILL is sent. 
{quote}
Sorry for not adding it for a long time. I will add it in following jira if it 
is OK.

> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch, 
> YARN-4459.03.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(the 
> wrongly killed process might just be a newly started thread and was NM's 
> child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2016-05-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296930#comment-15296930
 ] 

Hadoop QA commented on YARN-4459:
-

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 10s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
25s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
11s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
21s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 22s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 22s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 22s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
11s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 11m 28s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
18s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 20m 59s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:2c91fd8 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12805718/YARN-4459.03.patch |
| JIRA Issue | YARN-4459 |
| Optional Tests |  asflicense  compile  cc  mvnsite  javac  unit  |
| uname | Linux 094e9f4f0d09 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / ac95448 |
| Default Java | 1.8.0_91 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/11637/testReport/ |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/11637/console |
| Powered by | Apache Yetus 0.2.0   http://yetus.apache.org |


This message was automatically generated.



> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch, 
> YARN-4459.03.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(the 
> wrongly killed process might just be a newly started thread and was NM's 
> child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoo

[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2016-05-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296814#comment-15296814
 ] 

Hadoop QA commented on YARN-4459:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 13s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 
37s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
10s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
11s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 11m 49s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
16s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 21m 59s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:2c91fd8 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/1254/YARN-4459.02.patch |
| JIRA Issue | YARN-4459 |
| Optional Tests |  asflicense  compile  cc  mvnsite  javac  unit  |
| uname | Linux 522b1ceab420 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / ac95448 |
| Default Java | 1.8.0_91 |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/11634/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/11634/testReport/ |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/11634/console |
| Powered by | Apache Yetus 0.2.0   http://yetus.apache.org |


This message was automatically generated.



> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(

[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2016-05-23 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296729#comment-15296729
 ] 

Jason Lowe commented on YARN-4459:
--

Sorry to arrive to this late.  I agree that we should be killing the session 
and not the pid.  It's not a perfect solution, but it _drastically_ reduces the 
likelihood of the wrong process getting killed.  This could be improved upon by 
adding a just-before-kill check of some sort and/or proactive cancelling of the 
timer when we see the child process exit before the SIGKILL is sent.  However 
rather than holding up this significant improvement waiting for those things to 
be added, I propose we add this now and further iterate on it in a subsequent 
JIRA.

+1 for the patch.  Will commit this in a couple of days if there are no 
objections.


> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(the 
> wrongly killed process might just be a newly started thread and was NM's 
> child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2016-01-11 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091926#comment-15091926
 ] 

Jun Gong commented on YARN-4459:


Thanks [~vvasudev] the review and suggestion.

{quote}
assume that the process with the recycled pid does a setsid call - then the 
process group check will succeed and we might still end up killing the wrong 
process, no?
{quote}
Yes, it might kill a wrong process, I have not found a perfect method to avoid 
this. At least it will not kill NM.

{quote}
The assumptions of the patch are 
1) The new process will not belong to the same user and 
2) The new process has not called setsid
{quote}
yeah, the *and* is *or* actually. It will kill a wrong process when new process 
belongs to the same user and has been called setsid. The rate is lower.

{quote}
I suspect we might need to add a timing check similar to the one proposed in 
https://issues.apache.org/jira/browse/YARN-3678?focusedCommentId=14560578&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14560578
{quote}
Adding this check will reduce the rate that kills a wrong process.  Will add it 
later. It could not avoid killing a wrong process either as [~zhiguohong] said 
in 
https://issues.apache.org/jira/browse/YARN-3678?focusedCommentId=14560748&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14560748.

> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(the 
> wrongly killed process might just be a newly started thread and was NM's 
> child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2016-01-11 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091701#comment-15091701
 ] 

Naganarasimha G R commented on YARN-4459:
-

+1 for timing check solution than this approrach !

> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(the 
> wrongly killed process might just be a newly started thread and was NM's 
> child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2016-01-11 Thread Varun Vasudev (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091628#comment-15091628
 ] 

Varun Vasudev commented on YARN-4459:
-

[~hex108] - assume that the process with the recycled pid does a setsid call - 
then the process group check will succeed and we might still end up killing the 
wrong process, no? 

The assumptions of the patch are 
1) The new process will not belong to the same user and 
2) The new process has not called setsid

Correct?

> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(the 
> wrongly killed process might just be a newly started thread and was NM's 
> child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2016-01-04 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082158#comment-15082158
 ] 

Jun Gong commented on YARN-4459:


Thanks [~Naganarasimha] for the review, very appreciate it!

{quote}
IIUC existing code checks whether container process has created any sub process 
then kill all the process, else if its a single process then i presume 
kill(-pid,0) will return -1 then it tries to kill only the container process id 
only. Can you confirm this by testing?
{quote}
I do not have a good method to add a test case for it now. I will try to 
explain it by following test cases:
1. If container's parent process does not exist and its child process does 
exist, existing code works well.
{code}
root@bd74d89a2294:/$ cat test.sh 
sleep  &
PID_FOR_SLEEP=$!
PGID_FOR_SLEEP=$(ps -p $PID_FOR_SLEEP -o pgid=)
echo "PID for 'sleep ' : $PID_FOR_SLEEP, its pgid : $PGID_FOR_SLEEP"
root@bd74d89a2294:/$ ./test.sh 
PID for 'sleep ' : 26877, its pgid : 26876
root@bd74d89a2294:/$ ps -ef | grep sleep
root 26877 1  0 08:48 pts/000:00:00 sleep 
root 26880 26797  0 08:48 pts/000:00:00 grep sleep
root@bd74d89a2294:/$ kill -0 -26876
root@bd74d89a2294:/$ kill -15 -26876
root@bd74d89a2294:/$ ps -ef | grep sleep
root 26882 26797  0 08:48 pts/000:00:00 grep sleep
{code}

2. If container's parent process does not exist and its child process does not 
exist either, existing code will kill process wrongly.
{code}
root@bd74d89a2294:/$ cat test.sh 
sleep 2 &
PID_FOR_SLEEP=$!
PGID_FOR_SLEEP=$(ps -p $PID_FOR_SLEEP -o pgid=)
echo "PID for 'sleep 2' : $PID_FOR_SLEEP, its pgid : $PGID_FOR_SLEEP"
root@bd74d89a2294:/$ ./test.sh 
PID for 'sleep 2' : 26890, its pgid : 26889
root@bd74d89a2294:/$ ps -ef | grep sleep
root 26893 26797  0 08:56 pts/000:00:00 grep sleep
root@bd74d89a2294:/$ kill -0 26889
-bash: kill: (26889) - No such process
{code}
Then we check existing code in container-executor.c for the above case:
{code}
  if (kill(-pid,0) < 0) {
if (kill(pid, 0) < 0) {
  if (errno == ESRCH) {
return INVALID_CONTAINER_PID;
  }
  fprintf(LOGFILE, "Error signalling container %d with %d - %s\n",
  pid, sig, strerror(errno));
  return -1;
} else {
  has_group = 0;
}
  }

  if (kill((has_group ? -1 : 1) * pid, sig) < 0) {
{code}
*kill(-pid,0)* will return -1. If *pid* is reused for a new process(suppose 
that the container has survived for a long time, pid recycle occurs, then this 
pid might be reused again), *kill(pid, 0)* will return 0, then *has_group* will 
be set to 0, and the code *kill((has_group ? -1 : 1) * pid, sig)* will try to 
kill *pid* with *sig*. This is the problem.

> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(the 
> wrongly killed process might just be a newly started thread and was NM's 
> child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2016-01-04 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081372#comment-15081372
 ] 

Naganarasimha G R commented on YARN-4459:
-

Hi [~hex108],
thanks for working on this jira. I am not from c back ground,  neverthless 
checked the API of kill and few doubts i have here
IIUC existing code checks whether container process has created any sub process 
then kill all the process, else if its a single process then i presume 
{{kill(-pid,0)}} will return {{-1}} then it tries to kill only the container 
process id only. Can you confirm this by testing?
I just tested this with unix command {{kill}} what i could understand was 
{{kill -0 -- -}} will be successfull and {{$?}} will 
return *0* but when i run {{kill -0 -- -}} then 
{{bash: kill: (-10967) - No such process}} will thrown.
Correct me if my understanding is wrong.
cc/ @[~vvasudev].

> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(the 
> wrongly killed process might just be a newly started thread and was NM's 
> child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2015-12-22 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068996#comment-15068996
 ] 

Jun Gong commented on YARN-4459:


Thanks [~Naganarasimha] for the info. Yes, it seems same problem. I think the 
problem does not only exist for 'DelayedProcessKiller', it might occur for 
every call to 'signal_container_as_user'. 

> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(the 
> wrongly killed process might just be a newly started thread and was NM's 
> child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2015-12-22 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067999#comment-15067999
 ] 

Naganarasimha G R commented on YARN-4459:
-

Is this issue similar to YARN-3678 ?

> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(the 
> wrongly killed process might just be a newly started thread and was NM's 
> child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2015-12-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058211#comment-15058211
 ] 

Hadoop QA commented on YARN-4459:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
16s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 31s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
28s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 45s {color} 
| {color:red} hadoop-yarn-server-nodemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 13s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
24s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 30m 43s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | 
hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService
 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/1254/YARN-4459.02.patch |
| JIRA Issue | YARN-4459 |
| Optional Tests |  asflicense  compile  cc  mvnsite  javac  unit  |
| uname | Linux 72717a2b4b82 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 0c3a53e |
| unit | 
https://builds.apache.org/job/PreCommit-YARN-Build/9983/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt
 |
| unit test logs |  
https://builds.apache.org/job/PreCommit-YARN-Build/9983/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt
 |
| JDK v1.7.0_91  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9983/testReport/ |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn

[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2015-12-15 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058163#comment-15058163
 ] 

Jun Gong commented on YARN-4459:


Update a new patch to fix whitespace error.

> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first 
> checks whether process group exists, if not, it will kill the process 
> itself(if it the process exists).  It is not reasonable because that the 
> process group does not exist means corresponding container has finished, if 
> we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for 
> starting NM and submitted app, and container-executor sometimes killed NM(the 
> wrongly killed process might just be a newly started thread and was NM's 
> child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

2015-12-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058099#comment-15058099
 ] 

Hadoop QA commented on YARN-4459:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 34s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 31s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
14s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 1 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 58s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 18s 
{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with 
JDK v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
24s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 32m 22s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/1242/YARN-4459.01.patch |
| JIRA Issue | YARN-4459 |
| Optional Tests |  asflicense  compile  cc  mvnsite  javac  unit  |
| uname | Linux d353811253b1 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed 
Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh 
|
| git revision | trunk / 0c3a53e |
| whitespace | 
https://builds.apache.org/job/PreCommit-YARN-Build/9981/artifact/patchprocess/whitespace-tabs.txt
 |
| JDK v1.7.0_91  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/9981/testReport/ |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 U: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 |
| Max memory used | 76MB |
| Powered by | Apache Yetus 0.1.0   http://yetus.apache.org |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/9981/console |


This message was automatically generated.



> container-executor might kill process wrongly
> -
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
>