[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300266#comment-15300266 ]

Eric Payne commented on YARN-4459:

Thanks [~hex108] for the fix and [~jlowe] for the update and review. Patch LGTM. +1

> container-executor might kill process wrongly
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch, YARN-4459.03.patch
>
> When 'signal_container_as_user' is called in container-executor, it first
> checks whether the process group exists; if not, it kills the process
> itself (if the process exists). This is wrong: if the process group does
> not exist, the corresponding container has already finished, so killing
> the process itself kills the wrong process.
> We found that this happened in our cluster many times. We used the same
> account to start the NM and to submit the app, and container-executor
> sometimes killed the NM (the wrongly killed process might just be a newly
> started thread that was the NM's child process).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298361#comment-15298361 ]

Jun Gong commented on YARN-4459:

Thanks [~jlowe] for reviewing and updating the patch!

{quote}
This could be improved upon by adding a just-before-kill check of some sort and/or proactive cancelling of the timer when we see the child process exit before the SIGKILL is sent.
{quote}

Sorry for not adding it for so long. I will add it in a follow-up JIRA if that is OK.
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296930#comment-15296930 ]

Hadoop QA commented on YARN-4459:

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 10s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 6m 25s | trunk passed |
| +1 | compile | 0m 24s | trunk passed |
| +1 | mvnsite | 0m 26s | trunk passed |
| +1 | mvneclipse | 0m 11s | trunk passed |
| +1 | mvninstall | 0m 21s | the patch passed |
| +1 | compile | 0m 22s | the patch passed |
| +1 | cc | 0m 22s | the patch passed |
| +1 | javac | 0m 22s | the patch passed |
| +1 | mvnsite | 0m 30s | the patch passed |
| +1 | mvneclipse | 0m 11s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | unit | 11m 28s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | asflicense | 0m 18s | Patch does not generate ASF License warnings. |
| | | 20m 59s | |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:2c91fd8 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12805718/YARN-4459.03.patch |
| JIRA Issue | YARN-4459 |
| Optional Tests | asflicense compile cc mvnsite javac unit |
| uname | Linux 094e9f4f0d09 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ac95448 |
| Default Java | 1.8.0_91 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/11637/testReport/ |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/11637/console |
| Powered by | Apache Yetus 0.2.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296814#comment-15296814 ]

Hadoop QA commented on YARN-4459:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 13s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 6m 37s | trunk passed |
| +1 | compile | 0m 24s | trunk passed |
| +1 | mvnsite | 0m 26s | trunk passed |
| +1 | mvneclipse | 0m 10s | trunk passed |
| +1 | mvninstall | 0m 32s | the patch passed |
| +1 | compile | 0m 30s | the patch passed |
| +1 | cc | 0m 30s | the patch passed |
| +1 | javac | 0m 30s | the patch passed |
| +1 | mvnsite | 0m 35s | the patch passed |
| +1 | mvneclipse | 0m 11s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| -1 | unit | 11m 49s | hadoop-yarn-server-nodemanager in the patch failed. |
| +1 | asflicense | 0m 16s | Patch does not generate ASF License warnings. |
| | | 21m 59s | |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:2c91fd8 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/1254/YARN-4459.02.patch |
| JIRA Issue | YARN-4459 |
| Optional Tests | asflicense compile cc mvnsite javac unit |
| uname | Linux 522b1ceab420 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ac95448 |
| Default Java | 1.8.0_91 |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/11634/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/11634/testReport/ |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/11634/console |
| Powered by | Apache Yetus 0.2.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296729#comment-15296729 ]

Jason Lowe commented on YARN-4459:

Sorry to arrive late to this. I agree that we should be killing the session and not the pid. It's not a perfect solution, but it _drastically_ reduces the likelihood of the wrong process getting killed. This could be improved upon by adding a just-before-kill check of some sort and/or proactive cancelling of the timer when we see the child process exit before the SIGKILL is sent. However, rather than holding up this significant improvement waiting for those things to be added, I propose we add this now and iterate further on it in a subsequent JIRA.

+1 for the patch. Will commit this in a couple of days if there are no objections.
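The idea of cancelling the pending SIGKILL once the child is seen to exit can be sketched in C as follows. This is only an illustration under a stated assumption: the signaller is the direct parent of the target and can reap it with `waitpid`, which the thread does not say is true of the NodeManager setup. The name `terminate_child`, its return codes and the 10 ms polling interval are hypothetical. The sketch relies on the fact that a pid cannot be recycled until its parent has reaped it.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

/* Hypothetical helper: send SIGTERM, then escalate to SIGKILL only if the
 * child has not exited within grace_ms milliseconds.
 * Returns 0 if the child exited after SIGTERM (SIGKILL never sent),
 *         1 if SIGKILL was needed, -1 on error. */
int terminate_child(pid_t child, int grace_ms) {
  if (kill(child, SIGTERM) < 0) {
    return -1;
  }
  struct timespec tick = {0, 10 * 1000 * 1000}; /* 10 ms poll interval */
  for (int waited = 0; waited < grace_ms; waited += 10) {
    pid_t r = waitpid(child, NULL, WNOHANG);
    if (r == child) {
      return 0; /* child reaped: the pending SIGKILL is simply never sent */
    }
    if (r < 0) {
      return -1;
    }
    nanosleep(&tick, NULL);
  }
  /* The child is still unreaped, so its pid cannot have been reused. */
  if (kill(child, SIGKILL) < 0) {
    return -1;
  }
  while (waitpid(child, NULL, 0) < 0) {
    if (errno != EINTR) {
      return -1;
    }
  }
  return 1;
}
```

A caller would fork the child, keep its pid, and call `terminate_child(pid, grace_ms)` at shutdown. When the signaller is not the parent, this guarantee does not hold and some other identity check is needed.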
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091926#comment-15091926 ]

Jun Gong commented on YARN-4459:

Thanks [~vvasudev] for the review and suggestion.

{quote}
assume that the process with the recycled pid does a setsid call - then the process group check will succeed and we might still end up killing the wrong process, no?
{quote}

Yes, it might kill the wrong process. I have not found a perfect method to avoid this. At least it will not kill the NM.

{quote}
The assumptions of the patch are 1) The new process will not belong to the same user and 2) The new process has not called setsid
{quote}

Yes, although the *and* is actually *or*: it will kill the wrong process only when the new process belongs to the same user and has called setsid, so it happens less often.

{quote}
I suspect we might need to add a timing check similar to the one proposed in https://issues.apache.org/jira/browse/YARN-3678?focusedCommentId=14560578&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14560578
{quote}

Adding this check will reduce the rate at which the wrong process is killed. I will add it later. It cannot completely avoid killing the wrong process either, as [~zhiguohong] said in https://issues.apache.org/jira/browse/YARN-3678?focusedCommentId=14560748&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14560748.
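One possible form of such a timing check is to record the process start time when the container is launched and compare it again just before signalling. The sketch below is an assumption about what a check of this kind could look like, not the YARN-3678 proposal itself, which is not reproduced in this thread. It is Linux-specific: it reads field 22 (`starttime`) of `/proc/<pid>/stat`. The names `read_start_time` and `is_same_process` are hypothetical. As noted above, a window remains between the check and the `kill`, so this lowers the rate of wrong kills without eliminating them.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: read the start time of pid (clock ticks since boot)
 * from /proc/<pid>/stat. Returns 0 on success, -1 on failure. */
int read_start_time(pid_t pid, unsigned long long *start) {
  char path[64];
  char buf[4096];
  snprintf(path, sizeof(path), "/proc/%ld/stat", (long) pid);
  FILE *f = fopen(path, "r");
  if (f == NULL) {
    return -1;
  }
  size_t n = fread(buf, 1, sizeof(buf) - 1, f);
  fclose(f);
  buf[n] = '\0';
  /* The command name (field 2) may contain spaces and parentheses, so
   * resume parsing after the last ')'. */
  char *p = strrchr(buf, ')');
  if (p == NULL) {
    return -1;
  }
  /* Skip the separators in front of fields 3..22; p ends up on the space
   * that precedes field 22. */
  for (int field = 3; field <= 22; field++) {
    p = strchr(p + 1, ' ');
    if (p == NULL) {
      return -1;
    }
  }
  *start = strtoull(p + 1, NULL, 10);
  return 0;
}

/* Returns 1 if pid still refers to the process whose start time was
 * recorded earlier, 0 if it is gone or has been recycled. */
int is_same_process(pid_t pid, unsigned long long recorded_start) {
  unsigned long long now;
  if (read_start_time(pid, &now) != 0) {
    return 0;
  }
  return now == recorded_start;
}
```

A launcher would call `read_start_time` once after starting the container and store the value next to the pid; the signalling path would then call `is_same_process` before `kill`.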
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091701#comment-15091701 ]

Naganarasimha G R commented on YARN-4459:

+1 for the timing check solution rather than this approach!
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091628#comment-15091628 ]

Varun Vasudev commented on YARN-4459:

[~hex108] - assume that the process with the recycled pid does a setsid call - then the process group check will succeed and we might still end up killing the wrong process, no?

The assumptions of the patch are
1) The new process will not belong to the same user and
2) The new process has not called setsid

Correct?
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082158#comment-15082158 ]

Jun Gong commented on YARN-4459:

Thanks [~Naganarasimha] for the review, much appreciated!

{quote}
IIUC existing code checks whether container process has created any sub process then kill all the process, else if its a single process then i presume kill(-pid,0) will return -1 then it tries to kill only the container process id only. Can you confirm this by testing?
{quote}

I do not have a good way to add a test case for it at the moment, so I will try to explain it with the following test cases:

1. If the container's parent process does not exist but its child process does, the existing code works well.

{code}
root@bd74d89a2294:/$ cat test.sh
sleep &
PID_FOR_SLEEP=$!
PGID_FOR_SLEEP=$(ps -p $PID_FOR_SLEEP -o pgid=)
echo "PID for 'sleep ' : $PID_FOR_SLEEP, its pgid : $PGID_FOR_SLEEP"
root@bd74d89a2294:/$ ./test.sh
PID for 'sleep ' : 26877, its pgid : 26876
root@bd74d89a2294:/$ ps -ef | grep sleep
root     26877     1  0 08:48 pts/0    00:00:00 sleep
root     26880 26797  0 08:48 pts/0    00:00:00 grep sleep
root@bd74d89a2294:/$ kill -0 -26876
root@bd74d89a2294:/$ kill -15 -26876
root@bd74d89a2294:/$ ps -ef | grep sleep
root     26882 26797  0 08:48 pts/0    00:00:00 grep sleep
{code}

2. If neither the container's parent process nor its child process exists, the existing code will kill the wrong process.

{code}
root@bd74d89a2294:/$ cat test.sh
sleep 2 &
PID_FOR_SLEEP=$!
PGID_FOR_SLEEP=$(ps -p $PID_FOR_SLEEP -o pgid=)
echo "PID for 'sleep 2' : $PID_FOR_SLEEP, its pgid : $PGID_FOR_SLEEP"
root@bd74d89a2294:/$ ./test.sh
PID for 'sleep 2' : 26890, its pgid : 26889
root@bd74d89a2294:/$ ps -ef | grep sleep
root     26893 26797  0 08:56 pts/0    00:00:00 grep sleep
root@bd74d89a2294:/$ kill -0 26889
-bash: kill: (26889) - No such process
{code}

Now look at the existing code in container-executor.c for the second case:

{code}
  if (kill(-pid,0) < 0) {
    if (kill(pid, 0) < 0) {
      if (errno == ESRCH) {
        return INVALID_CONTAINER_PID;
      }
      fprintf(LOGFILE, "Error signalling container %d with %d - %s\n",
              pid, sig, strerror(errno));
      return -1;
    } else {
      has_group = 0;
    }
  }

  if (kill((has_group ? -1 : 1) * pid, sig) < 0) {
{code}

*kill(-pid,0)* will return -1. If *pid* has been reused for a new process (suppose the container has survived for a long time and pid recycling has occurred, so this pid is in use again), *kill(pid, 0)* will return 0, *has_group* will be set to 0, and *kill((has_group ? -1 : 1) * pid, sig)* will signal *pid* with *sig*. This is the problem.
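The behaviour argued for here, signalling the process group only and never falling back to the bare pid, can be sketched in C as follows. This is a minimal illustration and not the committed patch: `signal_group_only` and the `SIGNAL_*` codes are hypothetical names, and the guard against ids 0 and 1 is an extra safety assumption (`kill(-1, sig)` would address every process the caller may signal).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical return codes for this sketch. */
enum { SIGNAL_OK = 0, SIGNAL_NO_GROUP = 1, SIGNAL_ERROR = -1 };

/* Signal the process group pgid, and only the group. If the group is gone,
 * report that instead of falling back to kill(pgid, sig), because a bare
 * pid may already have been recycled by an unrelated process. */
int signal_group_only(pid_t pgid, int sig) {
  if (pgid <= 1) {
    errno = EINVAL; /* refuse ids that would address other process sets */
    return SIGNAL_ERROR;
  }
  /* Existence probe: signal 0 performs permission and existence checks
   * without delivering anything. */
  if (kill(-pgid, 0) < 0) {
    return errno == ESRCH ? SIGNAL_NO_GROUP : SIGNAL_ERROR;
  }
  if (kill(-pgid, sig) < 0) {
    /* The group can vanish between the probe and the real signal. */
    return errno == ESRCH ? SIGNAL_NO_GROUP : SIGNAL_ERROR;
  }
  return SIGNAL_OK;
}
```

With this shape, the second test case above ends in `SIGNAL_NO_GROUP` rather than a signal to whatever process now owns the recycled pid. A recycled pid whose new owner has itself become a group leader is still not covered, as discussed elsewhere in the thread.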
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081372#comment-15081372 ]

Naganarasimha G R commented on YARN-4459:

Hi [~hex108], thanks for working on this JIRA. I am not from a C background; nevertheless, I checked the API of kill and have a few doubts.

IIUC the existing code checks whether the container process has created any subprocesses and, if so, kills all of them; otherwise, if it is a single process, I presume {{kill(-pid,0)}} will return {{-1}} and the code then tries to kill only the container process id. Can you confirm this by testing?

I tested this with the unix command {{kill}}. What I could understand was that {{kill -0 -- -}} will be successful and {{$?}} will return *0*, but when I run {{kill -0 -- -}}, {{bash: kill: (-10967) - No such process}} is thrown. Correct me if my understanding is wrong.

cc/ @[~vvasudev].
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068996#comment-15068996 ]

Jun Gong commented on YARN-4459:

Thanks [~Naganarasimha] for the info. Yes, it seems to be the same problem. I think the problem is not limited to 'DelayedProcessKiller'; it might occur on every call to 'signal_container_as_user'.
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067999#comment-15067999 ]

Naganarasimha G R commented on YARN-4459:

Is this issue similar to YARN-3678?
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058211#comment-15058211 ]

Hadoop QA commented on YARN-4459:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 8m 16s | trunk passed |
| +1 | compile | 0m 31s | trunk passed with JDK v1.8.0_66 |
| +1 | compile | 0m 30s | trunk passed with JDK v1.7.0_91 |
| +1 | mvnsite | 0m 31s | trunk passed |
| +1 | mvneclipse | 0m 13s | trunk passed |
| +1 | mvninstall | 0m 28s | the patch passed |
| +1 | compile | 0m 26s | the patch passed with JDK v1.8.0_66 |
| +1 | cc | 0m 26s | the patch passed |
| +1 | javac | 0m 26s | the patch passed |
| +1 | compile | 0m 30s | the patch passed with JDK v1.7.0_91 |
| +1 | cc | 0m 30s | the patch passed |
| +1 | javac | 0m 30s | the patch passed |
| +1 | mvnsite | 0m 30s | the patch passed |
| +1 | mvneclipse | 0m 13s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| -1 | unit | 8m 45s | hadoop-yarn-server-nodemanager in the patch failed with JDK v1.8.0_66. |
| +1 | unit | 9m 13s | hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_91. |
| +1 | asflicense | 0m 24s | Patch does not generate ASF License warnings. |
| | | 30m 43s | |

|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/1254/YARN-4459.02.patch |
| JIRA Issue | YARN-4459 |
| Optional Tests | asflicense compile cc mvnsite javac unit |
| uname | Linux 72717a2b4b82 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0c3a53e |
| unit | https://builds.apache.org/job/PreCommit-YARN-Build/9983/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt |
| unit test logs | https://builds.apache.org/job/PreCommit-YARN-Build/9983/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager-jdk1.8.0_66.txt |
| JDK v1.7.0_91 Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9983/testReport/ |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058163#comment-15058163 ]
Jun Gong commented on YARN-4459:
Uploaded a new patch to fix the whitespace error.
> container-executor might kill process wrongly
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Reporter: Jun Gong
> Assignee: Jun Gong
> Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When 'signal_container_as_user' is called in container-executor, it first
> checks whether the process group exists; if it does not, it kills the process
> itself (if the process exists). This is not reasonable: a missing process
> group means the corresponding container has already finished, so killing the
> process itself just kills the wrong process.
> We have seen this happen in our cluster many times. We used the same account
> for starting the NM and for submitting apps, and container-executor sometimes
> killed the NM (the wrongly killed process might be a newly started thread
> that was a child process of the NM).
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
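For readers following the thread, the behaviour the issue description asks for can be sketched roughly as below. This is only an illustration and not the attached patch: container-executor is written in C, Python is used here as a stand-in, and the names `process_group_exists` and `signal_container` are hypothetical. The sketch assumes a POSIX system. The point it shows is that when the process group is gone, the caller reports that and stops, instead of falling back to signalling the bare PID, which may by then belong to an unrelated process.

```python
import os
import signal


def process_group_exists(pgid: int) -> bool:
    """Return True if process group `pgid` still has at least one member."""
    try:
        os.killpg(pgid, 0)  # signal 0 only checks existence and permission
    except ProcessLookupError:
        return False        # ESRCH: no such process group
    except PermissionError:
        return True         # EPERM: the group exists but is owned by someone else
    return True


def signal_container(pgid: int, sig: int) -> bool:
    """Signal a container's process group (hypothetical helper).

    Returns True if the signal was sent and False if the group no longer
    exists. There is deliberately no fallback to os.kill(pid, sig).
    """
    if pgid <= 1:
        # 0 would mean "my own process group" and 1 is init's; refuse both.
        raise ValueError("refusing to signal process group %d" % pgid)
    if not process_group_exists(pgid):
        # The container has already finished; do nothing.
        return False
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        # The group exited between the check and the signal.
        return False
    return True
```

A caller would treat a `False` result as "container already gone" rather than as an error, for example `signal_container(pgid, signal.SIGTERM)`. The existence check and the signal are still two separate calls, so this does not remove every race; it only removes the fallback to a single PID.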
[jira] [Commented] (YARN-4459) container-executor might kill process wrongly
[ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058099#comment-15058099 ]
Hadoop QA commented on YARN-4459:
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s {color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 18s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 34s {color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s {color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s {color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} cc {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s {color} | {color:red} The patch has 1 line(s) with tabs. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 58s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 18s {color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s {color} | {color:green} Patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 32m 22s {color} | {color:black} {color} |
\\ \\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/1242/YARN-4459.01.patch |
| JIRA Issue | YARN-4459 |
| Optional Tests | asflicense compile cc mvnsite javac unit |
| uname | Linux d353811253b1 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0c3a53e |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/9981/artifact/patchprocess/whitespace-tabs.txt |
| JDK v1.7.0_91 Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/9981/testReport/ |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Max memory used | 76MB |
| Powered by | Apache Yetus 0.1.0 http://yetus.apache.org |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/9981/console |
This message was automatically generated.
> container-executor might kill process wrongly
>
> Key: YARN-4459
> URL: https://issues.apache.org/jira/browse/YARN-4459
>