[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675542#comment-16675542 ]

Sunil Govindan commented on YARN-6091:
--------------------------------------

Removing the Fixed Versions here, since YARN-7654 contains this patch and was committed to the branches that match the Fixed Version field there. Please let me know if there are any issues. Thanks.

> the AppMaster register failed when use Docker on LinuxContainer
> ----------------------------------------------------------------
>
> Key: YARN-6091
> URL: https://issues.apache.org/jira/browse/YARN-6091
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, yarn
> Affects Versions: 2.8.1
> Environment: CentOS
> Reporter: zhengchenyu
> Assignee: Eric Badger
> Priority: Critical
> Labels: Docker
> Attachments: YARN-6091.001.patch, YARN-6091.002.patch
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> On some servers, when I use Docker on LinuxContainer, the AppMaster fails to
> register with the ResourceManager. This does not happen on other servers.
> I found that pclose (in container-executor.c) returns different values on
> different servers, even though the process launched by popen is running
> normally. Some servers return 0, and others return 13.
> Because YARN treats the application as failed when pclose returns nonzero,
> it removes the AMRMToken, and the AppMaster then fails to register because
> the ResourceManager has already removed this application's token.
> In container-executor.c, the check is whether the return code is zero. But
> according to the pclose man page, it is a return value of -1 that indicates
> an error. So I changed the check, which solved this problem.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
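The 0 and 13 values reported in the issue description above are wait statuses rather than plain exit codes, so they are easier to reason about once decoded. The following is a minimal sketch, not code from container-executor.c or from either attached patch; the helper names `decode_wait_status` and `run_and_decode` are hypothetical, and the 128 + signal convention is borrowed from the shell purely for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical helper: turn a raw wait status into a shell-style code.
 * Returns the exit code (0-255) for a normal exit, 128 + signal number
 * when the child was killed by a signal, and -1 for anything else. */
int decode_wait_status(int status) {
  if (status == -1) {
    return -1;                       /* pclose() itself reported an error */
  }
  if (WIFEXITED(status)) {
    return WEXITSTATUS(status);      /* child called exit() or returned */
  }
  if (WIFSIGNALED(status)) {
    return 128 + WTERMSIG(status);   /* child was terminated by a signal */
  }
  return -1;
}

/* Hypothetical helper: run cmd through popen(), drain its output so the
 * child never writes into a closed pipe, then decode pclose()'s status. */
int run_and_decode(const char *cmd) {
  char buf[256];
  FILE *fp;

  assert(cmd != NULL);
  fp = popen(cmd, "r");
  if (fp == NULL) {
    return -1;
  }
  while (fgets(buf, sizeof buf, fp) != NULL) {
    /* discard the output */
  }
  return decode_wait_status(pclose(fp));
}
```

On Linux, a raw status of 13 decodes as "terminated by signal 13", which is SIGPIPE, rather than "exited with code 13". A check for `status != 0` and a check for `status == -1` both miss that distinction.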
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636996#comment-16636996 ]

Eric Badger commented on YARN-6091:
-----------------------------------

Given that Docker support is pretty minimal on branch-2.x, I'm inclined to close this as Won't Fix and link YARN-7654.
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16636178#comment-16636178 ] Jim Brennan commented on YARN-6091: --- [~jlowe], [~ebadger] this does indeed look like it was fixed by [YARN-7654] for 3.1/trunk. It is still a problem in branch-2.x though.
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466774#comment-16466774 ]

Junping Du commented on YARN-6091:
----------------------------------

Moving to 2.8.5, as 2.8.4 is in the RC stage.
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428161#comment-16428161 ]

genericqa commented on YARN-6091:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 5s | YARN-6091 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. |

|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-6091 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12884361/YARN-6091.002.patch |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/20252/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428154#comment-16428154 ]

Junping Du commented on YARN-6091:
----------------------------------

Moving to 2.8.4, as 2.8.3 was released last year.
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146701#comment-16146701 ]

Junping Du commented on YARN-6091:
----------------------------------

Thanks [~ebadger] for updating the patch. It seems to have been pending review for a while. [~jlowe] and [~vvasudev], can you give it a quick look? By the way, I don't think this is a regression in 2.8, so I'm moving it to 2.8.3 to unblock the 2.8.2 release.
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16146345#comment-16146345 ]

Hadoop QA commented on YARN-6091:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 27s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 15m 24s | trunk passed |
| +1 | compile | 0m 47s | trunk passed |
| +1 | mvnsite | 0m 31s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 26s | the patch passed |
| +1 | compile | 0m 45s | the patch passed |
| +1 | cc | 0m 45s | the patch passed |
| +1 | javac | 0m 45s | the patch passed |
| +1 | mvnsite | 0m 28s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
|| || || || Other Tests ||
| +1 | unit | 14m 34s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | asflicense | 0m 15s | The patch does not generate ASF License warnings. |
| | | 34m 1s | |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:14b5c93 |
| JIRA Issue | YARN-6091 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12884361/YARN-6091.002.patch |
| Optional Tests | asflicense compile cc mvnsite javac unit |
| uname | Linux 124e5c29ae1d 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / f59332b |
| Default Java | 1.8.0_144 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/17193/testReport/ |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/17193/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143264#comment-16143264 ]

Hadoop QA commented on YARN-6091:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 5s | YARN-6091 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. |

|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-6091 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12862495/YARN-6091.001.patch |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/17145/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143262#comment-16143262 ]

Junping Du commented on YARN-6091:
----------------------------------

bq. This will end up having stderr print to stderr of the NM, but I think that is better than silently swallowing stderr altogether.

This also has a side effect: you have to search the NM's stderr to find the error/debug messages that belong to a container/AM. [~suma.shivaprasad] is working on some enhancements in this area. Suma, any comments here?
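The trade-off in the comment above follows from how popen() wires up the child: in "r" mode only the child's stdout is connected to the pipe, while its stderr is inherited from the calling process. The sketch below illustrates that behaviour in isolation. It is not taken from either patch on this issue, and the helper names `count_pipe_bytes` and `count_pipe_bytes_with_stderr` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: run cmd through popen() and count the bytes that
 * arrive on the pipe. Returns -1 if the command could not be started. */
long count_pipe_bytes(const char *cmd) {
  char buf[256];
  long total = 0;
  size_t n;
  FILE *fp;

  assert(cmd != NULL);
  fp = popen(cmd, "r");
  if (fp == NULL) {
    return -1;
  }
  while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
    total += (long)n;                /* only the child's stdout lands here */
  }
  pclose(fp);
  return total;
}

/* Hypothetical helper: same as above, but wrap cmd in a shell group whose
 * stderr is merged into stdout, so both streams arrive on the pipe. */
long count_pipe_bytes_with_stderr(const char *cmd) {
  char merged[1024];
  int len;

  assert(cmd != NULL);
  len = snprintf(merged, sizeof merged, "{ %s ; } 2>&1", cmd);
  if (len < 0 || (size_t)len >= sizeof merged) {
    return -1;                       /* command too long for the buffer */
  }
  return count_pipe_bytes(merged);
}
```

With a command that writes only to stderr, such as `echo oops 1>&2`, the first helper sees nothing on the pipe and the text shows up on the caller's own stderr, which is the situation described for the NM. The second helper captures it, at the cost of mixing the two streams.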
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998755#comment-15998755 ] Junping Du commented on YARN-6091: -- CC [~sidharta-s].
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998754#comment-15998754 ]

Junping Du commented on YARN-6091:
----------------------------------

Thanks [~zhengchenyu] for reporting the issue. Can you verify that Eric's patch (which sounds like the right approach to me) also works in your test environment?
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15961043#comment-15961043 ]

Hadoop QA commented on YARN-6091:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 24s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 13m 20s | trunk passed |
| +1 | compile | 0m 28s | trunk passed |
| +1 | mvnsite | 0m 26s | trunk passed |
| +1 | mvneclipse | 0m 14s | trunk passed |
| +1 | mvninstall | 0m 23s | the patch passed |
| +1 | compile | 0m 24s | the patch passed |
| +1 | cc | 0m 24s | the patch passed |
| +1 | javac | 0m 24s | the patch passed |
| +1 | mvnsite | 0m 23s | the patch passed |
| +1 | mvneclipse | 0m 11s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | unit | 12m 59s | hadoop-yarn-server-nodemanager in the patch passed. |
| +1 | asflicense | 0m 18s | The patch does not generate ASF License warnings. |
| | | 29m 53s | |

|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:a9ad5d6 |
| JIRA Issue | YARN-6091 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12862495/YARN-6091.001.patch |
| Optional Tests | asflicense compile cc mvnsite javac unit |
| uname | Linux bef3f2f98758 3.13.0-106-generic #153-Ubuntu SMP Tue Dec 6 15:44:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ad24464 |
| Default Java | 1.8.0_121 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/15558/testReport/ |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/15558/console |
| Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958854#comment-15958854 ]

Eric Badger commented on YARN-6091:

bq. I just changed the judging condition of pclose's return code, then the problem is solved.

That doesn't really solve the problem though. pclose() returns the status reported by wait4(), which in turn reflects how the underlying executed process terminated. If this is something other than -1, but also not 0, then we will completely ignore the fact that the command failed. That is exactly what is happening here.

This SIGPIPE happens to be a fairly benign error, since we don't much care that the underlying process was trying to write to an already closed pipe, but just changing the if condition ignores the fact that the status was 13 instead of 0. It would be much better to fix the reason that the SIGPIPE is being raised in the first place. We can do this by redirecting stdout to /dev/null.

> the AppMaster register failed when use Docker on LinuxContainer
>
> Key: YARN-6091
> URL: https://issues.apache.org/jira/browse/YARN-6091
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, yarn
> Affects Versions: 2.8.1
> Environment: CentOS
> Reporter: zhengchenyu
> Priority: Critical
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> On some servers, when I use Docker on LinuxContainer, the AppMaster fails to
> register with the ResourceManager. This does not happen on other servers.
> I found that pclose (in container-executor.c) returns different values on
> different servers, even though the process launched by popen is running
> normally. Some servers return 0, and others return 13.
> Because YARN regards the application as failed when pclose returns
> nonzero, YARN removes the AMRMToken, and the AppMaster registration then
> fails because the ResourceManager has removed this application's token.
> In container-executor.c, the check is whether the return code is zero.
> But according to the pclose man page, it is a return value of -1 that
> indicates an error. So I changed the check, which solved the problem.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
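
The distinction drawn in the comment above, between a status of -1, a status of 0, and everything else, can be made concrete by decoding the value that pclose() returns. The following is a minimal sketch, not code from container-executor.c or from the attached patches; the names `cmd_result`, `classify_pclose_status`, and `run_command_status` are hypothetical and exist only for illustration.

```c
#define _POSIX_C_SOURCE 200809L /* declares popen/pclose under strict -std modes */
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical classification of a pclose() result. */
typedef enum {
  CMD_OK,           /* command exited with code 0 */
  CMD_WAIT_ERROR,   /* pclose() itself failed and returned -1 */
  CMD_EXIT_NONZERO, /* command exited with a nonzero code */
  CMD_SIGNALED      /* command was terminated by a signal */
} cmd_result;

/* Decode the wait status returned by pclose(). Checking only for -1 would
 * treat CMD_EXIT_NONZERO and CMD_SIGNALED as success. */
cmd_result classify_pclose_status(int status) {
  if (status == -1) {
    return CMD_WAIT_ERROR;
  }
  if (WIFSIGNALED(status)) {
    return CMD_SIGNALED;
  }
  if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
    return CMD_OK;
  }
  return CMD_EXIT_NONZERO;
}

/* Hypothetical helper: run cmd through popen() and return the raw status
 * from pclose(), or -1 if the command could not be started. */
int run_command_status(const char *cmd) {
  FILE *stream = popen(cmd, "r");
  if (stream == NULL) {
    return -1;
  }
  return pclose(stream);
}
```

With this decoding, a raw status of 13 on Linux classifies as `CMD_SIGNALED` (the low bits carry the signal number), while a command such as `exit 3` yields a status that is neither -1 nor 0 and classifies as `CMD_EXIT_NONZERO`. Both cases would slip through a check that only compares against -1.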
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15958176#comment-15958176 ]

zhengchenyu commented on YARN-6091:

I just changed the judging condition of pclose's return code, then the problem is solved.
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15957837#comment-15957837 ]

Eric Badger commented on YARN-6091:

I think the issue is that the container-executor is calling pclose() before the command is finished. According to http://man7.org/linux/man-pages/man3/pclose.3p.html pclose() will close the stream before waiting for the command executed in popen() to terminate. So if the command that was executed in popen() is still running when we call pclose() and is trying to write to stdout, the command will receive a SIGPIPE (signal number 13). Since this depends on how long it takes to execute the command, it is non-deterministic between nodes and won't always produce the SIGPIPE.

I believe a fix would be to redirect stdout to /dev/null on the instances of popen() where we do not read the stdout of the command. I already tested locally that the stderr of the underlying command from popen() will be passed to the stderr of the process running popen(). This will end up having stderr print to the stderr of the NM, but I think that is better than silently swallowing stderr altogether.

I'll test out this approach locally and then put up a patch once I'm done.
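
The race described in the comment above can be reproduced in isolation. The following is a minimal sketch, not the code in container-executor.c or in the eventual patch; the helper names `run_without_reading` and `killed_by_sigpipe` are hypothetical, and the command `head -c 4194304 /dev/zero` is an assumption chosen only because it writes more output than a pipe buffer normally holds, so the writer is still busy when the read end is closed.

```c
#define _POSIX_C_SOURCE 200809L /* declares popen/pclose under strict -std modes */
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>

/* Hypothetical helper: start cmd with popen() and call pclose() right away,
 * without reading any of its output. Returns the raw pclose() status. */
int run_without_reading(const char *cmd) {
  /* An ignored SIGPIPE is inherited across exec, so reset it to the default
   * here to make sure the child is killed by the signal instead of seeing
   * EPIPE from write(). */
  signal(SIGPIPE, SIG_DFL);
  FILE *stream = popen(cmd, "r");
  if (stream == NULL) {
    return -1;
  }
  /* pclose() closes our end of the pipe first and only then waits. */
  return pclose(stream);
}

/* Returns 1 if the status says the command was stopped by SIGPIPE, either
 * directly (raw status 13 on Linux) or through a shell that forked the
 * command and reports 128 + SIGPIPE as its own exit code. */
int killed_by_sigpipe(int status) {
  if (status == -1) {
    return 0;
  }
  if (WIFSIGNALED(status)) {
    return WTERMSIG(status) == SIGPIPE;
  }
  return WIFEXITED(status) && WEXITSTATUS(status) == 128 + SIGPIPE;
}
```

Under these assumptions, `run_without_reading("head -c 4194304 /dev/zero")` should report a SIGPIPE termination, while the same command with `> /dev/null` appended should return 0, because the command no longer writes to the pipe that pclose() closes. Its stderr is not redirected, so any error output would still reach the stderr of the calling process, which matches the behavior described in the comment.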
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15957094#comment-15957094 ]

Eric Badger commented on YARN-6091:

[~zhengchenyu], do you have logs for this issue or are you able to reproduce it? Also, do you know which pclose failed inside of container-executor.c?
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822284#comment-15822284 ]

Junping Du commented on YARN-6091:

Moving to 2.8.1, given that 2.8 is in the RC process.

> the AppMaster register failed when use Docker on LinuxContainer
>
> Key: YARN-6091
> URL: https://issues.apache.org/jira/browse/YARN-6091
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, yarn
> Affects Versions: 2.8.0
> Environment: CentOS
> Reporter: zhengchenyu
> Priority: Critical
> Original Estimate: 336h
> Remaining Estimate: 336h
[jira] [Commented] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer
[ https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821472#comment-15821472 ]

zhengchenyu commented on YARN-6091:

Here is the server where the application runs normally:
{code}
[right server]# cat /etc/redhat-release
CentOS Linux release 7.1.1503 (Core)
[right server]# cat /proc/version
Linux version 3.10.0-229.20.1.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Tue Nov 3 19:10:07 UTC 2015
[right server]# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC)
[right server]# rpm -qa | grep glibc
glibc-devel-2.17-106.el7_2.8.x86_64
glibc-2.17-106.el7_2.8.x86_64
glibc-headers-2.17-106.el7_2.8.x86_64
glibc-common-2.17-106.el7_2.8.x86_64
{code}

Here is the server where the application runs abnormally:
{code}
[wrong server]# cat /etc/redhat-release
CentOS Linux release 7.1.1503 (Core)
[wrong server]# cat /proc/version
Linux version 3.10.0-229.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) ) #1 SMP Fri Mar 6 11:36:42 UTC 2015
[wrong server]# gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC)
[wrong server]# rpm -qa | grep glibc
glibc-common-2.17-106.el7_2.6.x86_64
glibc-devel-2.17-106.el7_2.6.x86_64
glibc-2.17-106.el7_2.6.x86_64
glibc-headers-2.17-106.el7_2.6.x86_64
glibc-2.17-106.el7_2.6.i686
{code}

> the AppMaster register failed when use Docker on LinuxContainer
>
> Key: YARN-6091
> URL: https://issues.apache.org/jira/browse/YARN-6091
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager, yarn
> Affects Versions: 2.8.0
> Environment: CentOS
> Reporter: zhengchenyu
> Priority: Critical
> Fix For: 2.8.0
> Original Estimate: 336h
> Remaining Estimate: 336h