[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369730#comment-15369730 ]

Hudson commented on YARN-3995:
------------------------------

SUCCESS: Integrated in Hadoop-trunk-Commit #10074 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/10074/])
YARN-3995. Some of the NM events are not getting published due race (sjlee: rev cc16683cefe2611cf4de7819496aa54854f5394c)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/collector/TestPerNodeTimelineCollectorsAuxService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/PerNodeTimelineCollectorsAuxService.java

> Some of the NM events are not getting published due race condition when AM container finishes in NM
> ----------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3995
>                 URL: https://issues.apache.org/jira/browse/YARN-3995
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>              Labels: yarn-2928-1st-milestone
>             Fix For: YARN-2928
>
>         Attachments: YARN-3995-feature-YARN-2928.v1.001.patch,
>                      YARN-3995-feature-YARN-2928.v1.002.patch,
>                      YARN-3995-feature-YARN-2928.v1.003.patch
>
>
> As discussed in YARN-3045: While testing in TestDistributedShell found out
> that few of the container metrics events were failing as there will be race
> condition. When the AM container finishes and removes the collector for the
> app, still there is possibility that all the events published for the app by
> the current NM and other NM are still in pipeline,

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093086#comment-15093086 ]

Naganarasimha G R commented on YARN-3995:
-----------------------------------------

In line with this point, I had also changed the stop order in the latest patch, so that we wait for the executor service first and then stop the collector manager too.
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093087#comment-15093087 ]

Naganarasimha G R commented on YARN-3995:
-----------------------------------------

Thanks for the commit and review, [~sjlee0] & [~varun_saxena].
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092386#comment-15092386 ]

Sangjin Lee commented on YARN-3995:
-----------------------------------

I'm +1 on the latest patch too. I'll commit it shortly.
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092323#comment-15092323 ]

Varun Saxena commented on YARN-3995:
------------------------------------

That makes sense. I am +1 on the patch.
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092274#comment-15092274 ]

Sangjin Lee commented on YARN-3995:
-----------------------------------

I also thought about shutdownNow. The only reason I (slightly) preferred shutdown and waiting for that linger period is to give the "last" app a fair chance of being able to linger for the same amount of time. I think it is still a relatively minor point, though.
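
The difference between the two calls discussed here can be seen in a small standalone sketch. It is not taken from the patch: the class and method names are hypothetical, and the empty and flag-setting tasks stand in for the real collector removal. It relies only on documented `java.util.concurrent` behaviour: `shutdownNow()` returns the tasks that never started, while after `shutdown()` a `ScheduledExecutorService` created by `Executors` still runs already scheduled delayed tasks by default.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class ShutdownComparison {
  /** Schedules one delayed removal, calls shutdownNow(), returns how many tasks were abandoned. */
  static int abandonedByShutdownNow(long lingerMs) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    Runnable removal = new Runnable() {
      @Override
      public void run() {
        // Hypothetical stand-in for removing the collector of the "last" app.
      }
    };
    scheduler.schedule(removal, lingerMs, TimeUnit.MILLISECONDS);
    // shutdownNow() drains the queue: the pending removal never runs.
    List<Runnable> neverRun = scheduler.shutdownNow();
    return neverRun.size();
  }

  /** Schedules one delayed removal, calls shutdown(), returns whether it still ran. */
  static boolean ranAfterShutdown(long lingerMs) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    final AtomicBoolean ran = new AtomicBoolean(false);
    Runnable removal = new Runnable() {
      @Override
      public void run() {
        ran.set(true);
      }
    };
    scheduler.schedule(removal, lingerMs, TimeUnit.MILLISECONDS);
    // shutdown() rejects new work but keeps the already scheduled delayed task.
    scheduler.shutdown();
    try {
      scheduler.awaitTermination(lingerMs + 5000, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return ran.get();
  }

  public static void main(String[] args) {
    System.out.println("abandoned by shutdownNow: " + abandonedByShutdownNow(10000));
    System.out.println("ran after shutdown: " + ranAfterShutdown(100));
  }
}
```

With `shutdown()` the last app keeps its full linger period at the cost of a slower stop; with `shutdownNow()` the stop is immediate and the pending removal is left to whatever cleanup follows.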
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092256#comment-15092256 ]

Varun Saxena commented on YARN-3995:
------------------------------------

Should we use shutdownNow instead? We are just removing an entry from a map. And on stop, the collectors would be stopped in the collector manager's stop anyway.
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090452#comment-15090452 ]

Naganarasimha G R commented on YARN-3995:
-----------------------------------------

The checkstyle issue in the latest patch is about the size of {{YarnConfiguration.java}}, which is not directly related to my modifications.
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090448#comment-15090448 ]

Hadoop QA commented on YARN-3995:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 8m 28s | feature-YARN-2928 passed |
| +1 | compile | 2m 3s | feature-YARN-2928 passed with JDK v1.8.0_66 |
| +1 | compile | 2m 19s | feature-YARN-2928 passed with JDK v1.7.0_91 |
| +1 | checkstyle | 0m 33s | feature-YARN-2928 passed |
| +1 | mvnsite | 1m 34s | feature-YARN-2928 passed |
| +1 | mvneclipse | 0m 43s | feature-YARN-2928 passed |
| +1 | findbugs | 3m 55s | feature-YARN-2928 passed |
| +1 | javadoc | 1m 34s | feature-YARN-2928 passed with JDK v1.8.0_66 |
| +1 | javadoc | 4m 1s | feature-YARN-2928 passed with JDK v1.7.0_91 |
| +1 | mvninstall | 1m 19s | the patch passed |
| +1 | compile | 1m 57s | the patch passed with JDK v1.8.0_66 |
| +1 | javac | 1m 57s | the patch passed |
| +1 | compile | 2m 19s | the patch passed with JDK v1.7.0_91 |
| +1 | javac | 2m 19s | the patch passed |
| -1 | checkstyle | 0m 32s | Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 216, now 215). |
| +1 | mvnsite | 1m 29s | the patch passed |
| +1 | mvneclipse | 0m 34s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | xml | 0m 1s | The patch has no ill-formed XML file. |
| +1 | findbugs | 4m 17s | the patch passed |
| +1 | javadoc | 1m 31s | the patch passed with JDK v1.8.0_66 |
| +1 | javadoc | 3m 50s | the patch passed with JDK v1.7.0_91 |
| +1 | unit | 0m 20s | hadoop-yarn-api in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 1m 51s | hadoop-yarn-common in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 1m 50s | hadoop-yarn-common in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 0m 24s | hadoop-yarn-api in the patch passed with JDK v1.7.0_91. |
| +1 | unit | 2m 5s | hadoop-yarn-common in the patch passed with JDK v1.7.0_91. |
| +1 | unit | 2m 6s |
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090213#comment-15090213 ]

Sangjin Lee commented on YARN-3995:
-----------------------------------

It looks pretty good. Thanks for updating the patch [~Naganarasimha].

One last minor point is the shutdown() call in serviceStop(). It might be a good idea to wait a little bit to ensure the executor service is shut down. It's virtually certain all outstanding tasks will finish within the linger period, so something like the following is slightly more helpful:

{code}
scheduler.shutdown();
if (!scheduler.awaitTermination(collectorLingerPeriod, TimeUnit.MILLISECONDS)) {
  LOG.warn(...);
}
...
{code}

Also, are some of the checkstyle warnings fixable?
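
As a rough illustration of the stop sequence suggested above, the following standalone sketch wraps `shutdown()` and `awaitTermination()` in a helper and checks that a removal scheduled before the stop still runs. The class name, method names, and the use of `System.err` in place of the real logger are assumptions made for the example; this is not the code from the patch.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class GracefulStopDemo {
  /**
   * Stops accepting new work, then waits up to lingerMs for already
   * scheduled removals to finish. Returns true on clean termination.
   */
  static boolean stopScheduler(ScheduledExecutorService scheduler, long lingerMs) {
    scheduler.shutdown();
    try {
      if (!scheduler.awaitTermination(lingerMs, TimeUnit.MILLISECONDS)) {
        // Stand-in for LOG.warn(...) in the suggestion above.
        System.err.println("Scheduler did not terminate within " + lingerMs + " ms");
        return false;
      }
      return true;
    } catch (InterruptedException e) {
      // Preserve the interrupt for the caller instead of swallowing it.
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Schedules one delayed removal, stops the scheduler, reports whether the removal ran. */
  static boolean pendingRemovalRuns(long delayMs, long lingerMs) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    final AtomicBoolean removed = new AtomicBoolean(false);
    scheduler.schedule(new Runnable() {
      @Override
      public void run() {
        removed.set(true); // hypothetical stand-in for the collector removal
      }
    }, delayMs, TimeUnit.MILLISECONDS);
    boolean clean = stopScheduler(scheduler, lingerMs);
    return clean && removed.get();
  }

  public static void main(String[] args) {
    System.out.println("pending removal ran: " + pendingRemovalRuns(100, 5000));
  }
}
```

Waiting on the executor first means any other component that the removals depend on can be stopped afterwards without racing against a removal that is still queued.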
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089719#comment-15089719 ]

Hadoop QA commented on YARN-3995:
---------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 9m 56s | feature-YARN-2928 passed |
| +1 | compile | 1m 50s | feature-YARN-2928 passed with JDK v1.8.0_66 |
| +1 | compile | 2m 14s | feature-YARN-2928 passed with JDK v1.7.0_91 |
| +1 | checkstyle | 0m 34s | feature-YARN-2928 passed |
| +1 | mvnsite | 1m 32s | feature-YARN-2928 passed |
| +1 | mvneclipse | 0m 43s | feature-YARN-2928 passed |
| +1 | findbugs | 3m 55s | feature-YARN-2928 passed |
| +1 | javadoc | 1m 31s | feature-YARN-2928 passed with JDK v1.8.0_66 |
| +1 | javadoc | 3m 44s | feature-YARN-2928 passed with JDK v1.7.0_91 |
| +1 | mvninstall | 1m 14s | the patch passed |
| +1 | compile | 1m 46s | the patch passed with JDK v1.8.0_66 |
| +1 | javac | 1m 46s | the patch passed |
| +1 | compile | 2m 11s | the patch passed with JDK v1.7.0_91 |
| +1 | javac | 2m 11s | the patch passed |
| -1 | checkstyle | 0m 32s | Patch generated 3 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 216, now 217). |
| +1 | mvnsite | 1m 24s | the patch passed |
| +1 | mvneclipse | 0m 32s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | xml | 0m 0s | The patch has no ill-formed XML file. |
| +1 | findbugs | 4m 8s | the patch passed |
| +1 | javadoc | 1m 22s | the patch passed with JDK v1.8.0_66 |
| +1 | javadoc | 3m 40s | the patch passed with JDK v1.7.0_91 |
| +1 | unit | 0m 20s | hadoop-yarn-api in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 1m 46s | hadoop-yarn-common in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 1m 46s | hadoop-yarn-common in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 0m 22s | hadoop-yarn-api in the patch passed with JDK v1.7.0_91. |
| +1 | unit | 2m 2s | hadoop-yarn-common in the patch passed with JDK v1.7.0_91. |
| +1 | unit | 2m 2s |
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088636#comment-15088636 ]

Naganarasimha G R commented on YARN-3995:
-----------------------------------------

Oops, my mistake. I had assumed the interface wrongly. It is similar to the timer service, wherein we can say when the task is to be executed. Got it, will correct it!
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088549#comment-15088549 ]

Sangjin Lee commented on YARN-3995:
-----------------------------------

bq. Yes i wanted to address it as i was trying to point out earlier Instead of spawning multiple threads may be we can have single thread which does this activity

Oops, sorry. I didn't see you already mentioned this.

{quote}
IIUC the approach you mentioned in the callable we will be sleeping for the configured period for a application and then remove it. but if multiple apps at the same time finish then initial apps only wait for configured period but subsequent apps wait for lil more time than the earlier ones.(app's wait period + other apps wait period in the queue ) thoughts?
{quote}

ScheduledExecutorService is much more straightforward than that. We can simply take advantage of the scheduling feature. The Runnable (or Callable, doesn't matter) can simply execute removeApplication():

{code}
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
...
public void stopContainer(ContainerTerminationContext context) {
  ...
  scheduler.schedule(new Runnable() {
    public void run() {
      removeApplicationId(appId);
    }
  }, collectorLingerPeriod, TimeUnit.MILLISECONDS);
}
{code}

It doesn't do this by actually putting the executor service thread to sleep for that period, thus there is no worry about delays propagating to the next work item. The delay management is all done using the internal queue that understands the delays.
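
The point that delays do not accumulate on a single scheduler thread can be checked with a small standalone sketch. The class and method names here are hypothetical, and each task only records its elapsed time where the real code would remove the application's collector. If the thread slept once per task, three apps finishing together with a 300 ms linger period would be handled at roughly 300, 600 and 900 ms; with `schedule()` they all fire at roughly 300 ms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class DelayedRemovalDemo {
  /**
   * Schedules {@code apps} removals at (nearly) the same instant on a
   * single-threaded scheduler and returns how long each one actually
   * waited, in milliseconds.
   */
  static List<Long> observedDelays(int apps, long lingerMs) {
    ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();
    final List<Long> delays =
        Collections.synchronizedList(new ArrayList<Long>());
    final CountDownLatch done = new CountDownLatch(apps);
    final long start = System.nanoTime();
    for (int i = 0; i < apps; i++) {
      // Stand-in for the collector removal: just record the elapsed time.
      scheduler.schedule(new Runnable() {
        @Override
        public void run() {
          delays.add(TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
          done.countDown();
        }
      }, lingerMs, TimeUnit.MILLISECONDS);
    }
    try {
      // Generous upper bound so the demo cannot hang.
      done.await(lingerMs + 5000, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      scheduler.shutdownNow();
    }
    return new ArrayList<Long>(delays);
  }

  public static void main(String[] args) {
    // Three apps finishing together with a 300 ms linger period.
    System.out.println(observedDelays(3, 300));
  }
}
```

Each recorded delay is at least the linger period, and the spread between the first and last removal stays far below one linger period because the scheduler thread blocks on its delay queue rather than sleeping per task.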
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088490#comment-15088490 ]

Naganarasimha G R commented on YARN-3995:
-----------------------------------------

bq. Instead of spawning multiple threads may be we can have single thread which does this activity ?

Yes, I wanted to address it, as I was trying to point out earlier: ??Instead of spawning multiple threads may be we can have single thread which does this activity??

bq. How about creating a long-lived single ScheduledExecutorService and schedule removeApplication() with the specified delay?

IIUC, in the approach you mentioned, the callable will sleep for the configured period for an application and then remove it. But if multiple apps finish at the same time, only the initial apps wait for the configured period; subsequent apps wait a little longer than the earlier ones (the app's own wait period plus the wait periods of the other apps ahead of it in the queue). Thoughts?

Some approaches I can adopt to avoid the above issue are:
* Record in the callable the timestamp at which *close AM container* was called, and have the callable wait only if the elapsed time < configured linger time.
* Have a map and a single thread (either an executor service or a timer task) with a lower interval, like 500ms, which checks this map and removes all the apps whose elapsed time is > configured linger time.

Thoughts?
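
For comparison, the second alternative listed above (a map of finish timestamps checked by a single periodic thread) could look roughly like the sketch below. It is only an illustration of that idea: the class `LingerSweeper`, its method names, and the explicit `nowMs` parameter are all assumptions made so the logic can be exercised without real timers, and the removal of the collector itself is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class LingerSweeper {
  // appId -> time (ms) at which the AM container finished.
  private final Map<String, Long> finishedAtMs =
      new ConcurrentHashMap<String, Long>();
  private final long lingerMs;

  LingerSweeper(long lingerMs) {
    this.lingerMs = lingerMs;
  }

  /** Records that the app's AM container finished at nowMs. */
  void markFinished(String appId, long nowMs) {
    finishedAtMs.put(appId, nowMs);
  }

  /**
   * Removes every app whose elapsed time exceeds the linger period and
   * returns the removed ids in sorted order. A periodic task would call
   * this with the current time.
   */
  List<String> sweep(long nowMs) {
    List<String> removed = new ArrayList<String>();
    Iterator<Map.Entry<String, Long>> it = finishedAtMs.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> entry = it.next();
      if (nowMs - entry.getValue() > lingerMs) {
        String appId = entry.getKey();
        it.remove();
        removed.add(appId);
      }
    }
    Collections.sort(removed);
    return removed;
  }

  public static void main(String[] args) {
    LingerSweeper sweeper = new LingerSweeper(1000);
    sweeper.markFinished("app_1", 0);
    sweeper.markFinished("app_2", 600);
    System.out.println(sweeper.sweep(1001)); // only app_1 has lingered long enough
    System.out.println(sweeper.sweep(1601)); // now app_2 as well
  }
}
```

In a running service `sweep` would be driven by something like `scheduleWithFixedDelay` at the 500ms interval mentioned above, so an app can linger up to one interval longer than configured. Scheduling one delayed task per app avoids both the extra map and that slack.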
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086218#comment-15086218 ]

Hadoop QA commented on YARN-3995:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 7m 56s | feature-YARN-2928 passed |
| +1 | compile | 1m 57s | feature-YARN-2928 passed with JDK v1.8.0_66 |
| +1 | compile | 2m 18s | feature-YARN-2928 passed with JDK v1.7.0_91 |
| +1 | checkstyle | 0m 32s | feature-YARN-2928 passed |
| +1 | mvnsite | 1m 33s | feature-YARN-2928 passed |
| +1 | mvneclipse | 0m 40s | feature-YARN-2928 passed |
| +1 | findbugs | 3m 53s | feature-YARN-2928 passed |
| +1 | javadoc | 1m 44s | feature-YARN-2928 passed with JDK v1.8.0_66 |
| +1 | javadoc | 3m 58s | feature-YARN-2928 passed with JDK v1.7.0_91 |
| +1 | mvninstall | 1m 16s | the patch passed |
| +1 | compile | 1m 59s | the patch passed with JDK v1.8.0_66 |
| +1 | javac | 1m 59s | the patch passed |
| +1 | compile | 2m 24s | the patch passed with JDK v1.7.0_91 |
| +1 | javac | 2m 24s | the patch passed |
| -1 | checkstyle | 0m 33s | Patch generated 3 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 217, now 218). |
| +1 | mvnsite | 1m 25s | the patch passed |
| +1 | mvneclipse | 0m 31s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | xml | 0m 0s | The patch has no ill-formed XML file. |
| +1 | findbugs | 4m 15s | the patch passed |
| +1 | javadoc | 1m 28s | the patch passed with JDK v1.8.0_66 |
| +1 | javadoc | 3m 47s | the patch passed with JDK v1.7.0_91 |
| +1 | unit | 0m 22s | hadoop-yarn-api in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 1m 50s | hadoop-yarn-common in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 1m 52s | hadoop-yarn-common in the patch passed with JDK v1.8.0_66. |
| +1 | unit | 0m 24s | hadoop-yarn-api in the patch passed with JDK v1.7.0_91. |
| +1 | unit | 2m 6s | hadoop-yarn-common in the patch passed with JDK v1.7.0_91. |
| +1 | unit | 2m 5s |
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086072#comment-15086072 ]

Naganarasimha G R commented on YARN-3995:

bq. Are you thinking of cases where the AM crashes? If the app finishes normally, this sequence does not happen, right?

Well, it was just a hunch: suppose the AM finishes before its containers finish. For example, the AM may note a container as finished once the container informs it through the umbilical protocol, while the container itself has not yet finished. One possible reason is that the timeline client has not yet finished flushing the ATS events; another is any other cleanup that is still in progress.

> Some of the NM events are not getting published due race condition when AM container finishes in NM
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager, timelineserver
> Affects Versions: YARN-2928
> Reporter: Naganarasimha G R
> Assignee: Naganarasimha G R
> Labels: yarn-2928-1st-milestone
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch
>
> As discussed in YARN-3045: While testing in TestDistributedShell found out that few of the container metrics events were failing as there will be race condition. When the AM container finishes and removes the collector for the app, still there is possibility that all the events published for the app by the current NM and other NM are still in pipeline,

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070723#comment-15070723 ]

Sangjin Lee commented on YARN-3995:

bq. This is true in most of the cases, unless and until AM doesn't wait for the containers launched/requested by it to go down before it goes down.

Are you thinking of cases where the AM crashes? If the app finishes normally, this sequence does not happen, right?

bq. Yes simple linger should be sufficient, shall I make this configurable period? so that there is backup option in case of any issues and if required in future we can handle it in a better way?

Making it configurable sounds fine to me.

bq. Also is launching one thread per collector for closing it is fine?

I suspect it would be fine. Note that there would be a few collectors per NM at most.
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070716#comment-15070716 ]

Varun Saxena commented on YARN-3995:

bq. what I am trying to suggest is close/remove the collector only after a period of inactivity in the collector

That would be better. I guess what you mean is that instead of a hard timeout, we will have a rolling timeout, i.e. the timeout keeps being pushed back as entities are written. The collector will time out only once no entities have been written for the specified period.
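The rolling timeout described in this comment can be illustrated with a small sketch. It is not taken from any YARN-3995 patch: the class name `RollingLinger`, its methods, and the injected clock are hypothetical and exist only to show how each write pushes the deadline back.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Hadoop): tracks the last write seen by a
 * collector and reports expiry only after a full quiet period.
 */
final class RollingLinger {
  private final long lingerMs;
  private final LongSupplier clock; // injected so the behaviour is testable
  private long lastActivityMs;

  RollingLinger(long lingerMs, LongSupplier clock) {
    this.lingerMs = lingerMs;
    this.clock = clock;
    this.lastActivityMs = clock.getAsLong();
  }

  /** Call on every entity write; this pushes the deadline back. */
  synchronized void touch() {
    lastActivityMs = clock.getAsLong();
  }

  /** True only once no write has been seen for the whole linger period. */
  synchronized boolean isExpired() {
    return clock.getAsLong() - lastActivityMs >= lingerMs;
  }
}
```

With a hard timeout the deadline would be fixed when the AM container stops; here it moves with every call to `touch()`, so a collector that keeps receiving entities stays open.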
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070709#comment-15070709 ]

Naganarasimha G R commented on YARN-3995:

bq. If I recall, this window of opportunity is going to be quite small because any non-AM container will be completed before the app can be finished (and the AM container is completed).

This is true in most cases, unless the AM does not wait for the containers it launched or requested to go down before it goes down itself. I ran TestDistributedShell and cross-checked the logs for errors caused by the collector not being there, and didn't find any for the containers it launched. But TestDistributedShell launches only 2 containers; if we run with more containers, we may be able to see the impact.

bq. I suspect a simple linger might be sufficient, but do we see a case where we might miss writes otherwise?

Yes, a simple linger should be sufficient. Shall I make the period configurable, so that there is a backup option in case of any issues, and if required we can handle it in a better way in the future? Also, is it fine to launch one thread per collector to close it? IMO a configurable linger period is sufficient.
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070690#comment-15070690 ]

Sangjin Lee commented on YARN-3995:

If I recall, this window of opportunity is going to be quite small because any non-AM container will be completed before the app can be finished (and the AM container is completed). For this inversion to occur, there would have to be writes that originate from a remote NM that had a container (which had already been completed) but get delayed in reaching the timeline collector for some reason.

I suspect a simple linger might be sufficient, but do we see a case where we might miss writes otherwise?
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070660#comment-15070660 ]

Naganarasimha G R commented on YARN-3995:

Oops, sorry, my mistake. Thanks [~sjlee0] for correcting me.

[~sjlee0], the current code already waits for a second in a separate thread after the AM container is closed (in PerNodeTimelineCollectorsAuxService.stopContainer). The issue with that approach is that it closes the collector after 1 second even though events are still coming in. What I am suggesting is to close/remove the collector only after a period of inactivity in the collector. Would that be good, considering that the delay will usually be for metrics?

If that approach is not required, the existing approach already waits for a second in a separate thread; does it require any change? (The only drawback I can think of is that there will be a few threads if several AMs are run from a single NM.)
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070643#comment-15070643 ]

Sangjin Lee commented on YARN-3995:

That would be [~vrushalic], not me. :)

It might be a bit better if we can build this "lingering" functionality at the per-app collector level. Note that we will have an option of running per-app collectors in their own processes. It would be nice if this functionality translates to that mode without much work.

Also, note that this linger doesn't need to be too long, as we discussed offline. I think 1-2 seconds was more than enough?
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070635#comment-15070635 ]

Naganarasimha G R commented on YARN-3995:

Thanks for the comments [~sjlee0]. IIUC the 2nd point is a continuation of the first idea, right?

bq. I am not too knowledgeable about the NM and so not sure if this is complicated/infeasible.

{{PerNodeTimelineCollectorsAuxService}} can take this responsibility, so I don't see any problem on the NM side, right? I can think of a small modification on top of your idea:
* Once the NM notifies the auxiliary service that the app is finished (through the container-finished call, in the existing way), {{PerNodeTimelineCollectorsAuxService}} can move this collector to a zombie collector map.
* This map stores the last event published time for each zombie collector.
* We can have one thread running that checks which zombie collectors have been inactive for a configurable time period and then removes them.

This way none of the events are lost. For example, we can set this period to 2 minutes, and if a collector in the zombie map has not been active for 2 minutes, we remove it and close it?
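The three bullets above (a zombie map, a last-published time per entry, and one sweeper thread) can be sketched roughly as follows. This is an illustration only: `ZombieCollectorTable` and its methods are hypothetical names, the table tracks only app ids rather than real collector objects, and the sweep is exposed as a plain method that a single background thread could call periodically.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical bookkeeping for collectors whose AM container has finished. */
final class ZombieCollectorTable {
  // appId -> time of the last event published through the zombie collector
  private final Map<String, Long> lastPublishMs = new ConcurrentHashMap<>();
  private final long idleLimitMs;
  private final LongSupplier clock;

  ZombieCollectorTable(long idleLimitMs, LongSupplier clock) {
    this.idleLimitMs = idleLimitMs;
    this.clock = clock;
  }

  /** AM container finished: keep the collector, but start watching it. */
  void markZombie(String appId) {
    lastPublishMs.put(appId, clock.getAsLong());
  }

  /** A late event arrived; returns false if the app is not (or no longer) tracked. */
  boolean recordPublish(String appId) {
    return lastPublishMs.computeIfPresent(appId, (id, old) -> clock.getAsLong()) != null;
  }

  /** One pass of the sweeper: drop zombies idle for idleLimitMs or longer. */
  List<String> sweep() {
    long now = clock.getAsLong();
    List<String> removed = new ArrayList<>();
    Iterator<Map.Entry<String, Long>> it = lastPublishMs.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Long> entry = it.next();
      if (now - entry.getValue() >= idleLimitMs) {
        removed.add(entry.getKey());
        it.remove(); // a real service would also close the collector here
      }
    }
    return removed;
  }
}
```

Because `sweep()` handles every tracked app in one pass, a single thread is enough regardless of how many AMs have run on the node.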
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070185#comment-15070185 ]

Vrushali C commented on YARN-3995:

Hi [~Naganarasimha]

Thanks for the thoughts on the jira. I was wondering if the following is a feasible solution:
- Can the NM container maintain a list/map of “zombie app ids” for AMs/collectors that it is removing? That way, when metrics arrive at the NM from other NMs for those zombie app ids, it can see that this was an app that previously had a collector, so it is most likely still a valid metric/entity, and then somehow write it to the backend, perhaps via a “common parent collector” process or something.
- We can have the NM periodically prune this zombie list: say, a few days after app completion, remove the info for that app from the zombie app list.

I am not too knowledgeable about the NM and so not sure if this is complicated/infeasible.
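As a rough illustration of the routing decision proposed above, the sketch below keeps a set of live app ids and a map of zombie app ids, and decides where a late write should go. Everything in it is an assumption made for the example: the class `LateWriteRouter`, the `Route` enum, and the idea of returning a routing decision instead of actually writing to a backend.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical router for writes that arrive after a collector was removed. */
final class LateWriteRouter {
  enum Route { APP_COLLECTOR, PARENT_COLLECTOR, REJECT }

  private final Set<String> liveApps = new HashSet<>();
  // appId -> time at which its collector was removed
  private final Map<String, Long> zombieSinceMs = new HashMap<>();

  synchronized void collectorAdded(String appId) {
    liveApps.add(appId);
  }

  /** Remember the app id so that late writes can still be recognised. */
  synchronized void collectorRemoved(String appId, long nowMs) {
    if (liveApps.remove(appId)) {
      zombieSinceMs.put(appId, nowMs);
    }
  }

  synchronized Route route(String appId) {
    if (liveApps.contains(appId)) {
      return Route.APP_COLLECTOR;
    }
    if (zombieSinceMs.containsKey(appId)) {
      return Route.PARENT_COLLECTOR; // app once had a collector here
    }
    return Route.REJECT; // never seen on this node
  }

  /** Periodic pruning: forget zombies older than the retention period. */
  synchronized void prune(long nowMs, long retentionMs) {
    zombieSinceMs.values().removeIf(since -> nowMs - since >= retentionMs);
  }
}
```

The retention period is left as a parameter; the comment suggests something on the order of days, which makes this a longer-lived structure than a short linger.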
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068068#comment-15068068 ]

Naganarasimha G R commented on YARN-3995:

Hi [~sjlee0],

As per the discussion we had in the status call, we planned to stop the collector 2 seconds after the AM container finishes, but we already have code that waits for one second and then closes the collector. Now, IIUC, the scope of this jira is:
# Introduce a configurable period to wait.
# Instead of spawning multiple threads, maybe we can have a single thread which does this activity?

Or do we need to introduce something else?

bq. When RM finishes the attempt then it can send one finish event through timelineclient

IMO this also will not guarantee that no event is missed, so I think a configurable wait period is better. Thoughts?
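The two scope items above, a configurable wait and a single thread instead of one thread per collector, could look roughly like the sketch below. It uses only `java.util.concurrent`; the class `DelayedCollectorRemover`, its methods, and the string placeholders standing in for collectors are hypothetical and do not reflect the actual `PerNodeTimelineCollectorsAuxService` code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the aux service's collector bookkeeping. */
final class DelayedCollectorRemover {
  private final Map<String, String> collectors = new ConcurrentHashMap<>();
  // One daemon thread serves every application on this node.
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "collector-linger");
        t.setDaemon(true);
        return t;
      });
  private final long lingerMs; // would come from configuration in practice

  DelayedCollectorRemover(long lingerMs) {
    this.lingerMs = lingerMs;
  }

  void addCollector(String appId) {
    collectors.put(appId, "collector-for-" + appId);
  }

  boolean hasCollector(String appId) {
    return collectors.containsKey(appId);
  }

  /** Called when the AM container stops; removal happens lingerMs later. */
  ScheduledFuture<?> amContainerStopped(String appId) {
    return scheduler.schedule(
        () -> { collectors.remove(appId); }, lingerMs, TimeUnit.MILLISECONDS);
  }

  /** Blocks until a scheduled removal has run; wraps checked exceptions. */
  static void await(ScheduledFuture<?> pending) {
    try {
      pending.get();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
```

Scheduling on a shared single-threaded executor keeps the thread count constant however many AMs finish on the node, at the cost of running removals one after another.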
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650076#comment-14650076 ]

Sangjin Lee commented on YARN-3995:

Let me give it some more thought. I'll chime in early next week.
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649990#comment-14649990 ]

Naganarasimha G R commented on YARN-3995:

Hi [~sjlee0] & [~zjshen], hoping for some comments on the approach to be taken for this issue.
[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM
[ https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647017#comment-14647017 ]

Naganarasimha G R commented on YARN-3995:

Two approaches have been discussed so far:
# We can have a timer task which periodically cleans up the collector after some period, rather than removing it immediately when the AM container finishes.
# When the RM finishes the attempt, it can send one finish event through timelineclient for the ApplicationEntity; this acts as a marker that the NM's TimelineCollectorManager can act upon.