[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-07-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369730#comment-15369730
 ] 

Hudson commented on YARN-3995:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #10074 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/10074/])
YARN-3995. Some of the NM events are not getting published due race (sjlee: rev 
cc16683cefe2611cf4de7819496aa54854f5394c)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/test/java/org/apache/hadoop/yarn/server/timelineservice/collector/TestPerNodeTimelineCollectorsAuxService.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice/src/main/java/org/apache/hadoop/yarn/server/timelineservice/collector/PerNodeTimelineCollectorsAuxService.java


> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Fix For: YARN-2928
>
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch, 
> YARN-3995-feature-YARN-2928.v1.002.patch, 
> YARN-3995-feature-YARN-2928.v1.003.patch
>
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-11 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093086#comment-15093086
 ] 

Naganarasimha G R commented on YARN-3995:
-

In line with this point hence i had changed the stop order too in the latest 
patch so that we wait for the executor service and then stop the 
Collectormanager too... 

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Fix For: YARN-2928
>
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch, 
> YARN-3995-feature-YARN-2928.v1.002.patch, 
> YARN-3995-feature-YARN-2928.v1.003.patch
>
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-11 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15093087#comment-15093087
 ] 

Naganarasimha G R commented on YARN-3995:
-

Thanks for the Commit and Review, [~sjlee0] & [~varun_saxena].

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Fix For: YARN-2928
>
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch, 
> YARN-3995-feature-YARN-2928.v1.002.patch, 
> YARN-3995-feature-YARN-2928.v1.003.patch
>
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092386#comment-15092386
 ] 

Sangjin Lee commented on YARN-3995:
---

I'm +1 on the latest patch too. I'll commit it shortly.

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch, 
> YARN-3995-feature-YARN-2928.v1.002.patch, 
> YARN-3995-feature-YARN-2928.v1.003.patch
>
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-11 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092323#comment-15092323
 ] 

Varun Saxena commented on YARN-3995:


That makes sense. I am +1 on the patch.

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch, 
> YARN-3995-feature-YARN-2928.v1.002.patch, 
> YARN-3995-feature-YARN-2928.v1.003.patch
>
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-11 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092274#comment-15092274
 ] 

Sangjin Lee commented on YARN-3995:
---

I also thought about shutdownNow. The only reason I (slightly) preferred 
shutdown and waiting for that linger period is to give the "last" app a fair 
chance of being able to linger for the same amount of time. I think it is still 
a relatively minor point, though.

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch, 
> YARN-3995-feature-YARN-2928.v1.002.patch, 
> YARN-3995-feature-YARN-2928.v1.003.patch
>
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-11 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092256#comment-15092256
 ] 

Varun Saxena commented on YARN-3995:


Should we use shutdownNow instead ? We are just removing entry from a map. And 
on stop, collectors would be stopped in collector manager stop anyways.

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch, 
> YARN-3995-feature-YARN-2928.v1.002.patch, 
> YARN-3995-feature-YARN-2928.v1.003.patch
>
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-08 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090452#comment-15090452
 ] 

Naganarasimha G R commented on YARN-3995:
-

checkstyle in the latest patch is for size of {{YarnConfiguration.Java}} which 
is not directly related to my modifications !

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch, 
> YARN-3995-feature-YARN-2928.v1.002.patch, 
> YARN-3995-feature-YARN-2928.v1.003.patch
>
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090448#comment-15090448
 ] 

Hadoop QA commented on YARN-3995:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
28s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 3s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 19s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
33s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 34s 
{color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
43s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
55s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 34s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 4m 1s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
19s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 57s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 57s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 19s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 19s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 32s 
{color} | {color:red} Patch generated 1 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 216, now 215). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 29s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
34s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 31s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 50s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 20s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 51s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 50s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_91. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 5s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 6s 
{color} | {color

[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-08 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090213#comment-15090213
 ] 

Sangjin Lee commented on YARN-3995:
---

It looks pretty good. Thanks for updating the patch [~Naganarasimha].

One last minor point is the shutdown() call in serviceStop(). It might be a 
good idea to wait a little bit to ensure the executor service is shut down. 
It's virtually certain all outstanding tasks will finish within the linger 
period, so something like the following is slightly more helpful:
{code}
scheduler.shutdown();
if (!scheduler.awaitTermination(collectorLingerPeriod, TimeUnit.MILLISECONDS)) {
  LOG.warn(...);
}
...
{code}

Also, are some of the checkstyle warnings fixable?

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch, 
> YARN-3995-feature-YARN-2928.v1.002.patch
>
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089719#comment-15089719
 ] 

Hadoop QA commented on YARN-3995:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 9m 
56s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 50s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 14s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
34s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 32s 
{color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
43s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
55s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 31s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 44s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
14s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 46s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 46s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 11s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 11s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 32s 
{color} | {color:red} Patch generated 3 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 216, now 217). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 24s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 8s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 22s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 40s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 20s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 46s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 46s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 22s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_91. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 2s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 2s 
{color} | {colo

[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-07 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088636#comment-15088636
 ] 

Naganarasimha G R commented on YARN-3995:
-

Oops my mistake... assumed the interface wrongly !. its similar to the timer 
service where in we can say when the task to be executed, got it will correct 
it !

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch
>
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-07 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088549#comment-15088549
 ] 

Sangjin Lee commented on YARN-3995:
---

bq. Yes i wanted to address it as i was trying to point out earlier Instead of 
spawning multiple threads may be we can have single thread which does this 
activity

Oops, sorry. I didn't see you already mentioned this.

{quote}
IIUC the approach you mentioned in the callable we will be sleeping for the 
configured period for a application and then remove it. but if multiple apps at 
the same time finish then initial apps only wait for configured period but 
subsequent apps wait for lil more time than the earlier ones.(app's wait period 
+ other apps wait period in the queue ) thoughts?
{quote}

ScheduledExecutorService is much more straightforward than that. We can simply 
take advantage of the scheduling feature. The Runnable (or Callable, doesn't 
matter) can simply execute removeApplication():

{code}
ScheduledExecutorService scheduler = 
Executors.newSingleThreadScheduledExecutor();
...
public void stopContainer(ContainerTerminationContext context) {
  ...
  scheduler.schedule(new Runnable() {
public void run() {
  removeApplicationId(appId);
}
  }, collectorLingerPeriod, TimeUnit.MILLISECONDS);
}
{code}

It doesn't do this by actually putting the executor service thread to sleep for 
that period, thus there is no worry about delays propagating to the next work 
item. The delay management is all done using the internal queue that 
understands the delays.

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch
>
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-07 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088490#comment-15088490
 ] 

Naganarasimha G R commented on YARN-3995:
-

bq. Instead of spawning multiple threads may be we can have single thread which 
does this activity ?
Yes i wanted to address it as i was trying to point out earlier ??Instead of 
spawning multiple threads may be we can have single thread which does this 
activity??

bq. How about creating a long-lived single ScheduledExecutorService and 
schedule removeApplication() with the specified delay?
IIUC the approach you mentioned in the callable we will be sleeping for the 
configured period for a application and then remove it. but if multiple apps at 
the same time finish then initial apps only wait for configured period but 
subsequent apps wait for lil more time than the earlier ones.(app's wait period 
+ other apps wait period in the queue ) thoughts?
Some approaches i can adopt to avoid the above issue are :
* Have the timestamp when *close AM container* was called in the callable, and 
in the callable we can have code to wait only if the elapsed time < configured 
linger time.
* Have a map and a single thread(either executor service/ 
timer task) with lower interval like 500ms and it can check this map and remove 
all the apps whose elapsed time is > configured linger time.
thoughts ?

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch
>
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086218#comment-15086218
 ] 

Hadoop QA commented on YARN-3995:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
56s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 57s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 18s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
32s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 33s 
{color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
40s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 
53s {color} | {color:green} feature-YARN-2928 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 44s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 58s 
{color} | {color:green} feature-YARN-2928 passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 59s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 59s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 2m 24s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 33s 
{color} | {color:red} Patch generated 3 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn (total was 217, now 218). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 25s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 0s 
{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 28s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 3m 47s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 22s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.8.0_66. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 50s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 52s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.8.0_66. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 24s 
{color} | {color:green} hadoop-yarn-api in the patch passed with JDK v1.7.0_91. 
{color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 6s 
{color} | {color:green} hadoop-yarn-common in the patch passed with JDK 
v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 5s 
{color} | {col

[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2016-01-06 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086072#comment-15086072
 ] 

Naganarasimha G R commented on YARN-3995:
-

bq. Are you thinking of cases where the AM crashes? If the app finishes 
normally, this sequence does not happen, right?
Well was just having a hunch that suppose AM finishes before its containers 
finishes (like AM will note once container informs AM through umbilical 
protocol that its finished but may be container is not yet finished one of the 
possible reasons being Timeline client has not yet finished flushing the ATS 
events or any other reason for cleaning up)


> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
> Attachments: YARN-3995-feature-YARN-2928.v1.001.patch
>
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2015-12-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070723#comment-15070723
 ] 

Sangjin Lee commented on YARN-3995:
---

bq. This is true in most of the cases, unless and untill AM doesn't wait for 
the containers launched/requested by it to go down before it goes down.

Are you thinking of cases where the AM crashes? If the app finishes normally, 
this sequence does not happen, right?

bq. Yes simple linger should be sufficient, shall i make this configurable 
period ? so that there is backup option in case of any issues and if required 
in future we can handle it in a better way ?

Making it configurable sounds fine to me.

bq. Also is launching one thread per collector for closing it is fine ?

I suspect it would be fine. Note that there would be a few collectors per NM at 
most.

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2015-12-23 Thread Varun Saxena (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070716#comment-15070716
 ] 

Varun Saxena commented on YARN-3995:


bq. what i am trying to suggest is close/remove the collector only after a 
period of inactivity in the collector
That would be better. I guess what you mean is that instead of hard timeout, we 
will have rolling timeout i.e. timeout will keep on being pushed as entities 
are written. It will only timeout once no entities are being written for the 
specified period.

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2015-12-23 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070709#comment-15070709
 ] 

Naganarasimha G R commented on YARN-3995:
-

bq. If I recall, this window of opportunity is going to be quite small because 
any non-AM container will be completed before the app can be finished (and the 
AM container is completed).
This is true in most of the cases, unless and untill AM doesn't wait for the 
containers launched/requested by it to go down before it goes down. 
I ran TestDistributedShell and cross verified the logs for any errors due to 
collector being not there and din't find any for the containers launched by it. 
But TestDistributedShell launches only 2 containers if we run with more 
container then can find the impact.

bq. I suspect a simple linger might be sufficient, but do we see a case where 
we might miss writes otherwise?
Yes simple linger should be sufficient, shall i make this configurable period ? 
so that there is backup option in case of any issues and if required in future 
we can handle it in a better way ? Also is launching one thread per collector 
for closing it is fine ? IMO configurable linger period is sufficient 


> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2015-12-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070690#comment-15070690
 ] 

Sangjin Lee commented on YARN-3995:
---

If I recall, this window of opportunity is going to be quite small because any 
non-AM container will be completed before the app can be finished (and the AM 
container is completed). For this inversion to occur, there would have to be 
writes that originate from a remote NM that had a container (which had already 
been completed) but get delayed in reaching the timeline collector for some 
reason.

I suspect a simple linger might be sufficient, but do we see a case where we 
might miss writes otherwise?

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2015-12-23 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070660#comment-15070660
 ] 

Naganarasimha G R commented on YARN-3995:
-

Oops, Sorry my mistake ,
Thanks [~sjlee0] for correcting me. [~sjlee0] current code is already waiting 
for a second in a separate thread after AM container is closed (in 
PerNodeTimelineCollectorsAuxService.stopContainer), but the issue with that 
approach is: it just closes after 1 second though the events are still coming, 
but what i am trying to suggest is close/remove the collector only after a 
period of inactivity in the collector. Will that be good considering it will be 
usually getting delayed for metrics ?
if above approach is not required then already existing approach waits for a 
second in separate thread, does it req any change ? (least i can think is few  
threads will be there if more AM's are run from a single NM )

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2015-12-23 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070643#comment-15070643
 ] 

Sangjin Lee commented on YARN-3995:
---

That would be [~vrushalic], not me. :)

It might be bit better if we can build this "lingering" functionality at the 
per-app collector level. Note that we will have an option of running per-app 
collectors in their own processes. It would be nice if this functionality 
translates to that mode without much work.

Also, note that this linger doesn't need to be too long as we discussed 
offline. I think 1-2 seconds was more than enough?

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2015-12-23 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070635#comment-15070635
 ] 

Naganarasimha G R commented on YARN-3995:
-

Thanks for the comments [~sjlee0], 
IIUC 2nd point is continuation of the first idea right ?
bq. I am not too knowledgeable about the NM and so not sure if this is 
complicated/infeasible.
{{PerNodeTimelineCollectorsAuxService}} can take this responsibility so i don't 
see any problem to it with NM, right ?

I can think of little modification on top of your idea,
* Once NM notifies the Auxillary service that the app is finished (by container 
finished call in the existing way), {{PerNodeTimelineCollectorsAuxService}} can 
add move this collector to a  zombie collector Map. 
* This map stores the last event published time for the zombie collector. 
* We can have one thread running to check which zombie collector is inactive 
for configurable time period and then remove it

 Thus none of the events are lost till the end. like we can keep this period as 
2 mins and if the collector in the zombie list not active for 2 mins then 
remove it and close it   ?

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2015-12-23 Thread Vrushali C (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070185#comment-15070185
 ] 

Vrushali C commented on YARN-3995:
--

Hi [~Naganarasimha]

Thanks for the thoughts on the jira. I was wondering if the following is a 
feasible solution:

- can the NM container maintain a list/map info of  “zombie app ids” for 
AMs/collectors that it is removing?  That way when metrics arrive at the NM 
from other NMs for those zombie app ids, it can see if this was for an app that 
previously had a collector and hence most likely still a valid metric/entity 
and then somehow write that to the backend, perhaps via a “common parent 
collector” process or something.

- we can have the NM periodically prune  this zombie list, perhaps say a few 
days after app completion, remove the info for that app from the zombie app 
list.

I am not too knowledgeable about the NM and so not sure if this is 
complicated/infeasible. 


> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2015-12-22 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15068068#comment-15068068
 ] 

Naganarasimha G R commented on YARN-3995:
-

Hi [~sjlee0],
As per the discussion we had in the status call, we planned to stop the 
collector after 2 seconds of the AM container finished, but already we are 
having a code which waits for one second and then closes the collector.
Now IIUC the scope of this jira :
# Introduce a configurable period to wait 
# Instead of spawning multiple threads may be we can have single thread which 
does this activity ?

Or do we need to introduce some thing else ?
bq When RM finishes the attempt then it can send one finish event through 
timelineclient
IMO this also will not gurantee that no event is missed. So i think 
configurable wait period is better. Thoughts ?


> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>  Labels: yarn-2928-1st-milestone
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2015-07-31 Thread Sangjin Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650076#comment-14650076
 ] 

Sangjin Lee commented on YARN-3995:
---

Let me give it some more thought. I'll chime in early next week.

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2015-07-31 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14649990#comment-14649990
 ] 

Naganarasimha G R commented on YARN-3995:
-

Hi [~sjlee0] & [~zjshen],
Hoping for some comments on the approach which has to be taken for this issue.

> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3995) Some of the NM events are not getting published due race condition when AM container finishes in NM

2015-07-29 Thread Naganarasimha G R (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647017#comment-14647017
 ] 

Naganarasimha G R commented on YARN-3995:
-

Two approaches were discussed till now :
#  we can have timer task which periodically cleans up collector after some 
period and not imm remove it when AM container is finished.
# When RM finishes the attempt then it can send one finish event through 
timelineclient for the ApplicationEntity which is kind of a marker based on 
which NM's TimelineCollectorManager can act upon.



> Some of the NM events are not getting published due race condition when AM 
> container finishes in NM 
> 
>
> Key: YARN-3995
> URL: https://issues.apache.org/jira/browse/YARN-3995
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, timelineserver
>Affects Versions: YARN-2928
>Reporter: Naganarasimha G R
>Assignee: Naganarasimha G R
>
> As discussed in YARN-3045:  While testing in TestDistributedShell found out 
> that few of the container metrics events were failing as there will be race 
> condition. When the AM container finishes and removes the collector for the 
> app, still there is possibility that all the events published for the app by 
> the current NM and other NM are still in pipeline, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)