[jira] [Updated] (YARN-4350) TestDistributedShell fails
[ https://issues.apache.org/jira/browse/YARN-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Naganarasimha G R updated YARN-4350:
    Attachment: YARN-4350-feature-YARN-2928.008.patch

Hi [~sjlee0],
I have given my analysis of this issue in a [comment|https://issues.apache.org/jira/browse/YARN-2859?focusedCommentId=15005746&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15005746] on YARN-2859. I am uploading a patch just to check whether my intended approach fixes the issue; please try it if possible. I had faced this race condition earlier too, but it happened too irregularly. I had also mentioned it in another JIRA, but was waiting for YARN-3127 to be verified. If so, this fix needs to be made after that, as it has other impacts, as mentioned in YARN-3127.

> TestDistributedShell fails
> --------------------------
>
>                 Key: YARN-4350
>                 URL: https://issues.apache.org/jira/browse/YARN-4350
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Sangjin Lee
>            Assignee: Naganarasimha G R
>         Attachments: YARN-4350-feature-YARN-2928.008.patch
>
>
> Currently TestDistributedShell does not pass on the feature-YARN-2928 branch.
> There seem to be 2 distinct issues.
> (1) testDSShellWithoutDomainV2* tests fail sporadically
> These tests fail more often than not if tested by themselves:
> {noformat}
> testDSShellWithoutDomainV2DefaultFlow(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)  Time elapsed: 30.998 sec  <<< FAILURE!
> java.lang.AssertionError: Application created event should be published atleast once expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.checkTimelineV2(TestDistributedShell.java:451)
>   at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:326)
>   at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow(TestDistributedShell.java:207)
> {noformat}
> They started happening after YARN-4129. I suspect this might have to do with some timing issue.
> (2) the whole test times out
> If you run the whole TestDistributedShell test, it times out without fail.
> This may or may not have to do with the port change introduced by YARN-2859 (just a hunch).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-4350) TestDistributedShell fails
[ https://issues.apache.org/jira/browse/YARN-4350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005748#comment-15005748 ]

Naganarasimha G R commented on YARN-4350:
-----------------------------------------

Part of the issue is caused by the fix in YARN-2859.

> TestDistributedShell fails
> --------------------------
>
>                 Key: YARN-4350
>                 URL: https://issues.apache.org/jira/browse/YARN-4350
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Sangjin Lee
>            Assignee: Naganarasimha G R
>
> Currently TestDistributedShell does not pass on the feature-YARN-2928 branch.
> There seem to be 2 distinct issues.
> (1) testDSShellWithoutDomainV2* tests fail sporadically
> These tests fail more often than not if tested by themselves:
> {noformat}
> testDSShellWithoutDomainV2DefaultFlow(org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell)  Time elapsed: 30.998 sec  <<< FAILURE!
> java.lang.AssertionError: Application created event should be published atleast once expected:<1> but was:<0>
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:743)
>   at org.junit.Assert.assertEquals(Assert.java:118)
>   at org.junit.Assert.assertEquals(Assert.java:555)
>   at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.checkTimelineV2(TestDistributedShell.java:451)
>   at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShell(TestDistributedShell.java:326)
>   at org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell.testDSShellWithoutDomainV2DefaultFlow(TestDistributedShell.java:207)
> {noformat}
> They started happening after YARN-4129. I suspect this might have to do with some timing issue.
> (2) the whole test times out
> If you run the whole TestDistributedShell test, it times out without fail.
> This may or may not have to do with the port change introduced by YARN-2859 (just a hunch).
[jira] [Commented] (YARN-2859) ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster
[ https://issues.apache.org/jira/browse/YARN-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005746#comment-15005746 ]

Naganarasimha G R commented on YARN-2859:
-----------------------------------------

Hi [~vinodkv] & [~sjlee0],
Though the addendum patch fixes the TestDistributedShell issue in 2.7.2, it had impacts in the ATSv2 branch. On further checking, I realized that in trunk and 2.7.2, {{yarn.resourcemanager.system-metrics-publisher.enabled}} was not set to true in TestDistributedShell.setupInternal, but it is required to be set in the ATSv2 branch. While trying to rectify this, I faced the following issues:
# In MiniYARNCluster, the RM service wrapper is added first and then the AHS wrapper, and the actual AHS service is started in a thread, so the RM will be using the wrong timeline client address (port is zero) because the AHS service is not yet initialized.
# The URI for the timeline REST service is set in TimelineClientImpl's *serviceInit*. So even if we create the correct service order (as per the previous step), the RM's SMP will fail to publish, as the timeline web address is available only after the AHS service is started.

Even after this (though I got the right port), I was still facing some issues. So if *MiniYARNCluster is required to be used with system-metrics-publisher enabled*, we either need to start correcting a series of issues or use the simpler option {{ServerSocketUtil.getPort(9188, 10)}}, which I feel is safer and is used in many other places. But it would require different patches for 2.6.2!

> ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster
> ----------------------------------------------------------------------
>
>                 Key: YARN-2859
>                 URL: https://issues.apache.org/jira/browse/YARN-2859
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: timelineserver
>            Reporter: Hitesh Shah
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Critical
>             Fix For: 2.8.0, 2.7.2, 2.6.3
>
>         Attachments: YARN-2859-addendum.txt, YARN-2859.txt
>
>
> In mini cluster, a random port should be used.
> Also, the config is not updated to the host that the process got bound to.
> {code}
> 2014-11-13 13:07:01,905 INFO  [main] server.MiniYARNCluster (MiniYARNCluster.java:serviceStart(722)) - MiniYARN ApplicationHistoryServer address: localhost:10200
> 2014-11-13 13:07:01,905 INFO  [main] server.MiniYARNCluster (MiniYARNCluster.java:serviceStart(724)) - MiniYARN ApplicationHistoryServer web address: 0.0.0.0:8188
> {code}
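The comment above proposes {{ServerSocketUtil.getPort(9188, 10)}} as the simpler option. The sketch below illustrates the general idea of such a helper: try a preferred port and fall back to random ports for a bounded number of retries. It is not Hadoop's implementation; the class name FreePortPicker and the method pickPort are hypothetical, and the fallback range is an assumption.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.util.Random;

/** Hypothetical helper for illustration; this is NOT Hadoop's ServerSocketUtil. */
final class FreePortPicker {
    private static final Random RANDOM = new Random();

    /**
     * Returns preferredPort if it can be bound right now; otherwise tries up to
     * `retries` randomly chosen ports and returns the first one that binds.
     */
    static int pickPort(int preferredPort, int retries) {
        int candidate = preferredPort;
        for (int attempt = 0; ; attempt++) {
            try (ServerSocket socket = new ServerSocket(candidate)) {
                // Binding succeeded, so the port was free at this moment.
                return socket.getLocalPort();
            } catch (IOException e) {
                if (attempt >= retries) {
                    throw new IllegalStateException(
                        "No free port found after " + retries + " retries", e);
                }
                // Assumed fallback: a random port in the unprivileged range.
                candidate = 1024 + RANDOM.nextInt(65535 - 1024);
            }
        }
    }

    public static void main(String[] args) {
        int port = pickPort(9188, 10);
        System.out.println("timeline web address: localhost:" + port);
    }
}
```

Note that the probe socket is closed before the caller binds the port, so another process can still take the port in between. Choosing the port up front does, however, let every service read the same address from the configuration before any of them starts, which sidesteps the ordering problems listed in the comment.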
[jira] [Commented] (YARN-4348) ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
[ https://issues.apache.org/jira/browse/YARN-4348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005722#comment-15005722 ]

Tsuyoshi Ozawa commented on YARN-4348:
--------------------------------------

I found that testZKRootPathAcls fails with a timeout when the patch is applied. I will look into it more deeply.
{code:title=TestZKRMStateStore-output.txt|}
2015-11-15 02:15:17,324 INFO  [main] zookeeper.JUnit4ZKTestRunner (JUnit4ZKTestRunner.java:evaluate(50)) - RUNNING TEST METHOD testZKRootPathAcls
...
...
2015-11-15 02:30:12,774 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(88)) - Processing request:: sessionid:0x15108ecd3b20001 type:ping cxid:0xfffe zxid:0xfffe txntype:unknown reqpath:n/a
2015-11-15 02:30:12,774 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(160)) - sessionid:0x15108ecd3b20001 type:ping cxid:0xfffe zxid:0xfffe txntype:unknown reqpath:n/a
2015-11-15 02:30:12,775 DEBUG [main-SendThread(127.0.0.1:11221)] zookeeper.ClientCnxn (ClientCnxn.java:readResponse(717)) - Got ping response for sessionid: 0x15108ecd3b20001 after 0ms
2015-11-15 02:30:14,776 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(88)) - Processing request:: sessionid:0x15108ecd3b20001 type:ping cxid:0xfffe zxid:0xfffe txntype:unknown reqpath:n/a
2015-11-15 02:30:14,776 DEBUG [SyncThread:0] server.FinalRequestProcessor (FinalRequestProcessor.java:processRequest(160)) - sessionid:0x15108ecd3b20001 type:ping cxid:0xfffe zxid:0xfffe txntype:unknown reqpath:n/a
2015-11-15 02:30:14,776 DEBUG [main-SendThread(127.0.0.1:11221)] zookeeper.ClientCnxn (ClientCnxn.java:readResponse(717)) - Got ping response for sessionid: 0x15108ecd3b20001 after 0ms
~
{code}

> ZKRMStateStore.syncInternal should wait for zkResyncWaitTime instead of zkSessionTimeout
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-4348
>                 URL: https://issues.apache.org/jira/browse/YARN-4348
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.2, 2.6.2
>            Reporter: Tsuyoshi Ozawa
>            Assignee: Tsuyoshi Ozawa
>         Attachments: YARN-4348.001.patch, YARN-4348.001.patch, log.txt
>
>
> Jian mentioned that the current internal ZK configuration of ZKRMStateStore can cause the following situation:
> 1. syncInternal times out,
> 2. but sync succeeds later on.
> We should use zkResyncWaitTime as the timeout value.
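The situation described in the issue (the wait gives up, but the sync completes afterwards) can be reproduced in miniature with a latch and two different wait budgets. This is only a sketch of the timing pattern, not the ZKRMStateStore code: SyncWaitSketch, syncAndWait, and the millisecond values are hypothetical, and a background thread stands in for the asynchronous ZooKeeper sync callback.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of waiting on an asynchronous sync with a bounded timeout. */
final class SyncWaitSketch {

    /**
     * Starts a simulated sync whose completion callback fires after
     * completionDelayMs, then waits at most waitMs for it.
     * Returns true only if the callback fired within the wait budget.
     */
    static boolean syncAndWait(long completionDelayMs, long waitMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread callback = new Thread(() -> {
            try {
                Thread.sleep(completionDelayMs);
                done.countDown(); // the sync "succeeds" here, possibly after the caller gave up
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        callback.setDaemon(true);
        callback.start();
        try {
            return done.await(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Wait budget shorter than the completion delay: the caller reports a timeout.
        System.out.println("short wait observed sync: " + syncAndWait(500, 50));
        // Wait budget longer than the completion delay: the caller observes the success.
        System.out.println("long wait observed sync:  " + syncAndWait(100, 5000));
    }
}
```

The point of the sketch is that the outcome reported to the caller depends on which timeout value bounds the wait, which is why the issue asks for zkResyncWaitTime rather than zkSessionTimeout in syncInternal.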
[jira] [Commented] (YARN-4183) Enabling generic application history forces every job to get a timeline service delegation token
[ https://issues.apache.org/jira/browse/YARN-4183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005342#comment-15005342 ]

Varun Saxena commented on YARN-4183:
------------------------------------

bq. i feel yarn.resourcemanager.system-metrics-publisher.enabled is sufficient to be configured.
Agree. Enabling the system metrics publisher should be considered enough to publish events from the RM.

bq. As far as i view it "yarn.timeline-service.enabled"* name is misleading, it should be more to signify client requires the timeline service's delegation token.
Maybe we can use the version config to decide whether we have to fetch a token or not (in addition to the timeline service enabled config?).

> Enabling generic application history forces every job to get a timeline service delegation token
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4183
>                 URL: https://issues.apache.org/jira/browse/YARN-4183
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Mit Desai
>            Assignee: Mit Desai
>         Attachments: YARN-4183.1.patch
>
>
> When enabling just the Generic History Server and not the timeline server, the system metrics publisher will not publish the events to the timeline store as it checks if the timeline server and system metrics publisher are enabled before creating a timeline client.
> To make it work, if the timeline service flag is turned on, it will force every yarn application to get a delegation token.
> Instead of checking if timeline service is enabled, we should be checking if application history server is enabled.
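One way to read the suggestion in the comment above is as two independent decisions: the RM publishes when the system metrics publisher flag is set, and a client fetches a delegation token only when the timeline service is enabled and the configured version matches the one that needs a token. The sketch below uses a plain Map in place of a YARN Configuration. The two flag names come from this thread; the version key "example.timeline-service.version", the class, and both methods are hypothetical, and which version would require a token is an assumption left to the caller.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the flag checks discussed in the comment; not YARN code. */
final class TimelineFlagsSketch {
    static final String TIMELINE_ENABLED = "yarn.timeline-service.enabled";
    static final String SMP_ENABLED =
        "yarn.resourcemanager.system-metrics-publisher.enabled";
    // Hypothetical key: the thread only says "the version config" without naming it.
    static final String TIMELINE_VERSION = "example.timeline-service.version";

    /** RM side: the publisher flag alone decides whether events are published. */
    static boolean rmShouldPublish(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(SMP_ENABLED, "false"));
    }

    /**
     * Client side: fetch a token only if the timeline service is enabled AND the
     * configured version equals the version assumed to require a token.
     */
    static boolean clientShouldFetchToken(Map<String, String> conf,
                                          String versionNeedingToken) {
        boolean enabled =
            Boolean.parseBoolean(conf.getOrDefault(TIMELINE_ENABLED, "false"));
        return enabled && versionNeedingToken.equals(conf.get(TIMELINE_VERSION));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(SMP_ENABLED, "true"); // only the publisher is switched on
        System.out.println("RM publishes: " + rmShouldPublish(conf));
        System.out.println("client fetches token: " + clientShouldFetchToken(conf, "1"));
    }
}
```

With only the publisher flag set, the RM-side check passes while the client-side check does not, which is the behaviour the issue asks for: publishing history should not force every application to obtain a timeline delegation token.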