[ https://issues.apache.org/jira/browse/YARN-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552091#comment-14552091 ]
Rohith commented on YARN-3646:
------------------------------

Thanks for updating the patch. Some comments on the tests:
# I think we can remove the tests added in the hadoop-common project, since yarn-client verifies the required functionality. The hadoop-common test was basically mocking the RMProxy functionality, so it was passing even without the RMProxy fix.
# The code never reaches {{Assert.fail("");}}, so it is better to remove it.
# Catch ApplicationNotFoundException instead of catching Throwable. I think you can add {{expected = ApplicationNotFoundException.class}} to the @Test annotation, like below.
{code}
@Test(timeout = 30000, expected = ApplicationNotFoundException.class)
public void testClientWithRetryPolicyForEver() throws Exception {
  YarnConfiguration conf = new YarnConfiguration();
  conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1);

  ResourceManager rm = null;
  YarnClient yarnClient = null;
  try {
    // start rm
    rm = new ResourceManager();
    rm.init(conf);
    rm.start();

    yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // create invalid application id
    ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645);

    // RM should throw ApplicationNotFoundException exception
    yarnClient.getApplicationReport(appId);
  } finally {
    if (yarnClient != null) {
      yarnClient.stop();
    }
    if (rm != null) {
      rm.stop();
    }
  }
}
{code}
# Can you rename the test to reflect the functionality it actually tests, like {{testShouldNotRetryForeverForNonNetworkExceptions}}?

> Applications are getting stuck some times in case of retry policy forever
> -------------------------------------------------------------------------
>
>                 Key: YARN-3646
>                 URL: https://issues.apache.org/jira/browse/YARN-3646
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>            Reporter: Raju Bairishetti
>         Attachments: YARN-3646.001.patch, YARN-3646.patch
>
>
> We have set *yarn.resourcemanager.connect.max-wait.ms* to -1 to use the FOREVER retry policy.
> Yarn client is infinitely retrying in case of exceptions from the RM, as it is using the FOREVER retry policy. The problem is that it retries for all kinds of exceptions (like ApplicationNotFoundException), even when the failure is not a connection failure. Due to this, my application is not progressing further.
> *Yarn client should not retry infinitely in case of non connection failures.*
> We have written a simple yarn-client which tries to get an application report for an invalid or older appId. ResourceManager throws an ApplicationNotFoundException, as this is an invalid or older appId. But because of the FOREVER retry policy, the client keeps retrying to get the application report, and ResourceManager keeps throwing ApplicationNotFoundException.
> {code}
> private void testYarnClientRetryPolicy() throws Exception {
>   YarnConfiguration conf = new YarnConfiguration();
>   conf.setInt(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, -1);
>   YarnClient yarnClient = YarnClient.createYarnClient();
>   yarnClient.init(conf);
>   yarnClient.start();
>   ApplicationId appId = ApplicationId.newInstance(1430126768987L, 10645);
>   ApplicationReport report = yarnClient.getApplicationReport(appId);
> }
> {code}
> *RM logs:*
> {noformat}
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 21 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875162 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1430126768987_10645' doesn't exist in RM.
>         at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284)
>         at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145)
>         at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
> ....
> 15/05/14 16:33:24 INFO ipc.Server: IPC Server handler 47 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.14.120.231:61621 Call#875163 Retry#0
> ....
> {noformat}
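
The behaviour asked for in the description (keep retrying on connection failures, fail fast on everything else) can be sketched without any Hadoop dependencies. This is only an illustration under stated assumptions, not the YARN-3646 patch and not how RMProxy is implemented: {{RetrySketch}}, {{callWithRetry}} and {{AppNotFoundException}} are hypothetical names, and a negative {{maxAttempts}} is assumed to play the role of the -1 setting.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;

// Hypothetical, JDK-only sketch; none of these names come from Hadoop.
class RetrySketch {

    /** Stand-in for a server-side "application does not exist" error. */
    static class AppNotFoundException extends Exception {
        AppNotFoundException(String message) {
            super(message);
        }
    }

    /**
     * Invokes the call, retrying only when it fails with ConnectException.
     * A negative maxAttempts means "retry forever" (assumed analogue of -1).
     * Any other exception is rethrown to the caller on the first failure.
     */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long sleepMs)
            throws Exception {
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                return call.call();
            } catch (ConnectException e) {
                // Connection failure: give up only if a finite limit is reached.
                if (maxAttempts >= 0 && attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(sleepMs);
            }
            // Non-connection exceptions are not caught, so they propagate.
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};

        // Fails twice with a connection error, then succeeds.
        String report = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new ConnectException("RM not reachable yet");
            }
            return "report";
        }, -1, 10);
        System.out.println(report + " after " + calls[0] + " attempts");

        // A non-connection error is surfaced immediately, even with "forever".
        try {
            callWithRetry(() -> {
                throw new AppNotFoundException("application does not exist");
            }, -1, 10);
        } catch (AppNotFoundException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

The design choice in this sketch is to whitelist the retriable exception type rather than blacklist the non-retriable ones, so an unexpected server-side error can never be retried indefinitely by accident.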