[ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174783#comment-14174783 ]
Zhijie Shen commented on YARN-2673: ----------------------------------- [~gtCarrera], thanks for working on this issue. bq. Then, on top of this, we can add retry mechanism to delegation token calls for secured clusters. GetDelegationToken is not via the jersey client. It needs to be fixed separately. bq. We agreed that blocking the timeline client for a short while is better, since we may not want to drop some critical timeline data. Make sense to me. The jersey filter approach looks fine to me. I've some comments w.r.t. the patch: 1. The variable is only used for testing? Shall we mark it \@VisibleForTesting, and let the test case access it directly? Also maxRetries and retryInterval. After retried is set to true first time. It is always true, which means it's not useful for asserting the second request. Do we really need it? Or we can spy the filter, and check how many times getNext() has been invoked? {code} private boolean retried = false; {code} 2. Use "public synchronized void changeRetrySettings(int maxRetries, long interval) \{"? {code} // Customize retry settings synchronously public void changeRetrySettings(int maxRetries, long interval) { synchronized(this) { this.maxRetries = maxRetries; this.retryInterval = interval; } } {code} 3. Why can it be -1 given the configuration is correct? {code} } else { // note that maxRetries may be -1 at the very beginning LOG.info("Connection Timeout (" + cr.getURI() + "), will keep retrying."); } {code} 4. response is unused. And we need to add Assert.fail() after this statement to ensure the test case will fail if the exception is not thrown. {code} TimelinePutResponse response = client.putEntities(generateEntity()); {code} 5. Is this API really necessary? We don't have similar configurations for RPC based APIs, for example YarnClient methods, right? According to the discussion on the umbrella YARN-1530, it's still arguable about how to make timeline client reliable, which may affect this configuration. If we add this API to timeline client, we have to support it forwardly. {code} @Public public abstract void customizeRetrySettings(int maxRetries, long retryIntervalMs); {code} > Add retry for timeline client put APIs > -------------------------------------- > > Key: YARN-2673 > URL: https://issues.apache.org/jira/browse/YARN-2673 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Li Lu > Assignee: Li Lu > Attachments: YARN-2673-101414-1.patch, YARN-2673-101414-2.patch, > YARN-2673-101414.patch > > > Timeline client now does not handle the case gracefully when the server is > down. Jobs from distributed shell may fail due to ATS restart. We may need to > add some retry mechanisms to the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)