[ 
https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Lu updated YARN-2673:
------------------------
    Attachment: YARN-2673-101414.patch

Uploaded a patch for this issue. By default, TimelineClient will now retry for 
a given amount of time before throwing an exception when posting to the server. 
A few notes:

1. Retrying vs. discarding timeline data: Without this retry, the timeline 
client drops the posted data if the first attempt fails. I had an offline 
discussion with [~vinodkv]. We agreed that blocking the timeline client for a 
short while is better, since we may not want to drop critical timeline data.

2. Retry behavior configuration: Users can define the maximum retry count and 
the time interval between consecutive retries. We may want two levels of retry 
settings: a cluster-wide setting, managed in yarn-site.xml, and a 
per-application custom setting. For the cluster setting, I've added two 
configuration properties, yarn.timeline-service.client.max-retries (default 30) 
and yarn.timeline-service.client.retry-interval-ms (default 1000). I've also 
provided a customizeRetrySettings method for application-specific retry 
settings.

3. Retry implementation: The timeline client does not use RPC; it uses RESTful 
APIs. I'm implementing retry as a Jersey filter in this patch.

4. Tests: I added two new unit tests, one to test the customizeRetrySettings 
API and the other to verify that retries actually happen when we try to post 
timeline entities.
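
For illustration only, here is a minimal sketch of the retry semantics 
described in notes 2 and 3. It is not the code in the patch: the patch 
implements retry as a Jersey filter, while this sketch wraps a plain Callable 
so that it runs without any dependencies. The class name TimelineRetrySketch, 
the callWithRetry helper, the customizeRetrySettings signature, and the choice 
to retry only on IOException are all assumptions. The property names and 
defaults are the ones listed above. The sketch also assumes that max-retries 
counts retries after the first attempt.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.Callable;

/** Hypothetical stand-in for the client's retry logic; not the patch's code. */
public class TimelineRetrySketch {
    // Property names and defaults as listed in note 2.
    static final String MAX_RETRIES_KEY =
        "yarn.timeline-service.client.max-retries";
    static final String RETRY_INTERVAL_KEY =
        "yarn.timeline-service.client.retry-interval-ms";
    static final int DEFAULT_MAX_RETRIES = 30;
    static final long DEFAULT_RETRY_INTERVAL_MS = 1000L;

    private int maxRetries;
    private long retryIntervalMs;

    /** Cluster-level settings: read from a config map, else use defaults. */
    public TimelineRetrySketch(Map<String, String> clusterConf) {
        String retries = clusterConf.get(MAX_RETRIES_KEY);
        String interval = clusterConf.get(RETRY_INTERVAL_KEY);
        this.maxRetries =
            retries == null ? DEFAULT_MAX_RETRIES : Integer.parseInt(retries);
        this.retryIntervalMs =
            interval == null ? DEFAULT_RETRY_INTERVAL_MS : Long.parseLong(interval);
    }

    /** Per-application override (assumed signature). */
    public void customizeRetrySettings(int maxRetries, long retryIntervalMs) {
        if (maxRetries < 0 || retryIntervalMs < 0) {
            throw new IllegalArgumentException("retry settings must be >= 0");
        }
        this.maxRetries = maxRetries;
        this.retryIntervalMs = retryIntervalMs;
    }

    public int getMaxRetries() { return maxRetries; }

    public long getRetryIntervalMs() { return retryIntervalMs; }

    /**
     * Runs the request, blocking and retrying on IOException (assumed here
     * to signal an unreachable server). Rethrows the last failure once the
     * retries are exhausted, so the caller sees the error instead of a
     * silent drop.
     */
    public <T> T callWithRetry(Callable<T> request) throws Exception {
        int retriesLeft = maxRetries;
        while (true) {
            try {
                return request.call();
            } catch (IOException e) {
                if (retriesLeft-- <= 0) {
                    throw e; // first attempt plus all retries have failed
                }
                Thread.sleep(retryIntervalMs); // wait before the next attempt
            }
        }
    }
}
```

With the defaults above and under these assumptions, a post to an unreachable 
server would block for roughly 30 seconds (30 retries at 1000 ms each) before 
the exception reaches the caller. An application that cannot afford to block 
that long could call customizeRetrySettings with smaller values.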

> Add retry for timeline client
> -----------------------------
>
>                 Key: YARN-2673
>                 URL: https://issues.apache.org/jira/browse/YARN-2673
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Li Lu
>            Assignee: Li Lu
>         Attachments: YARN-2673-101414.patch
>
>
> The timeline client currently does not gracefully handle the case where the 
> server is down. Jobs from distributed shell may fail due to an ATS restart. 
> We may need to add a retry mechanism to the client. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
