Re: Review Request 60714: aurora job restart request should be idempotent and retryable

Aurora ReviewBot Fri, 07 Jul 2017 13:36:22 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60714/#review179935
-----------------------------------------------------------



Ship it!




Master (a922b05) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot 
retry"

- Aurora ReviewBot


On July 7, 2017, 8:20 p.m., Kai Huang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60714/
> -----------------------------------------------------------
> 
> (Updated July 7, 2017, 8:20 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Mehrdad Nurolahzade, Santhosh 
> Kumar Shanmugham, and Zameer Manji.
> 
> 
> Bugs: AURORA-1940
>     https://issues.apache.org/jira/browse/AURORA-1940
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> aurora job restart request should be idempotent and retryable.
> 
> There was a recent change to the Aurora client to provide "at most once" 
> instead of "at least once" retries for non-idempotent operations. See: 
> https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6
> 
> Technically, `aurora job restart` is a non-idempotent operation, thus it was 
> not retried. However, when a transport exception occurs, the operator has to 
> babysit simple operations like aurora job restart if it were not retried. 
> Compared to the requests that were causing problems (admin tasks, job 
> creating, updates, etc.), restarts in general should be retried rather than 
> erring on the side of caution.
> 
> Job restart can be divided into three steps:
> - 1. get instance status (getTasksWithoutConfigs)
> - 2. restart shards (restartShards)
> - 3. watch instance until healthy (getTasksWithoutConfigs)
> 
> TTransport exception can be thrown at each of these step, ideally we should 
> make __ALL__ of the steps above __idempotent__ and retryable. 
> 
> However, given that the `watch` logic is also used in --wait options of job 
> create/kill command), making this step retryable will have an impact job 
> create/kill commands as well. IMO, it's probably OK for us to retry the watch 
> during job create/kill since the watch step is readonly.
> 
> In this CR, I will make the first __TWO__ steps retryable since they are 
> self-contained in job restart command.
> If people are OK with this strategy, I'll make the `watch` step retryable as 
> well.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/client/api/restarter.py 
> 6600c6b608ee70a02e11ca8550a06b7fb76fd863 
>   src/test/python/apache/aurora/api_util.py 
> bd6e3a6abb5a2eec621121fc1b4d547c54a9d19d 
>   src/test/python/apache/aurora/client/api/test_restarter.py 
> a81003e79e090bbfa151431366f5394f405d99eb 
>   src/test/python/apache/aurora/client/cli/test_restart.py 
> cb4adc55caec354d2cdf6a92ff2dd8a3b4a9eca8 
> 
> 
> Diff: https://reviews.apache.org/r/60714/diff/2/
> 
> 
> Testing
> -------
> 
> ./build-support/jenkins/build.sh
>  
>  To test the retry logic, random TTranport exceptions were thrown at the 
> client side(for all api calls to the scheduler proxy), aurora job restart 
> command was issued against scheduler, and the output was like:
> ```
>  vagrant@aurora:~$ aurora job restart devcluster/vagrant/test/hello
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Performing rolling restart of job devcluster/vagrant/test/hello 
> (instances: [0])
>  INFO] Restarting instances: [0]
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Watching instances: [0]
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Detected RUNNING instance 0
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Instance 0 has been up and healthy for at least 30 seconds
>  INFO] All instances were restarted successfully
>  Job devcluster/vagrant/test/hello restarted successfully
> ```
> 
> 
> Thanks,
> 
> Kai Huang
> 
>

Re: Review Request 60714: aurora job restart request should be idempotent and retryable

Reply via email to