Re: Review Request 60714: aurora job restart request should be idempotent and retryable

Mehrdad Nurolahzade Sun, 16 Jul 2017 15:42:26 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60714/#review180646
-----------------------------------------------------------



Ship it!




Ship It!

- Mehrdad Nurolahzade


On July 15, 2017, 11 p.m., Kai Huang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60714/
> -----------------------------------------------------------
> 
> (Updated July 15, 2017, 11 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Mehrdad Nurolahzade, Santhosh 
> Kumar Shanmugham, and Zameer Manji.
> 
> 
> Bugs: AURORA-1940
>     https://issues.apache.org/jira/browse/AURORA-1940
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> aurora job restart request should be idempotent and retryable.
> 
> There was a recent change to the Aurora client to provide "at most once" 
> instead of "at least once" retries for non-idempotent operations. See: 
> https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6
> 
> Technically, `aurora job restart` is a non-idempotent operation, thus it was 
> not retried. However, when a transport exception occurs, the operator has to 
> babysit simple operations like aurora job restart if it were not retried. 
> Compared to the requests that were causing problems (admin tasks, job 
> creating, updates, etc.), restarts in general should be retried rather than 
> erring on the side of caution.
> 
> Job restart can be divided into three steps:
> - 1. get instance status (getTasksWithoutConfigs)
> - 2. restart shards (restartShards)
> - 3. watch instance until healthy (getTasksWithoutConfigs)
> 
> TTransport exception can be thrown at each of these step, ideally we should 
> make __ALL__ of the steps above __idempotent__ and retryable. 
> 
> The only trickey part is that the `watch` logic is also used in --wait-until 
> options of job create/add command, making this step retryable will have an 
> impact on job create/add commands as well.
> 
> In this CR, I will make the first __TWO__ steps retryable since they are 
> self-contained in job restart command.
> If people are OK with this strategy, I'll make the `watch` step retryable as 
> well.
> 
> __Updates__:
> I made the `watch` step in `aurora job restart` retryable in the 3rd 
> revision. Note the `InstanceWatcher` in `Restarter` relies on 
> `StatusHelper.get_tasks` method to query the latest instance status. This 
> method is also invoked by JobMonitor during job create/add. 
> 
> The solution here is to pass a `retry` flag to this method, so that it can be 
> customized for different scenarios.
> 
> __Open Question__:
> Currently, `aurora job create --wait-until RUNNING` fails immediately if 
> there is any transport exception during the `wait` step. Do we need retry for 
> the `wait` step?
> 
> There are three phases where transport exception could be thrown:
> - Phase I: job create request was not sent to scheduler yet, we should retry 
> the command.
> - Phase II: job create request was already sent to scheduler, we should not 
> retry the command, and exit.
> - Phase III: job create command gets to the `wait` step, that means the job 
> create request must have already been sent to scheduler, we should not have 
> to retry the command.
> 
> So overall, we do not retry for the `wait` step. The same principle is 
> applicable to `job add` command.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/client/api/instance_watcher.py 
> a35fb22edc9a5bb12f286856b170f5cb0170eceb 
>   src/main/python/apache/aurora/client/api/restarter.py 
> 6600c6b608ee70a02e11ca8550a06b7fb76fd863 
>   src/main/python/apache/aurora/client/api/task_util.py 
> fb7c76f42171737701c740fddf893f07211a47a0 
>   src/test/python/apache/aurora/api_util.py 
> bd6e3a6abb5a2eec621121fc1b4d547c54a9d19d 
>   src/test/python/apache/aurora/client/api/test_instance_watcher.py 
> 8fd419f812c6e038e6f7fba0a662da0a6410b794 
>   src/test/python/apache/aurora/client/api/test_job_monitor.py 
> 537abd38cd13bcbb615c0224c983868b29027379 
>   src/test/python/apache/aurora/client/api/test_restarter.py 
> a81003e79e090bbfa151431366f5394f405d99eb 
>   src/test/python/apache/aurora/client/api/test_task_util.py 
> 365ef59857003402c4354e06abeaea947d49322b 
>   src/test/python/apache/aurora/client/cli/test_command_hooks.py 
> a44a25fb1e4357780adca4403b6010d160066b21 
>   src/test/python/apache/aurora/client/cli/test_create.py 
> 3b09bb25e919bac2795ccd56bd98657b1f98690b 
>   src/test/python/apache/aurora/client/cli/test_plugins.py 
> 762735e00bd8051ab64674e7ee359758d3cbeca7 
>   src/test/python/apache/aurora/client/cli/test_restart.py 
> cb4adc55caec354d2cdf6a92ff2dd8a3b4a9eca8 
> 
> 
> Diff: https://reviews.apache.org/r/60714/diff/3/
> 
> 
> Testing
> -------
> 
> ./build-support/jenkins/build.sh
>  
>  To test the retry logic, TTranport exceptions were randomly thrown at the 
> client side(for all api calls to the scheduler proxy, there is a 50% chance 
> of throwing TTranport exception), aurora job restart command was issued 
> against scheduler, and the output was like:
> ```
>  vagrant@aurora:~$ aurora job restart devcluster/vagrant/test/hello
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Performing rolling restart of job devcluster/vagrant/test/hello 
> (instances: [0])
>  INFO] Restarting instances: [0]
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Watching instances: [0]
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Detected RUNNING instance 0
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Instance 0 has been up and healthy for at least 30 seconds
>  INFO] All instances were restarted successfully
>  Job devcluster/vagrant/test/hello restarted successfully
> ```
> 
> __Updates__:
> 
> To ensure that job create/add/kill commands work as expected, TTranport 
> exceptions were randomly thrown at the client side, aurora job 
> create/add/kill commands were issued against scheduler:
> 
> __JOB CREATE__:
> Job create command succeeds when no transport exception was thrown. 
> ```
> vagrant@aurora:~$ aurora job create devcluster/vagrant/test/hello 
> /vagrant/hello.aurora
>  INFO] Creating job hello
>  INFO] Checking status of devcluster/vagrant/test/hello
> Job create succeeded: job 
> url=http://aurora.local:8081/scheduler/vagrant/test/hello
> ```
> 
> Job create command __retries__(getTasksWithoutConfigs) when transport 
> exception were thrown __BEFORE__ the job create request was sent to scheduler.
> ```
> vagrant@aurora:~$ aurora job create devcluster/vagrant/test/hello 
> /vagrant/hello.aurora
>  INFO] Creating job hello
>  INFO] Checking status of devcluster/vagrant/test/hello
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
> Job create succeeded: job 
> url=http://aurora.local:8081/scheduler/vagrant/test/hello
> ```
> 
> Job create command __does NOT retry__ when transport exception were thrown 
> __AFTER__ the job create request was sent to scheduler.
> ```
> vagrant@aurora:~$ aurora job create devcluster/vagrant/test/hello 
> /vagrant/hello.aurora
>  INFO] Creating job hello
> Transport error communicating with scheduler during non-idempotent operation: 
> Timed out talking to http://aurora.local:8081/api, not retrying
> ```
> 
> __JOB ADD__:
> Job add command succeeds when no transport exception was thrown. 
> ```
> vagrant@aurora:~$ aurora job add devcluster/vagrant/test/hello/0 1
>  INFO] Adding 1 instances to devcluster/vagrant/test/hello using the task 
> config of instance 0
> ```
> 
> Job add command __retries__(getTasksWithoutConfigs) when transport exception 
> were thrown __BEFORE__ the job add request was sent to scheduler.
> ```
> vagrant@aurora:~$ aurora job add devcluster/vagrant/test/hello/0 1
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Adding 1 instances to devcluster/vagrant/test/hello using the task 
> config of instance 0
> ```
> 
> Job add command __does NOT retry__ when transport exception were thrown 
> __AFTER__ the job add request was sent to scheduler.
> ```
> vagrant@aurora:~$ aurora job add devcluster/vagrant/test/hello/0 1
>  INFO] Adding 1 instances to devcluster/vagrant/test/hello using the task 
> config of instance 0
> Transport error communicating with scheduler during non-idempotent operation: 
> Timed out talking to http://aurora.local:8081/api, not retrying
> ```
> 
> __JOB KILL__:
> Job kill command succeeds when no transport exception was thrown. 
> ```
> vagrant@aurora:~$ aurora job killall devcluster/vagrant/test/hello
>  INFO] Killing tasks for job: devcluster/vagrant/test/hello
>  INFO] Instances to be killed: [0]
> Successfully killed instances [0]
> Job killall succeeded
> ```
> 
> Job kill command __retries__(getTasksWithoutConfigs) when transport exception 
> were thrown __BEFORE__ the job kill request was sent to scheduler.
> ```
> vagrant@aurora:~$ aurora job killall devcluster/vagrant/test/hello
>  WARN] Transport error communicating with scheduler: Timed out talking to 
> http://aurora.local:8081/api, retrying...
>  INFO] Killing tasks for job: devcluster/vagrant/test/hello
>  INFO] Instances to be killed: [0]
> Successfully killed instances [0]
> Job killall succeeded
> ```
> 
> Job kill command __does NOT retry__ when transport exception were thrown 
> __AFTER__ the job kill request was sent to scheduler.
> ```
> vagrant@aurora:~$ aurora job killall devcluster/vagrant/test/hello
>  INFO] Killing tasks for job: devcluster/vagrant/test/hello
>  INFO] Instances to be killed: [0]
> Transport error communicating with scheduler during non-idempotent operation: 
> Timed out talking to http://aurora.local:8081/api, not retrying
> ```
> 
> 
> Thanks,
> 
> Kai Huang
> 
>

Re: Review Request 60714: aurora job restart request should be idempotent and retryable

Reply via email to