Re: Review Request 60714: aurora job restart request should be idempotent and retryable

Kai Huang Sat, 15 Jul 2017 23:01:11 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60714/
-----------------------------------------------------------


(Updated July 16, 2017, 6 a.m.)


Review request for Aurora, David McLaughlin, Mehrdad Nurolahzade, Santhosh 
Kumar Shanmugham, and Zameer Manji.


Changes
-------

Update test results to show that job create/add/kill commands work as expected.


Bugs: AURORA-1940
    https://issues.apache.org/jira/browse/AURORA-1940


Repository: aurora


Description (updated)
-------

aurora job restart request should be idempotent and retryable.

There was a recent change to the Aurora client to provide "at most once" 
instead of "at least once" retries for non-idempotent operations. See: 
https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6

Technically, `aurora job restart` is a non-idempotent operation, thus it was 
not retried. However, when a transport exception occurs, the operator has to 
babysit simple operations like aurora job restart if it were not retried. 
Compared to the requests that were causing problems (admin tasks, job creating, 
updates, etc.), restarts in general should be retried rather than erring on the 
side of caution.

Job restart can be divided into three steps:
- 1. get instance status (getTasksWithoutConfigs)
- 2. restart shards (restartShards)
- 3. watch instance until healthy (getTasksWithoutConfigs)

TTransport exception can be thrown at each of these step, ideally we should 
make __ALL__ of the steps above __idempotent__ and retryable. 

The only trickey part is that the `watch` logic is also used in --wait-until 
options of job create/add command, making this step retryable will have an 
impact on job create/add commands as well.

In this CR, I will make the first __TWO__ steps retryable since they are 
self-contained in job restart command.
If people are OK with this strategy, I'll make the `watch` step retryable as 
well.

__Updates__:
I made the `watch` step in `aurora job restart` retryable in the 3rd revision. 
Note the `InstanceWatcher` in `Restarter` relies on `StatusHelper.get_tasks` 
method to query the latest instance status. This method is also invoked by 
JobMonitor during job create/add. 

The solution here is to pass a `retry` flag to this method, so that it can be 
customized for different scenarios.

__Open Question__:
Currently, `aurora job create --wait-until RUNNING` fails immediately if there 
is any transport exception during the `wait` step. Do we need retry for the 
`wait` step?

There are three phases where transport exception could be thrown:
- Phase I: job create request was not sent to scheduler yet, we should retry 
the command.
- Phase II: job create request was already sent to scheduler, we should not 
retry the command, and exit.
- Phase III: job create command gets to the `wait` step, that means the job 
create request must have already been sent to scheduler, we should not have to 
retry the command.

So overall, we do not retry for the `wait` step. The same principle is 
applicable to `job add` command.


Diffs
-----

  src/main/python/apache/aurora/client/api/instance_watcher.py 
a35fb22edc9a5bb12f286856b170f5cb0170eceb 
  src/main/python/apache/aurora/client/api/restarter.py 
6600c6b608ee70a02e11ca8550a06b7fb76fd863 
  src/main/python/apache/aurora/client/api/task_util.py 
fb7c76f42171737701c740fddf893f07211a47a0 
  src/test/python/apache/aurora/api_util.py 
bd6e3a6abb5a2eec621121fc1b4d547c54a9d19d 
  src/test/python/apache/aurora/client/api/test_instance_watcher.py 
8fd419f812c6e038e6f7fba0a662da0a6410b794 
  src/test/python/apache/aurora/client/api/test_job_monitor.py 
537abd38cd13bcbb615c0224c983868b29027379 
  src/test/python/apache/aurora/client/api/test_restarter.py 
a81003e79e090bbfa151431366f5394f405d99eb 
  src/test/python/apache/aurora/client/api/test_task_util.py 
365ef59857003402c4354e06abeaea947d49322b 
  src/test/python/apache/aurora/client/cli/test_command_hooks.py 
a44a25fb1e4357780adca4403b6010d160066b21 
  src/test/python/apache/aurora/client/cli/test_create.py 
3b09bb25e919bac2795ccd56bd98657b1f98690b 
  src/test/python/apache/aurora/client/cli/test_plugins.py 
762735e00bd8051ab64674e7ee359758d3cbeca7 
  src/test/python/apache/aurora/client/cli/test_restart.py 
cb4adc55caec354d2cdf6a92ff2dd8a3b4a9eca8 


Diff: https://reviews.apache.org/r/60714/diff/3/


Testing (updated)
-------

./build-support/jenkins/build.sh
 
 To test the retry logic, TTranport exceptions were randomly thrown at the 
client side(for all api calls to the scheduler proxy, there is a 50% chance of 
throwing TTranport exception), aurora job restart command was issued against 
scheduler, and the output was like:
```
 vagrant@aurora:~$ aurora job restart devcluster/vagrant/test/hello
 WARN] Transport error communicating with scheduler: Timed out talking to 
http://aurora.local:8081/api, retrying...
 WARN] Transport error communicating with scheduler: Timed out talking to 
http://aurora.local:8081/api, retrying...
 INFO] Performing rolling restart of job devcluster/vagrant/test/hello 
(instances: [0])
 INFO] Restarting instances: [0]
 WARN] Transport error communicating with scheduler: Timed out talking to 
http://aurora.local:8081/api, retrying...
 WARN] Transport error communicating with scheduler: Timed out talking to 
http://aurora.local:8081/api, retrying...
 INFO] Watching instances: [0]
 WARN] Transport error communicating with scheduler: Timed out talking to 
http://aurora.local:8081/api, retrying...
 INFO] Detected RUNNING instance 0
 WARN] Transport error communicating with scheduler: Timed out talking to 
http://aurora.local:8081/api, retrying...
 WARN] Transport error communicating with scheduler: Timed out talking to 
http://aurora.local:8081/api, retrying...
 WARN] Transport error communicating with scheduler: Timed out talking to 
http://aurora.local:8081/api, retrying...
 INFO] Instance 0 has been up and healthy for at least 30 seconds
 INFO] All instances were restarted successfully
 Job devcluster/vagrant/test/hello restarted successfully
```

__Updates__:

To ensure that job create/add/kill commands work as expected, TTranport 
exceptions were randomly thrown at the client side, aurora job create/add/kill 
commands were issued against scheduler:

__JOB CREATE__:
Job create command succeeds when no transport exception was thrown. 
```
vagrant@aurora:~$ aurora job create devcluster/vagrant/test/hello 
/vagrant/hello.aurora
 INFO] Creating job hello
 INFO] Checking status of devcluster/vagrant/test/hello
Job create succeeded: job 
url=http://aurora.local:8081/scheduler/vagrant/test/hello
```

Job create command __retries__(getTasksWithoutConfigs) when transport exception 
were thrown __BEFORE__ the job create request was sent to scheduler.
```
vagrant@aurora:~$ aurora job create devcluster/vagrant/test/hello 
/vagrant/hello.aurora
 INFO] Creating job hello
 INFO] Checking status of devcluster/vagrant/test/hello
 WARN] Transport error communicating with scheduler: Timed out talking to 
http://aurora.local:8081/api, retrying...
Job create succeeded: job 
url=http://aurora.local:8081/scheduler/vagrant/test/hello
```

Job create command __does NOT retry__ when transport exception were thrown 
__AFTER__ the job create request was sent to scheduler.
```
vagrant@aurora:~$ aurora job create devcluster/vagrant/test/hello 
/vagrant/hello.aurora
 INFO] Creating job hello
Transport error communicating with scheduler during non-idempotent operation: 
Timed out talking to http://aurora.local:8081/api, not retrying
```

__JOB ADD__:
Job add command succeeds when no transport exception was thrown. 
```
vagrant@aurora:~$ aurora job add devcluster/vagrant/test/hello/0 1
 INFO] Adding 1 instances to devcluster/vagrant/test/hello using the task 
config of instance 0
```

Job add command __retries__(getTasksWithoutConfigs) when transport exception 
were thrown __BEFORE__ the job add request was sent to scheduler.
```
vagrant@aurora:~$ aurora job add devcluster/vagrant/test/hello/0 1
 WARN] Transport error communicating with scheduler: Timed out talking to 
http://aurora.local:8081/api, retrying...
 INFO] Adding 1 instances to devcluster/vagrant/test/hello using the task 
config of instance 0
```

Job add command __does NOT retry__ when transport exception were thrown 
__AFTER__ the job add request was sent to scheduler.
```
vagrant@aurora:~$ aurora job add devcluster/vagrant/test/hello/0 1
 INFO] Adding 1 instances to devcluster/vagrant/test/hello using the task 
config of instance 0
Transport error communicating with scheduler during non-idempotent operation: 
Timed out talking to http://aurora.local:8081/api, not retrying
```

__JOB KILL__:
Job kill command succeeds when no transport exception was thrown. 
```
vagrant@aurora:~$ aurora job killall devcluster/vagrant/test/hello
 INFO] Killing tasks for job: devcluster/vagrant/test/hello
 INFO] Instances to be killed: [0]
Successfully killed instances [0]
Job killall succeeded
```

Job kill command __retries__(getTasksWithoutConfigs) when transport exception 
were thrown __BEFORE__ the job kill request was sent to scheduler.
```
vagrant@aurora:~$ aurora job killall devcluster/vagrant/test/hello
 WARN] Transport error communicating with scheduler: Timed out talking to 
http://aurora.local:8081/api, retrying...
 INFO] Killing tasks for job: devcluster/vagrant/test/hello
 INFO] Instances to be killed: [0]
Successfully killed instances [0]
Job killall succeeded
```

Job kill command __does NOT retry__ when transport exception were thrown 
__AFTER__ the job kill request was sent to scheduler.
```
vagrant@aurora:~$ aurora job killall devcluster/vagrant/test/hello
 INFO] Killing tasks for job: devcluster/vagrant/test/hello
 INFO] Instances to be killed: [0]
Transport error communicating with scheduler during non-idempotent operation: 
Timed out talking to http://aurora.local:8081/api, not retrying
```


Thanks,

Kai Huang

Re: Review Request 60714: aurora job restart request should be idempotent and retryable

Reply via email to