----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/60714/#review180646 -----------------------------------------------------------
Ship it! Ship It! - Mehrdad Nurolahzade On July 15, 2017, 11 p.m., Kai Huang wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/60714/ > ----------------------------------------------------------- > > (Updated July 15, 2017, 11 p.m.) > > > Review request for Aurora, David McLaughlin, Mehrdad Nurolahzade, Santhosh > Kumar Shanmugham, and Zameer Manji. > > > Bugs: AURORA-1940 > https://issues.apache.org/jira/browse/AURORA-1940 > > > Repository: aurora > > > Description > ------- > > aurora job restart request should be idempotent and retryable. > > There was a recent change to the Aurora client to provide "at most once" > instead of "at least once" retries for non-idempotent operations. See: > https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6 > > Technically, `aurora job restart` is a non-idempotent operation, thus it was > not retried. However, when a transport exception occurs, the operator has to > babysit simple operations like aurora job restart if it were not retried. > Compared to the requests that were causing problems (admin tasks, job > creating, updates, etc.), restarts in general should be retried rather than > erring on the side of caution. > > Job restart can be divided into three steps: > - 1. get instance status (getTasksWithoutConfigs) > - 2. restart shards (restartShards) > - 3. watch instance until healthy (getTasksWithoutConfigs) > > TTransport exception can be thrown at each of these step, ideally we should > make __ALL__ of the steps above __idempotent__ and retryable. > > The only trickey part is that the `watch` logic is also used in --wait-until > options of job create/add command, making this step retryable will have an > impact on job create/add commands as well. > > In this CR, I will make the first __TWO__ steps retryable since they are > self-contained in job restart command. > If people are OK with this strategy, I'll make the `watch` step retryable as > well. > > __Updates__: > I made the `watch` step in `aurora job restart` retryable in the 3rd > revision. Note the `InstanceWatcher` in `Restarter` relies on > `StatusHelper.get_tasks` method to query the latest instance status. This > method is also invoked by JobMonitor during job create/add. > > The solution here is to pass a `retry` flag to this method, so that it can be > customized for different scenarios. > > __Open Question__: > Currently, `aurora job create --wait-until RUNNING` fails immediately if > there is any transport exception during the `wait` step. Do we need retry for > the `wait` step? > > There are three phases where transport exception could be thrown: > - Phase I: job create request was not sent to scheduler yet, we should retry > the command. > - Phase II: job create request was already sent to scheduler, we should not > retry the command, and exit. > - Phase III: job create command gets to the `wait` step, that means the job > create request must have already been sent to scheduler, we should not have > to retry the command. > > So overall, we do not retry for the `wait` step. The same principle is > applicable to `job add` command. > > > Diffs > ----- > > src/main/python/apache/aurora/client/api/instance_watcher.py > a35fb22edc9a5bb12f286856b170f5cb0170eceb > src/main/python/apache/aurora/client/api/restarter.py > 6600c6b608ee70a02e11ca8550a06b7fb76fd863 > src/main/python/apache/aurora/client/api/task_util.py > fb7c76f42171737701c740fddf893f07211a47a0 > src/test/python/apache/aurora/api_util.py > bd6e3a6abb5a2eec621121fc1b4d547c54a9d19d > src/test/python/apache/aurora/client/api/test_instance_watcher.py > 8fd419f812c6e038e6f7fba0a662da0a6410b794 > src/test/python/apache/aurora/client/api/test_job_monitor.py > 537abd38cd13bcbb615c0224c983868b29027379 > src/test/python/apache/aurora/client/api/test_restarter.py > a81003e79e090bbfa151431366f5394f405d99eb > src/test/python/apache/aurora/client/api/test_task_util.py > 365ef59857003402c4354e06abeaea947d49322b > src/test/python/apache/aurora/client/cli/test_command_hooks.py > a44a25fb1e4357780adca4403b6010d160066b21 > src/test/python/apache/aurora/client/cli/test_create.py > 3b09bb25e919bac2795ccd56bd98657b1f98690b > src/test/python/apache/aurora/client/cli/test_plugins.py > 762735e00bd8051ab64674e7ee359758d3cbeca7 > src/test/python/apache/aurora/client/cli/test_restart.py > cb4adc55caec354d2cdf6a92ff2dd8a3b4a9eca8 > > > Diff: https://reviews.apache.org/r/60714/diff/3/ > > > Testing > ------- > > ./build-support/jenkins/build.sh > > To test the retry logic, TTranport exceptions were randomly thrown at the > client side(for all api calls to the scheduler proxy, there is a 50% chance > of throwing TTranport exception), aurora job restart command was issued > against scheduler, and the output was like: > ``` > vagrant@aurora:~$ aurora job restart devcluster/vagrant/test/hello > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Performing rolling restart of job devcluster/vagrant/test/hello > (instances: [0]) > INFO] Restarting instances: [0] > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Watching instances: [0] > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Detected RUNNING instance 0 > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Instance 0 has been up and healthy for at least 30 seconds > INFO] All instances were restarted successfully > Job devcluster/vagrant/test/hello restarted successfully > ``` > > __Updates__: > > To ensure that job create/add/kill commands work as expected, TTranport > exceptions were randomly thrown at the client side, aurora job > create/add/kill commands were issued against scheduler: > > __JOB CREATE__: > Job create command succeeds when no transport exception was thrown. > ``` > vagrant@aurora:~$ aurora job create devcluster/vagrant/test/hello > /vagrant/hello.aurora > INFO] Creating job hello > INFO] Checking status of devcluster/vagrant/test/hello > Job create succeeded: job > url=http://aurora.local:8081/scheduler/vagrant/test/hello > ``` > > Job create command __retries__(getTasksWithoutConfigs) when transport > exception were thrown __BEFORE__ the job create request was sent to scheduler. > ``` > vagrant@aurora:~$ aurora job create devcluster/vagrant/test/hello > /vagrant/hello.aurora > INFO] Creating job hello > INFO] Checking status of devcluster/vagrant/test/hello > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > Job create succeeded: job > url=http://aurora.local:8081/scheduler/vagrant/test/hello > ``` > > Job create command __does NOT retry__ when transport exception were thrown > __AFTER__ the job create request was sent to scheduler. > ``` > vagrant@aurora:~$ aurora job create devcluster/vagrant/test/hello > /vagrant/hello.aurora > INFO] Creating job hello > Transport error communicating with scheduler during non-idempotent operation: > Timed out talking to http://aurora.local:8081/api, not retrying > ``` > > __JOB ADD__: > Job add command succeeds when no transport exception was thrown. > ``` > vagrant@aurora:~$ aurora job add devcluster/vagrant/test/hello/0 1 > INFO] Adding 1 instances to devcluster/vagrant/test/hello using the task > config of instance 0 > ``` > > Job add command __retries__(getTasksWithoutConfigs) when transport exception > were thrown __BEFORE__ the job add request was sent to scheduler. > ``` > vagrant@aurora:~$ aurora job add devcluster/vagrant/test/hello/0 1 > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Adding 1 instances to devcluster/vagrant/test/hello using the task > config of instance 0 > ``` > > Job add command __does NOT retry__ when transport exception were thrown > __AFTER__ the job add request was sent to scheduler. > ``` > vagrant@aurora:~$ aurora job add devcluster/vagrant/test/hello/0 1 > INFO] Adding 1 instances to devcluster/vagrant/test/hello using the task > config of instance 0 > Transport error communicating with scheduler during non-idempotent operation: > Timed out talking to http://aurora.local:8081/api, not retrying > ``` > > __JOB KILL__: > Job kill command succeeds when no transport exception was thrown. > ``` > vagrant@aurora:~$ aurora job killall devcluster/vagrant/test/hello > INFO] Killing tasks for job: devcluster/vagrant/test/hello > INFO] Instances to be killed: [0] > Successfully killed instances [0] > Job killall succeeded > ``` > > Job kill command __retries__(getTasksWithoutConfigs) when transport exception > were thrown __BEFORE__ the job kill request was sent to scheduler. > ``` > vagrant@aurora:~$ aurora job killall devcluster/vagrant/test/hello > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Killing tasks for job: devcluster/vagrant/test/hello > INFO] Instances to be killed: [0] > Successfully killed instances [0] > Job killall succeeded > ``` > > Job kill command __does NOT retry__ when transport exception were thrown > __AFTER__ the job kill request was sent to scheduler. > ``` > vagrant@aurora:~$ aurora job killall devcluster/vagrant/test/hello > INFO] Killing tasks for job: devcluster/vagrant/test/hello > INFO] Instances to be killed: [0] > Transport error communicating with scheduler during non-idempotent operation: > Timed out talking to http://aurora.local:8081/api, not retrying > ``` > > > Thanks, > > Kai Huang > >