----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/60714/#review180637 -----------------------------------------------------------
Ship it! Ship It! - David McLaughlin On July 7, 2017, 11:52 p.m., Kai Huang wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/60714/ > ----------------------------------------------------------- > > (Updated July 7, 2017, 11:52 p.m.) > > > Review request for Aurora, David McLaughlin, Mehrdad Nurolahzade, Santhosh > Kumar Shanmugham, and Zameer Manji. > > > Bugs: AURORA-1940 > https://issues.apache.org/jira/browse/AURORA-1940 > > > Repository: aurora > > > Description > ------- > > aurora job restart request should be idempotent and retryable. > > There was a recent change to the Aurora client to provide "at most once" > instead of "at least once" retries for non-idempotent operations. See: > https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6 > > Technically, `aurora job restart` is a non-idempotent operation, thus it was > not retried. However, when a transport exception occurs, the operator has to > babysit simple operations like aurora job restart if it were not retried. > Compared to the requests that were causing problems (admin tasks, job > creating, updates, etc.), restarts in general should be retried rather than > erring on the side of caution. > > Job restart can be divided into three steps: > - 1. get instance status (getTasksWithoutConfigs) > - 2. restart shards (restartShards) > - 3. watch instance until healthy (getTasksWithoutConfigs) > > TTransport exception can be thrown at each of these step, ideally we should > make __ALL__ of the steps above __idempotent__ and retryable. > > The only trickey part is that the `watch` logic is also used in --wait-until > options of job create/add command, making this step retryable will have an > impact on job create/add commands as well. > > In this CR, I will make the first __TWO__ steps retryable since they are > self-contained in job restart command. > If people are OK with this strategy, I'll make the `watch` step retryable as > well. > > __Updates__: > I made the `watch` step in `aurora job restart` retryable in the 3rd > revision. Note the `InstanceWatcher` in `Restarter` relies on > `StatusHelper.get_tasks` method to query the latest instance status. This > method is also invoked by JobMonitor during job create/add. > > The solution here is to pass a `retry` flag to this method, so that it can be > customized for different scenarios. > > __Open Question__: > Currently, `aurora job create --wait-until RUNNING` fails immediately if > there is any transport exception during the `wait` step. Do we need retry for > the `wait` step? > > > Diffs > ----- > > src/main/python/apache/aurora/client/api/instance_watcher.py > a35fb22edc9a5bb12f286856b170f5cb0170eceb > src/main/python/apache/aurora/client/api/restarter.py > 6600c6b608ee70a02e11ca8550a06b7fb76fd863 > src/main/python/apache/aurora/client/api/task_util.py > fb7c76f42171737701c740fddf893f07211a47a0 > src/test/python/apache/aurora/api_util.py > bd6e3a6abb5a2eec621121fc1b4d547c54a9d19d > src/test/python/apache/aurora/client/api/test_instance_watcher.py > 8fd419f812c6e038e6f7fba0a662da0a6410b794 > src/test/python/apache/aurora/client/api/test_job_monitor.py > 537abd38cd13bcbb615c0224c983868b29027379 > src/test/python/apache/aurora/client/api/test_restarter.py > a81003e79e090bbfa151431366f5394f405d99eb > src/test/python/apache/aurora/client/api/test_task_util.py > 365ef59857003402c4354e06abeaea947d49322b > src/test/python/apache/aurora/client/cli/test_command_hooks.py > a44a25fb1e4357780adca4403b6010d160066b21 > src/test/python/apache/aurora/client/cli/test_create.py > 3b09bb25e919bac2795ccd56bd98657b1f98690b > src/test/python/apache/aurora/client/cli/test_plugins.py > 762735e00bd8051ab64674e7ee359758d3cbeca7 > src/test/python/apache/aurora/client/cli/test_restart.py > cb4adc55caec354d2cdf6a92ff2dd8a3b4a9eca8 > > > Diff: https://reviews.apache.org/r/60714/diff/3/ > > > Testing > ------- > > ./build-support/jenkins/build.sh > > To test the retry logic, TTranport exceptions were randomly thrown at the > client side(for all api calls to the scheduler proxy, there is a 50% chance > of throwing TTranport exception), aurora job restart command was issued > against scheduler, and the output was like: > ``` > vagrant@aurora:~$ aurora job restart devcluster/vagrant/test/hello > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Performing rolling restart of job devcluster/vagrant/test/hello > (instances: [0]) > INFO] Restarting instances: [0] > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Watching instances: [0] > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Detected RUNNING instance 0 > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Instance 0 has been up and healthy for at least 30 seconds > INFO] All instances were restarted successfully > Job devcluster/vagrant/test/hello restarted successfully > ``` > > > Thanks, > > Kai Huang > >