----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/60714/#review179935 -----------------------------------------------------------
Ship it! Master (a922b05) is green with this patch. ./build-support/jenkins/build.sh I will refresh this build result if you post a review containing "@ReviewBot retry" - Aurora ReviewBot On July 7, 2017, 8:20 p.m., Kai Huang wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/60714/ > ----------------------------------------------------------- > > (Updated July 7, 2017, 8:20 p.m.) > > > Review request for Aurora, David McLaughlin, Mehrdad Nurolahzade, Santhosh > Kumar Shanmugham, and Zameer Manji. > > > Bugs: AURORA-1940 > https://issues.apache.org/jira/browse/AURORA-1940 > > > Repository: aurora > > > Description > ------- > > aurora job restart request should be idempotent and retryable. > > There was a recent change to the Aurora client to provide "at most once" > instead of "at least once" retries for non-idempotent operations. See: > https://github.com/apache/aurora/commit/f1e25375def5a047da97d8bdfb47a3a9101568f6 > > Technically, `aurora job restart` is a non-idempotent operation, thus it was > not retried. However, when a transport exception occurs, the operator has to > babysit simple operations like aurora job restart if it were not retried. > Compared to the requests that were causing problems (admin tasks, job > creating, updates, etc.), restarts in general should be retried rather than > erring on the side of caution. > > Job restart can be divided into three steps: > - 1. get instance status (getTasksWithoutConfigs) > - 2. restart shards (restartShards) > - 3. watch instance until healthy (getTasksWithoutConfigs) > > TTransport exception can be thrown at each of these step, ideally we should > make __ALL__ of the steps above __idempotent__ and retryable. > > However, given that the `watch` logic is also used in --wait options of job > create/kill command), making this step retryable will have an impact job > create/kill commands as well. IMO, it's probably OK for us to retry the watch > during job create/kill since the watch step is readonly. > > In this CR, I will make the first __TWO__ steps retryable since they are > self-contained in job restart command. > If people are OK with this strategy, I'll make the `watch` step retryable as > well. > > > Diffs > ----- > > src/main/python/apache/aurora/client/api/restarter.py > 6600c6b608ee70a02e11ca8550a06b7fb76fd863 > src/test/python/apache/aurora/api_util.py > bd6e3a6abb5a2eec621121fc1b4d547c54a9d19d > src/test/python/apache/aurora/client/api/test_restarter.py > a81003e79e090bbfa151431366f5394f405d99eb > src/test/python/apache/aurora/client/cli/test_restart.py > cb4adc55caec354d2cdf6a92ff2dd8a3b4a9eca8 > > > Diff: https://reviews.apache.org/r/60714/diff/2/ > > > Testing > ------- > > ./build-support/jenkins/build.sh > > To test the retry logic, random TTranport exceptions were thrown at the > client side(for all api calls to the scheduler proxy), aurora job restart > command was issued against scheduler, and the output was like: > ``` > vagrant@aurora:~$ aurora job restart devcluster/vagrant/test/hello > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Performing rolling restart of job devcluster/vagrant/test/hello > (instances: [0]) > INFO] Restarting instances: [0] > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Watching instances: [0] > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Detected RUNNING instance 0 > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > WARN] Transport error communicating with scheduler: Timed out talking to > http://aurora.local:8081/api, retrying... > INFO] Instance 0 has been up and healthy for at least 30 seconds > INFO] All instances were restarted successfully > Job devcluster/vagrant/test/hello restarted successfully > ``` > > > Thanks, > > Kai Huang > >