The River QA tests definitely do not exercise any RetryTask retries. This
is a serious issue, because tasks can be reordered during retries in
ways that would not be permitted if every RetryTask succeeded on its
first attempt.
I am thinking about ways of forcing retries for test coverage. I have
two main strategies in mind:
1. Add a configuration parameter that makes RetryTask skip the payload
code on the first attempt for some percentage of tasks, and treat
those tasks as having failed once.
2. Keep the tests as they are, but add a high-priority workload that
causes random periods of extreme load lasting minutes at a time, so that
operations may time out naturally. This may also shake loose other
concurrency bugs, if they exist.
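
To make strategy 1 concrete, here is a rough sketch of what I have in mind. It is deliberately standalone and does not touch the real RetryTask class: ForcedFirstFailure, failFraction, and the tryOnce() method on this wrapper are hypothetical names made up for illustration, and the payload is assumed to return true when it completes and false when it needs a retry. The real change would read the fraction from configuration instead of taking it as a constructor argument.

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical test-only wrapper (not part of River). For a configurable
 * fraction of tasks it reports a failure on the first attempt without
 * running the payload, so the caller has to schedule a retry.
 */
final class ForcedFirstFailure {
    private final BooleanSupplier payload;
    // True while the next attempt should be skipped and reported as failed.
    private final AtomicBoolean skipNext;

    /**
     * @param payload      the real work; returns true when done (assumed convention)
     * @param failFraction fraction of tasks to fail once, from 0.0 to 1.0
     * @param rng          seeded by the test so that runs are reproducible
     */
    ForcedFirstFailure(BooleanSupplier payload, double failFraction, Random rng) {
        this.payload = payload;
        // nextDouble() is in [0.0, 1.0), so 0.0 never forces and 1.0 always does.
        this.skipNext = new AtomicBoolean(rng.nextDouble() < failFraction);
    }

    /** Returns true when the task is done, false when it should be retried. */
    boolean tryOnce() {
        if (skipNext.compareAndSet(true, false)) {
            return false; // forced failure: the payload was not run
        }
        return payload.getAsBoolean();
    }
}
```

With failFraction set to 0.0 the wrapper is a pass-through, so the existing tests would behave as they do today, and a fixed seed would make any reordering failure repeatable.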
Any other ideas? Preferences? Comments?
Patricia