klsince opened a new pull request, #11740:
URL: https://github.com/apache/pinot/pull/11740
Currently, a user-triggered table rebalance runs on a best-effort basis. It
can fail if the controller running it is restarted, or if unstable servers
cause the rebalance to time out while waiting for the external view to
converge with the ideal state.
This PR adds support for retrying failed table rebalances:
1. Extended the rebalance job status update mechanism to help detect failed
rebalances. A rebalance job is considered failed if it is in FAILED status,
or in IN_PROGRESS status but has not sent a heartbeat for too long.
2. Added a controller periodic task, RebalanceChecker, to detect failed
rebalance jobs and retry them.
3. Added configs to disable the checker, to enable it to emit failure metrics
only, or to enable it and allow it to kick off retries.
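
The failure rule in item 1 can be sketched as follows. This is a minimal illustration rather than the code in this PR: `RebalanceFailureSketch`, `JobStatus`, and `isFailed` are hypothetical names, and the sketch assumes the checker compares the job's last heartbeat timestamp against a timeout.

```java
// Hypothetical sketch; none of these names are taken from the Pinot codebase.
final class RebalanceFailureSketch {
  enum JobStatus { IN_PROGRESS, DONE, FAILED }

  /**
   * Returns true if the job should be treated as failed: either it reported
   * FAILED itself, or it claims IN_PROGRESS but its last heartbeat is older
   * than the allowed timeout (e.g. the controller running it was restarted).
   */
  static boolean isFailed(JobStatus status, long lastHeartbeatMs, long nowMs,
      long heartbeatTimeoutMs) {
    if (status == JobStatus.FAILED) {
      return true;
    }
    if (status == JobStatus.IN_PROGRESS) {
      return nowMs - lastHeartbeatMs > heartbeatTimeoutMs;
    }
    return false; // DONE jobs are never retried in this sketch
  }

  public static void main(String[] args) {
    long timeoutMs = 3_600_000L; // 1hr
    long now = 10_000_000L;
    System.out.println(isFailed(JobStatus.FAILED, now, now, timeoutMs));                  // true
    System.out.println(isFailed(JobStatus.IN_PROGRESS, now - 3_600_001L, now, timeoutMs)); // true
    System.out.println(isFailed(JobStatus.IN_PROGRESS, now - 60_000L, now, timeoutMs));    // false
    System.out.println(isFailed(JobStatus.DONE, 0L, now, timeoutMs));                      // false
  }
}
```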
## Release note ##
New configs for the RebalanceChecker periodic task:
1. controller.rebalance.checker.frequencyPeriod: 5min by default, with -1 to
disable it
2. controller.rebalanceChecker.initialDelayInSeconds: 2min+ by default
3. controller.rebalanceChecker.checkOnly: true by default, so the checker
only detects failures and emits failure metrics without retrying
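
As a rough illustration of how these settings combine into the three checker behaviors (disabled, metrics only, retry), consider the sketch below. `CheckerModeSketch`, `Mode`, and `resolve` are hypothetical names, the frequency is represented as a plain number of seconds for simplicity, and treating every negative value like -1 is an assumption.

```java
// Hypothetical sketch; the real config parsing in the controller may differ.
final class CheckerModeSketch {
  enum Mode { DISABLED, CHECK_ONLY, CHECK_AND_RETRY }

  /** Maps the two settings onto the checker's effective behavior. */
  static Mode resolve(long frequencySeconds, boolean checkOnly) {
    if (frequencySeconds < 0) {
      return Mode.DISABLED; // assumption: any negative value behaves like -1
    }
    return checkOnly ? Mode.CHECK_ONLY : Mode.CHECK_AND_RETRY;
  }

  public static void main(String[] args) {
    System.out.println(resolve(-1, true));   // DISABLED
    System.out.println(resolve(300, true));  // CHECK_ONLY
    System.out.println(resolve(300, false)); // CHECK_AND_RETRY
  }
}
```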
New configs added for RebalanceConfig:
1. heartbeatIntervalInMs: 300_000 i.e. 5min
2. heartbeatTimeoutInMs: 3600_000 i.e. 1hr
3. maxRetry: 3 by default
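
To make the numbers concrete, here is a small sketch using the values listed above. `RetryBudgetSketch` and its members are hypothetical names, and the rule that a retry is allowed only while the number of retries already made is below maxRetry is an assumption about how the limit is applied.

```java
// Hypothetical sketch; constants mirror the values in the release note.
final class RetryBudgetSketch {
  static final long HEARTBEAT_INTERVAL_MS = 300_000L;  // 5min
  static final long HEARTBEAT_TIMEOUT_MS = 3_600_000L; // 1hr
  static final int MAX_RETRY = 3;

  /** Assumed rule: another retry is allowed while fewer than maxRetry were made. */
  static boolean canRetry(int retriesSoFar, int maxRetry) {
    return retriesSoFar < maxRetry;
  }

  /** Number of whole heartbeat intervals that fit into the timeout window. */
  static long intervalsPerTimeout(long intervalMs, long timeoutMs) {
    return timeoutMs / intervalMs;
  }

  public static void main(String[] args) {
    System.out.println(intervalsPerTimeout(HEARTBEAT_INTERVAL_MS, HEARTBEAT_TIMEOUT_MS)); // 12
    System.out.println(canRetry(0, MAX_RETRY)); // true
    System.out.println(canRetry(3, MAX_RETRY)); // false
  }
}
```

With these values the timeout spans 12 heartbeat intervals, so a job is only flagged as stuck after missing many consecutive heartbeats rather than a single one.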
New metrics to monitor rebalances and their retries:
1. TABLE_REBALANCE_FAILURE("TableRebalanceFailure", false), emitted from
TableRebalancer.rebalanceTable()
2. TABLE_REBALANCE_EXECUTION_TIME_MS("tableRebalanceExecutionTimeMs",
false), emitted from TableRebalancer.rebalanceTable()
3. TABLE_REBALANCE_FAILURE_DETECTED("TableRebalanceFailureDetected", false),
emitted from RebalanceChecker
4. TABLE_REBALANCE_RETRY("TableRebalanceRetry", false), emitted from
RebalanceChecker