klsince opened a new pull request, #11740:
URL: https://github.com/apache/pinot/pull/11740

   Currently, a user-triggered table rebalance runs on a best-effort basis. It 
can fail if the controller running it is restarted, or if unstable servers 
cause the rebalance to time out while waiting for the external view to 
converge with the ideal state.
   
   This PR adds support for retrying failed table rebalances:
   1. Extended the rebalance job status update mechanism to help detect failed 
rebalances. A rebalance job is considered failed if its status is FAILED, or 
if it is IN_PROGRESS but has not sent a heartbeat for too long.
   2. Added a controller periodic task, RebalanceChecker, to detect failed 
rebalance jobs and retry them.
   3. Added configs to disable the checker, to have it only emit failure 
metrics, or to have it also kick off retries.
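
The failure rule in item 1 can be sketched as a small predicate. This is only an illustration of the rule as described above, not the code in this PR: the class, enum, and method names are hypothetical, and the strict "older than the timeout" comparison is an assumption.

```java
/** Hypothetical sketch; names are illustrative and not taken from the Pinot code base. */
final class RebalanceFailureSketch {
  enum JobStatus { IN_PROGRESS, DONE, FAILED }

  /**
   * Returns true if the job should be treated as failed: either it reported FAILED,
   * or it is IN_PROGRESS but its last heartbeat is older than the timeout.
   */
  static boolean isFailed(JobStatus status, long lastHeartbeatMs, long nowMs, long heartbeatTimeoutMs) {
    if (status == JobStatus.FAILED) {
      return true;
    }
    // Assumption: a heartbeat exactly at the timeout boundary still counts as alive.
    return status == JobStatus.IN_PROGRESS && nowMs - lastHeartbeatMs > heartbeatTimeoutMs;
  }

  public static void main(String[] args) {
    long timeoutMs = 3_600_000L; // 1hr, matching the heartbeatTimeoutInMs default below
    long now = System.currentTimeMillis();
    // Job whose controller stopped sending heartbeats two hours ago.
    System.out.println(isFailed(JobStatus.IN_PROGRESS, now - 2 * timeoutMs, now, timeoutMs));
    // Job that sent a heartbeat one second ago.
    System.out.println(isFailed(JobStatus.IN_PROGRESS, now - 1_000L, now, timeoutMs));
  }
}
```

Treating a stale IN_PROGRESS job as failed is what lets the checker catch a rebalance whose controller was restarted and therefore never got to record a FAILED status.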
   
   
   ## Release note ##
   New configs for the RebalanceChecker periodic task:
   1. controller.rebalance.checker.frequencyPeriod: 5min by default, with -1 to 
disable it
   2. controller.rebalanceChecker.initialDelayInSeconds: 2min+ by default
   3. controller.rebalanceChecker.checkOnly: true by default, so the checker 
only detects failures and emits failure metrics without retrying
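
A rough sketch of how these settings might map to the three checker behaviours (disabled, metrics only, retry). The names here are hypothetical, the frequency is assumed to be already parsed into seconds, and treating any negative value as "disabled" is an assumption based on the "-1 to disable" note.

```java
/** Hypothetical sketch of the checker modes; not the actual ControllerConf logic. */
final class CheckerModeSketch {
  enum Mode { DISABLED, CHECK_ONLY, CHECK_AND_RETRY }

  /**
   * @param frequencySeconds checker period in seconds; negative means disabled (assumption)
   * @param checkOnly        value of the checkOnly config
   */
  static Mode resolve(long frequencySeconds, boolean checkOnly) {
    if (frequencySeconds < 0) {
      return Mode.DISABLED;
    }
    // checkOnly=true: detect failures and emit metrics, but do not start retries.
    return checkOnly ? Mode.CHECK_ONLY : Mode.CHECK_AND_RETRY;
  }

  public static void main(String[] args) {
    System.out.println(resolve(300, true));   // defaults: 5min period, checkOnly=true
    System.out.println(resolve(300, false));  // retries enabled
    System.out.println(resolve(-1, false));   // checker turned off
  }
}
```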
   
   New configs added for RebalanceConfig:
   1. heartbeatIntervalInMs: 300_000 i.e. 5min
   2. heartbeatTimeoutInMs: 3600_000 i.e. 1hr
   3. maxRetry: 3 by default
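
To illustrate how maxRetry could bound retries, here is a minimal sketch. It assumes maxRetry counts retries after the original attempt and that a retry is allowed while fewer than maxRetry retries have been made; the PR text does not spell out these semantics, and the names are hypothetical.

```java
/** Hypothetical sketch of a retry budget check; semantics are assumed, not confirmed. */
final class RetryBudgetSketch {
  /** Returns true if another retry may be started under the given budget. */
  static boolean canRetry(int retriesSoFar, int maxRetry) {
    return retriesSoFar < maxRetry;
  }

  public static void main(String[] args) {
    int maxRetry = 3; // default listed above
    for (int retries = 0; retries <= maxRetry; retries++) {
      System.out.println("retries so far=" + retries + " canRetry=" + canRetry(retries, maxRetry));
    }
  }
}
```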
   
   New metrics to monitor rebalance and its retries:
   1. TABLE_REBALANCE_FAILURE("TableRebalanceFailure", false), emitted from 
TableRebalancer.rebalanceTable()
   2. TABLE_REBALANCE_EXECUTION_TIME_MS("tableRebalanceExecutionTimeMs", 
false), emitted from TableRebalancer.rebalanceTable()
   3. TABLE_REBALANCE_FAILURE_DETECTED("TableRebalanceFailureDetected", false), 
emitted from RebalanceChecker
   4. TABLE_REBALANCE_RETRY("TableRebalanceRetry", false), emitted from 
RebalanceChecker
   

