klsince commented on PR #11740: URL: https://github.com/apache/pinot/pull/11740#issuecomment-1747888540
> One concern I have is: if the initial rebalance does not succeed because too many helix messages flood servers, retry will increase workload on servers, which may lead to more severe problems. To alleviate this issue,
>
> 1. is there a way to stop the failed or running table rebalance?
> 2. is there a way to drop existing table rebalance helix messages?

Good questions!

For 1: I added a sanity check that runs when a rebalance job updates its progress status in ZK. If the job's status is FAILED, the job aborts itself. With that, before kicking off a retry, the RebalanceChecker aborts existing jobs by setting their status to FAILED.

For 2: I'm not aware of a way to clean up helix messages. Because a retry continues the rebalance from where the ideal state currently is, there shouldn't be too many redundant helix transition messages, iiuc.

But your concern reminds me that I should add retry backoff and jitter, so that the RebalanceChecker doesn't retry too soon after the last failure and avoids retrying many tables all at once and flooding the servers.
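To make the two ideas above concrete, here is a rough sketch of the abort-on-FAILED check and of a retry delay with exponential backoff plus jitter. This is not the code in the PR: the class, method, and parameter names are all hypothetical, and an in-memory map stands in for the job metadata that would really live in ZK.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch; names do not correspond to actual Pinot classes.
class RebalanceRetrySketch {
  enum Status { IN_PROGRESS, DONE, FAILED }

  // Stand-in for the job status kept in ZK: job id -> status.
  static final Map<String, Status> JOB_STATUS = new ConcurrentHashMap<>();

  /**
   * Called by a running job when it records progress.
   * Returns false if the job was already marked FAILED, in which case
   * the stored status is left untouched and the caller should abort.
   */
  static boolean updateProgress(String jobId, Status newStatus) {
    boolean[] aborted = {false};
    // compute() makes the check-then-write atomic for this key.
    JOB_STATUS.compute(jobId, (id, current) -> {
      if (current == Status.FAILED) {
        aborted[0] = true;
        return current;
      }
      return newStatus;
    });
    return !aborted[0];
  }

  /** Called by the checker before a retry to stop any previous job. */
  static void abortJob(String jobId) {
    JOB_STATUS.put(jobId, Status.FAILED);
  }

  /**
   * Delay before retry number `attempt` (1-based): the base delay doubles
   * per attempt up to maxMs, then a random extra of up to
   * jitterFraction * backoff is added so tables do not retry in lockstep.
   */
  static long retryDelayMs(int attempt, long baseMs, long maxMs, double jitterFraction) {
    long backoff = baseMs;
    for (int i = 1; i < attempt && backoff < maxMs; i++) {
      backoff *= 2;
    }
    backoff = Math.min(backoff, maxMs);
    long jitter = (long) (backoff * jitterFraction * ThreadLocalRandom.current().nextDouble());
    return backoff + jitter;
  }

  public static void main(String[] args) {
    updateProgress("job1", Status.IN_PROGRESS);
    abortJob("job1");
    // The old job notices FAILED on its next progress update and stops.
    System.out.println("old job may continue: " + updateProgress("job1", Status.IN_PROGRESS));
    for (int attempt = 1; attempt <= 4; attempt++) {
      System.out.println("attempt " + attempt + " delay ms: " + retryDelayMs(attempt, 1000, 8000, 0.5));
    }
  }
}
```

The jitter here is additive and one-sided, which keeps the minimum delay at the plain backoff value; other jitter schemes would work as well, and the base, cap, and jitter fraction would presumably be configurable.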
