klsince commented on PR #11740:
URL: https://github.com/apache/pinot/pull/11740#issuecomment-1747888540

   > One concern I have is: if the initial rebalance does not succeed because
too many helix messages flood servers, retry will increase workload on servers,
which may lead to more severe problems. To alleviate this issue,
   > 
   > 1. is there a way to stop the failed or running table rebalance?
   > 2. is there a way to drop existing table rebalance helix messages?
   
   Good questions!
   For 1: I added a sanity check that runs when a rebalance job updates its
job progress status in ZK. If the job progress is FAILED, the job aborts
itself. With that, before kicking off a retry, the RebalanceChecker aborts
existing jobs by setting their status to FAILED.
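
   The abort handshake described above can be sketched roughly as follows.
This is only an illustration, not the code in the PR: the class
`RebalanceJobSketch`, the `Status` enum, and the in-memory `JOB_STORE` map
(standing in for the job record in ZK) are all hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; these are NOT Pinot's actual classes.
class RebalanceJobSketch {
    enum Status { IN_PROGRESS, DONE, FAILED }

    // jobId -> status. An in-memory map stands in for the job record kept in ZK.
    static final Map<String, Status> JOB_STORE = new ConcurrentHashMap<>();

    // Called by a running job whenever it reports progress.
    // Returns false when the stored status is already FAILED, which
    // tells the job to stop instead of overwriting the status.
    static boolean reportProgress(String jobId, Status newStatus) {
        Status current = JOB_STORE.get(jobId);
        if (current == Status.FAILED) {
            return false; // the job was aborted externally; do not continue
        }
        // Note: get-then-put is not atomic. A real store would need an
        // atomic check-and-set here.
        JOB_STORE.put(jobId, newStatus);
        return true;
    }

    // Called by the checker before it starts a retry, so that any
    // still-running job for the table stops at its next progress update.
    static void abortJob(String jobId) {
        JOB_STORE.put(jobId, Status.FAILED);
    }
}
```

   In this sketch the old job only notices the abort at its next progress
update, so a retry may briefly overlap with the job it replaces.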
   
   For 2: I'm not aware of a way to clean up Helix messages. Because a retry
continues the rebalance from where the ideal state currently is, there
shouldn't be too many redundant Helix transition messages, if I understand
correctly.
   
   But your concern reminds me that I should add retry backoff and jitter so
that the RebalanceChecker doesn't retry too soon after the last failure and
avoids retrying many tables all at once, which would flood the servers.
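
   One possible shape for the backoff with jitter is capped exponential
backoff plus a random offset. This is a minimal sketch under assumptions: the
class `RetryBackoffSketch`, the method `nextDelayMs`, its parameters, and the
choice of up to 50% extra jitter are all hypothetical, not what the PR
implements.

```java
import java.util.Random;

// Hypothetical sketch; names and constants are illustrative only.
class RetryBackoffSketch {
    // Returns the delay before retry number `attempt` (1-based).
    // base   = initialDelayMs * 2^(attempt - 1), capped at maxDelayMs
    // result = base + random jitter in [0, base / 2)
    static long nextDelayMs(int attempt, long initialDelayMs, long maxDelayMs, Random random) {
        // Clamp the shift so that the left shift cannot overflow for
        // realistic initial delays (seconds to minutes).
        int shift = Math.min(Math.max(attempt - 1, 0), 30);
        long base = Math.min(maxDelayMs, initialDelayMs << shift);
        // Jitter spreads out retries of tables that failed at the same time.
        long jitter = (long) (random.nextDouble() * (base / 2.0));
        return base + jitter;
    }
}
```

   With `initialDelayMs = 1000` and `maxDelayMs = 60000`, the first retry
would wait between 1.0 and 1.5 seconds and later retries would level off
between 60 and 90 seconds, so tables that failed together are unlikely to
retry at the same instant.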


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

