klsince commented on code in PR #11740: URL: https://github.com/apache/pinot/pull/11740#discussion_r1353180549
##########
pinot-controller/src/main/java/org/apache/pinot/controller/helix/core/rebalance/RebalanceConfig.java:
##########
@@ -84,6 +84,24 @@ public class RebalanceConfig {
   @ApiModelProperty(example = "false")
   private boolean _updateTargetTier = false;

+  // Update job status every this interval as heartbeat, to indicate the job is still actively running.
+  @JsonProperty("heartbeatIntervalInMs")
+  @ApiModelProperty(example = "300000")
+  private long _heartbeatIntervalInMs = 300000L;
+
+  // The job is considered as failed if not updating its status by this timeout, even though it's IN_PROGRESS status.
+  @JsonProperty("heartbeatTimeoutInMs")
+  @ApiModelProperty(example = "3600000")
+  private long _heartbeatTimeoutInMs = 3600000L;
+
+  @JsonProperty("maxRetry")
+  @ApiModelProperty(example = "3")
+  private int _maxRetry = 3;
+
+  @JsonProperty("retryInitialDelayInMs")
+  @ApiModelProperty(example = "300000")
+  private long _retryInitialDelayInMs = 300000L;

Review Comment:
   I think this is still useful to avoid a herd effect, particularly for clusters with many tables. If many tables are rebalancing and all of them fail, for example because the controllers were restarted, then all of the rebalances would be retried at once the next time the RebalanceChecker task runs. If desired, setting this to 0 disables the backoff mechanism.
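To make the herd-effect concern concrete, here is a minimal sketch of how a retry delay could be derived from `retryInitialDelayInMs`. It is not the code in this PR: the class and method names are hypothetical, and the exponential doubling and the random jitter are assumptions chosen for illustration. Only the config name and the "0 disables backoff" behavior come from the comment above.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration only; not part of Pinot.
final class RetryBackoffSketch {
  // Cap the exponent so the left shift cannot overflow for reasonable initial delays.
  private static final int MAX_SHIFT = 16;

  private RetryBackoffSketch() {
  }

  // Base delay before the given 1-based attempt: initialDelayMs * 2^(attempt - 1).
  // An initial delay of 0 (or less) disables the backoff and always yields 0.
  static long baseDelayMs(long initialDelayMs, int attempt) {
    if (initialDelayMs <= 0) {
      return 0L;
    }
    int shift = Math.min(Math.max(attempt - 1, 0), MAX_SHIFT);
    return initialDelayMs << shift;
  }

  // Base delay plus a random extra in [0, base), so that jobs that failed
  // together do not all become eligible for retry at the same instant.
  static long jitteredDelayMs(long initialDelayMs, int attempt) {
    long base = baseDelayMs(initialDelayMs, attempt);
    if (base == 0L) {
      return 0L;
    }
    return base + ThreadLocalRandom.current().nextLong(base);
  }
}
```

With the default of 300000 ms, this sketch would give base delays of 5, 10, and 20 minutes for attempts 1 through 3, each stretched by a random amount below the base. A periodic checker could then skip any failed job whose last failure time plus this delay is still in the future.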