[ https://issues.apache.org/jira/browse/ZOOKEEPER-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16256099#comment-16256099 ]
ASF GitHub Bot commented on ZOOKEEPER-2849:
-------------------------------------------

Github user afine commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/419#discussion_r151553425

    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/ExponentialBackoffStrategy.java ---
    @@ -0,0 +1,192 @@
    +package org.apache.zookeeper.server.quorum;
    +
    +/**
    + * A {@link BackoffStrategy} that increases the wait time between each
    + * interval up to the configured maximum wait time.
    + */
    +public class ExponentialBackoffStrategy implements BackoffStrategy {
    +
    +    // Sensible default values to use if not set by the user
    +    private static final long DEFAULT_INITIAL_BACKOFF_MILLIS = 500L; // 0.5s
    +    private static final long DEFAULT_MAX_BACKOFF_MILLIS = 30_000L; // 30s
    +    private static final long DEFAULT_MAX_ELAPSED_MILLIS = 5 * 60_000L; // 10m
    +    private static final double DEFAULT_BACKOFF_MULTIPLE = 1.5;
    +
    +    // internal values per instance
    +    private final long initialBackoffMillis;
    +    private final long maxBackoffMillis;
    +    private final long maxElapsedMillis;
    +    private final double backoffMultiple;
    +
    +    // internal state
    +    private long nextWait;
    +    private long totalElapsed;
    +    private final boolean limitBackoffMillis;
    +    private final boolean checkElapsedTime;
    +
    +    /**
    +     * Construct a new instance.
    +     * @param builder the Builder to use for configuring this BackoffStrategy
    +     */
    +    private ExponentialBackoffStrategy(Builder builder) {
    +        this.initialBackoffMillis = builder.initialBackoffMillis;
    +        this.maxBackoffMillis = builder.maxBackoffMillis;
    +        this.maxElapsedMillis = builder.maxElapsedMillis;
    +        this.backoffMultiple = builder.backoffMultiple;
    +
    +        if(maxBackoffMillis == -1) {
    +            limitBackoffMillis = false;
    +        } else {
    +            limitBackoffMillis = true;
    +        }
    +
    +        if(maxElapsedMillis == -1) {
    +            checkElapsedTime = false;
    +        } else {
    +            checkElapsedTime = true;
    +        }
    +
    +        reset();
    +    }
    +
    +
    +    @Override
    +    public long nextWaitMillis() throws IllegalStateException {
    +        // check if we have exceeded the allowed maximum elapsed time
    +        if(checkElapsedTime && totalElapsed > maxElapsedMillis) {
    +            return BackoffStrategy.STOP;
    +        }
    +
    +        long waitMillis = nextWait;
    +
    +        // calculate the next wait milliseconds
    +        nextWait = Math.round(nextWait * backoffMultiple);
    +
    +        // don't exceed the allowed maximum wait milliseconds
    +        // if a maximum was configured
    +        if(limitBackoffMillis && nextWait > maxBackoffMillis) {
    +            nextWait = maxBackoffMillis;
    +        }
    +
    +        // track total elapsed time, even if we don't wait we have to assume
    +        // that some amount of time passed outside of the wait or we'll never
    +        // hit the elapsed time limit
    +        totalElapsed += waitMillis != 0 ? waitMillis : 1L;
    +        return waitMillis;
    +    }
    +
    +    @Override
    +    public void reset() {
    +        nextWait = this.initialBackoffMillis;
    +        totalElapsed = 0;
    +    }
    +
    +    /**
    +     *
    +     * @return a new {@link Builder} instance.
    +     */
    +    public static Builder builder() {
    +        return new Builder();
    +    }
    +
    +    /**
    +     * Builder for instances of {@link ExponentialBackoffStrategy}.
    +     */
    +    public static final class Builder {
    --- End diff --

    I'm not sure a builder is the best way to handle this. Since I think it would be nice to have much of this be user configurable, perhaps we can just pull the config from system properties?
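
To make the system-properties suggestion concrete, here is a minimal sketch of what that could look like. It is an illustration only, not code from the pull request: the class name and all four property names are hypothetical, while the default values, the -1 "no limit" convention, and the wait-time arithmetic are copied from the diff quoted above.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: reads back-off settings from JVM system properties. */
public class BackoffFromSystemProperties {
    // Hypothetical property names, invented for this example.
    public static final String INITIAL_PROP = "zookeeper.bindRetry.initialBackoffMillis";
    public static final String MAX_PROP = "zookeeper.bindRetry.maxBackoffMillis";
    public static final String MAX_ELAPSED_PROP = "zookeeper.bindRetry.maxElapsedMillis";
    public static final String MULTIPLE_PROP = "zookeeper.bindRetry.backoffMultiple";

    public final long initialBackoffMillis;
    public final long maxBackoffMillis;  // -1 means "no cap", as in the diff
    public final long maxElapsedMillis;  // -1 means "no elapsed-time limit"
    public final double backoffMultiple;

    public BackoffFromSystemProperties() {
        // Long.getLong falls back to the default when the property is
        // unset or is not a valid number.
        initialBackoffMillis = Long.getLong(INITIAL_PROP, 500L);
        maxBackoffMillis = Long.getLong(MAX_PROP, 30_000L);
        maxElapsedMillis = Long.getLong(MAX_ELAPSED_PROP, 5 * 60_000L);
        // There is no Double.getDouble, so parse the string form directly;
        // a malformed value throws NumberFormatException here.
        backoffMultiple = Double.parseDouble(System.getProperty(MULTIPLE_PROP, "1.5"));
    }

    /** Wait times produced by the same arithmetic as nextWaitMillis() in the diff. */
    public List<Long> schedule(int attempts) {
        List<Long> waits = new ArrayList<>();
        long nextWait = initialBackoffMillis;
        long totalElapsed = 0;
        for (int i = 0; i < attempts; i++) {
            if (maxElapsedMillis != -1 && totalElapsed > maxElapsedMillis) {
                break; // corresponds to returning BackoffStrategy.STOP
            }
            long wait = nextWait;
            nextWait = Math.round(nextWait * backoffMultiple);
            if (maxBackoffMillis != -1 && nextWait > maxBackoffMillis) {
                nextWait = maxBackoffMillis;
            }
            totalElapsed += wait != 0 ? wait : 1L;
            waits.add(wait);
        }
        return waits;
    }

    public static void main(String[] args) {
        // With none of the properties set, prints [500, 750, 1125, 1688, 2532, 3798]
        System.out.println(new BackoffFromSystemProperties().schedule(6));
    }
}
```

Under these assumptions an operator could override a single value on the command line, for example `-Dzookeeper.bindRetry.maxBackoffMillis=10000`, without any code change. The trade-off against the builder is that the settings become global to the JVM rather than per instance.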

> Quorum port binding needs exponential back-off retry
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-2849
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2849
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: quorum
>    Affects Versions: 3.4.6, 3.5.3
>            Reporter: Brian Lininger
>            Assignee: Brian Lininger
>            Priority: Minor
>
> Recently we upgraded the AWS instance type we use for running our ZooKeeper
> nodes, and by doing so we're intermittently hitting an issue where ZooKeeper
> cannot bind to the server election port because the IP is incorrect. This is
> due to name resolution in Route53 not being in sync when ZooKeeper starts on
> the more powerful EC2 instances. Currently in QuorumCnxManager.Listener, we
> only attempt to bind 3 times with a 1s sleep between retries, which is not
> long enough.
> I'm proposing to change this to follow an exponential back-off type strategy
> where each failed attempt causes a longer sleep between retry attempts. This
> would allow ZooKeeper to gracefully recover when the host is
> misconfigured, and subsequently corrected, without requiring the process to
> be restarted while also minimizing the impact to the running instance.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)