liuml07 opened a new pull request, #27104:
URL: https://github.com/apache/flink/pull/27104

   ## What is the purpose of the change
   
   https://issues.apache.org/jira/browse/FLINK-38499
   
   Currently, the Curator framework used by ZooKeeper-based HA uses the 
exponential backoff retry policy without a bound on the max sleep time. 
The sleep time can therefore grow very large when the retryCount is large. When 
that happens, recovery from ZK issues may be unreasonably slow.
   
   In my day job, we carry a critical patch that limits the max sleep time, 
added after seeing multiple ZK issues in the past. In other Apache projects, 
BoundedExponentialBackoffRetry is widely used, for example in Fluss, Druid, 
Hudi, BookKeeper, and Phoenix, to name a few.
   
   This PR limits the max sleep time by using BoundedExponentialBackoffRetry, 
with a fairly high default value (30 seconds) to start with. Users can change 
this via a new config option.
   
   ## Brief change log
   
   1. Added new configuration option for HA:
     - Key: `high-availability.zookeeper.client.max-retry-wait`
     - Type: Duration
     - Default: 30 seconds (30000ms)
     - Description: Caps exponential backoff to prevent excessively long waits 
between retries
   2. Updated retry policy in `ZooKeeperUtils`
   3. Updated test files to use the new retry policy
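
To illustrate why the cap matters, here is a minimal, self-contained sketch of how a bounded exponential backoff differs from an unbounded one. It does not call Curator and is not the code in this PR: the class `BoundedBackoffSketch`, its method names, and the 5 second base wait are assumptions made for illustration. Only the 30 second cap is taken from the default listed above.

```java
import java.util.Random;

// Illustrative sketch only; names and the base wait are hypothetical.
class BoundedBackoffSketch {

    // Largest retry count for which (1 << (retryCount + 1)) still fits in an int.
    static final int MAX_RETRY_EXPONENT = 29;

    // Unbounded exponential backoff: the random multiplier is drawn from
    // [1, 2^(retryCount + 1)), so the possible wait doubles with every retry.
    static long unboundedSleepMs(int baseSleepMs, int retryCount, Random random) {
        int exponent = Math.min(Math.max(retryCount, 0), MAX_RETRY_EXPONENT);
        int multiplier = Math.max(1, random.nextInt(1 << (exponent + 1)));
        return (long) baseSleepMs * multiplier;
    }

    // Worst-case unbounded wait for a retry: base * (2^(retryCount + 1) - 1).
    static long maxUnboundedSleepMs(int baseSleepMs, int retryCount) {
        int exponent = Math.min(Math.max(retryCount, 0), MAX_RETRY_EXPONENT);
        return (long) baseSleepMs * ((1L << (exponent + 1)) - 1);
    }

    // Bounded variant: same growth, but never longer than maxSleepMs.
    static long boundedSleepMs(int baseSleepMs, int maxSleepMs, int retryCount, Random random) {
        return Math.min(maxSleepMs, unboundedSleepMs(baseSleepMs, retryCount, random));
    }

    public static void main(String[] args) {
        int baseSleepMs = 5_000;   // hypothetical base wait
        int maxSleepMs = 30_000;   // the 30 s default proposed for max-retry-wait
        Random random = new Random();
        for (int retry = 0; retry < 10; retry++) {
            System.out.printf(
                    "retry=%d worstCaseUnbounded=%dms bounded=%dms%n",
                    retry,
                    maxUnboundedSleepMs(baseSleepMs, retry),
                    boundedSleepMs(baseSleepMs, maxSleepMs, retry, random));
        }
    }
}
```

With these assumed values, the worst-case unbounded wait at the tenth attempt (retry index 9) is 5000 ms x 1023, roughly 85 minutes, while the bounded variant never waits longer than 30 seconds.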
   
   ## Verifying this change
   
   Updated existing tests. Ported from an internally tested patch.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (**yes** / no / don't 
know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (**yes** / no)
     - If yes, how is the feature documented? (not applicable / docs / 
**JavaDocs** / not documented)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
