[GitHub] [flink] HuangZhenQiu commented on a change in pull request #8952: [FLINK-10868][flink-runtime] Add failure rater for resource manager

GitBox Fri, 25 Dec 2020 19:55:48 -0800


HuangZhenQiu commented on a change in pull request #8952:
URL: https://github.com/apache/flink/pull/8952#discussion_r548939375




##########
File path: flink-core/src/main/java/org/apache/flink/configuration/ResourceManagerOptions.java
##########
@@ -67,6 +69,33 @@
                        "for streaming workloads, which may fail if there are not enough slots. Note that this configuration option does not take " +
                        "effect for standalone clusters, where how many slots are allocated is not controlled by Flink.");
 
+       /**
+        * Defines the maximum number of worker (YARN / Mesos / Kubernetes) failures per minute before rejecting subsequent worker
+        * requests until the failure rate falls below the maximum. It is to quickly catch external dependency caused
+        * workers failure and wait for retry interval before sending new request. By default, the value is set to 10/min.
+        */
+       public static final ConfigOption<Double> MAXIMUM_WORKERS_FAILURE_RATE = ConfigOptions
+               .key("resourcemanager.start-worker.max-failure-rate")
+               .doubleType()
+               .defaultValue(10.0)
+               .withDescription("Defines the maximum number of worker (YARN / Mesos) failures per minute before rejecting" +
+                       " subsequent worker requests until the failure rate falls below the maximum. It is to quickly catch" +
+                       " external dependency caused workers failure and terminate job accordingly." +
+                       " By default, the value is set to 10/min.");
+
+       /**
+        * Defines the worker creation interval in milliseconds. In case of worker creation failures, we should wait for an interval before
+        * trying to create new workers when the failure rate exceeds. Otherwise, ActiveResourceManager will always re-requesting
+        * the worker, which keeps the main thread busy.
+        */
+       public static final ConfigOption<Duration> WORKER_CREATION_RETRY_INTERVAL = ConfigOptions
+               .key("resourcemanager.start-worker.retry-interval")
+               .durationType()
+               .defaultValue(Duration.ofMillis(30))

Review comment:
       Updated the default to 3 seconds, matching the default value of the original Kubernetes retry interval.
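
       For readers following the thread, here is a rough sketch of how a per-minute failure-rate gate like the one described in the Javadoc could behave. It is only an illustration under assumptions: the class name `SlidingWindowFailureRater`, its methods, and the sliding-window approach are hypothetical and are not the failure rater added by this pull request.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a sliding-window failure rater.
 * Not the implementation from this pull request.
 */
public class SlidingWindowFailureRater {
    // Maximum number of failures tolerated inside one window (e.g. 10.0 per minute).
    private final double maxFailuresPerWindow;
    // Length of the window in milliseconds.
    private final long windowMillis;
    // Timestamps (in milliseconds) of recorded failures, oldest first.
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    public SlidingWindowFailureRater(double maxFailuresPerWindow, Duration window) {
        this.maxFailuresPerWindow = maxFailuresPerWindow;
        this.windowMillis = window.toMillis();
    }

    /** Records one worker failure observed at the given time. */
    public void markFailure(long nowMillis) {
        failureTimestamps.addLast(nowMillis);
    }

    /** Returns true if the failures still inside the window exceed the maximum. */
    public boolean exceedsFailureRate(long nowMillis) {
        // Drop failures that have aged out of the window.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() >= windowMillis) {
            failureTimestamps.removeFirst();
        }
        return failureTimestamps.size() > maxFailuresPerWindow;
    }
}
```

       In this sketch, a caller would check `exceedsFailureRate` before requesting a new worker and, if it returns true, wait for the configured retry interval instead of re-requesting immediately.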




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
