[GitHub] suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers
suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers
URL: https://github.com/apache/flink/pull/7356#discussion_r252039395

## File path: flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java ##

@@ -626,6 +686,44 @@ public void unRegisterInfoMessageListener(final String address) {
         }
     }

+    protected void rejectAllPendingSlotRequests(Exception e) {
+        slotManager.rejectAllPendingSlotRequests(e);
+    }
+
+    protected synchronized void recordFailure() {
+        if (!checkFailureRate) {
+            return;
+        }
+        if (isFailureTimestampFull()) {
+            taskExecutorFailureTimestamps.remove();
+        }
+        taskExecutorFailureTimestamps.add(System.currentTimeMillis());
+    }
+
+    protected boolean shouldRejectRequests() {

Review comment:
  The rate calculation logic here shares a lot with FailureRateRestartStrategy. Can we refactor the rate calculation code into a common class?

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services
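The refactoring asked for in the review comment on `shouldRejectRequests()` could look roughly like the sketch below. It is only an illustration: the class name `FailureRateTracker`, its methods, and the choice to pass the current time in as a parameter are assumptions and do not come from the PR or from Flink. The queue handling mirrors the `recordFailure()` logic in the diff: keep at most N timestamps and treat the rate as exceeded when the oldest retained failure is still inside the interval.

```java
import java.util.ArrayDeque;

/**
 * Hypothetical shared helper (name and API are illustrative, not from Flink).
 * Keeps the timestamps of the most recent failures and reports whether
 * maxFailuresPerInterval failures occurred within intervalMillis.
 */
class FailureRateTracker {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final ArrayDeque<Long> failureTimestamps;

    FailureRateTracker(int maxFailuresPerInterval, long intervalMillis) {
        if (maxFailuresPerInterval <= 0 || intervalMillis <= 0) {
            throw new IllegalArgumentException("limit and interval must be positive");
        }
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
        this.failureTimestamps = new ArrayDeque<>(maxFailuresPerInterval);
    }

    /** Records a failure, dropping the oldest timestamp once the queue is full. */
    synchronized void recordFailure(long nowMillis) {
        if (failureTimestamps.size() == maxFailuresPerInterval) {
            failureTimestamps.remove();
        }
        failureTimestamps.add(nowMillis);
    }

    /** True when the queue is full and its oldest entry is still within the interval. */
    synchronized boolean isRateExceeded(long nowMillis) {
        if (failureTimestamps.size() < maxFailuresPerInterval) {
            return false;
        }
        return nowMillis - failureTimestamps.peek() <= intervalMillis;
    }
}
```

Under these assumptions, both the ResourceManager and a restart strategy could hold one tracker each and call `recordFailure(System.currentTimeMillis())` on every failure. Taking the clock value as an argument keeps the class easy to unit-test.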
suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers
URL: https://github.com/apache/flink/pull/7356#discussion_r252038865

## File path: flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java ##

@@ -192,7 +235,17 @@ public ResourceManager(
         this.jobManagerRegistrations = new HashMap<>(4);
         this.jmResourceIdRegistrations = new HashMap<>(4);
         this.taskExecutors = new HashMap<>(8);
-        infoMessageListeners = new ConcurrentHashMap<>(8);
+        this.infoMessageListeners = new ConcurrentHashMap<>(8);
+        this.failureInterval = failureInterval;
+        this.maximumFailureTaskExecutorPerInternal = maxFailurePerInterval;
+
+        if (maximumFailureTaskExecutorPerInternal > 0) {
+            this.taskExecutorFailureTimestamps = new ArrayDeque<>(maximumFailureTaskExecutorPerInternal);

Review comment:
  How about 0?
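The question about 0 concerns the boundary in `maximumFailureTaskExecutorPerInternal > 0`: the diff only shows what happens for positive values, and the PR does not state what 0 should mean. One possible way to make the boundary explicit is sketched below. The helper name `FailureRateThreshold` and the chosen policy (-1 disables the check, positive values enable it, everything else is rejected) are assumptions for illustration only, not the PR's behavior.

```java
/**
 * Hypothetical helper, not part of the PR: makes the meaning of each
 * threshold value explicit instead of relying on a bare "> 0" check.
 */
final class FailureRateThreshold {
    private FailureRateThreshold() {}

    /**
     * Returns true if failure-rate checking should be enabled.
     * Assumed semantics (not confirmed by the PR): -1 disables the check,
     * positive values enable it, and 0 or values below -1 are rejected.
     */
    static boolean isCheckEnabled(int maxFailuresPerInterval) {
        if (maxFailuresPerInterval == -1) {
            return false;
        }
        if (maxFailuresPerInterval > 0) {
            return true;
        }
        throw new IllegalArgumentException(
            "Expected -1 (disabled) or a positive value, got " + maxFailuresPerInterval);
    }
}
```

Rejecting 0 outright is just one option. Treating it as "tolerate no failures" would be equally defensible, as long as the choice is documented next to the option.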
suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers
URL: https://github.com/apache/flink/pull/7356#discussion_r252025408

## File path: flink-yarn/src/main/java/org/apache/flink/yarn/configuration/YarnConfigOptions.java ##

@@ -83,8 +83,27 @@
      */
     public static final ConfigOption MAX_FAILED_CONTAINERS =
         key("yarn.maximum-failed-containers")
-        .noDefaultValue()
-        .withDescription("Maximum number of containers the system is going to reallocate in case of a failure.");
+        .noDefaultValue()
+        .withDescription("Maximum number of containers the system is going to reallocate in case of a failure.");
+
+    /**
+     * The maximum number of failed YARN containers within an interval before entirely stopping
+     * the YARN session / job on YARN.
+     * By default, the value is -1
+     */
+    public static final ConfigOption MAX_FAILED_CONTAINERS_PER_INTERVAL =
+        key("yarn.maximum-failed-containers-per-interval")
+        .defaultValue(-1)
+        .withDescription("Maximum number of containers the system is going to reallocate in case of a failure in an interval.");

Review comment:
  Please document what -1 means.
suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers
URL: https://github.com/apache/flink/pull/7356#discussion_r252025274

## File path: docs/_includes/generated/mesos_configuration.html ##

@@ -27,6 +27,11 @@
 -1
 The maximum number of failed workers before the cluster fails. May be set to -1 to disable this feature. This option is ignored unless Flink is in legacy mode.
+
+mesos.maximum-failed-workers-per-interval
+-1
+Maximum number of workers the system is going to reallocate in case of a failure in an interval.

Review comment:
  Please document what -1 and 0 mean.
suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers
URL: https://github.com/apache/flink/pull/7356#discussion_r252024051

## File path: flink-mesos/src/main/java/org/apache/flink/mesos/configuration/MesosOptions.java ##

@@ -99,6 +99,25 @@
         .withDescription("The config parameter defining the Mesos artifact server port to use. Setting the port to" +
             " 0 will let the OS choose an available port.");
+    /**
+     * The maximum number of failed Mesos worker within an interval before entirely stopping
+     * the Mesos session / job on Mesos.
+     * By default, the value is -1
+     */
+    public static final ConfigOption MAX_FAILED_WORKERS_PER_INTERVAL =
+        key("mesos.maximum-failed-workers-per-interval")
+        .defaultValue(-1)
+        .withDescription("Maximum number of workers the system is going to reallocate in case of a failure in an interval.");
+
+    /**
+     * The interval for measuring failure rate of containers in second unit.
+     * By default, the value is 5 minutes.
+     **/
+    public static final ConfigOption WORKERS_FAILURE_RATE_INTERVAL =
+        key("mesos.workers-failure-rate-interval")
+        .defaultValue(300)
+        .withDeprecatedKeys("The interval for measuring failure rate of workers");

Review comment:
  withDescription here as well.
suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers
URL: https://github.com/apache/flink/pull/7356#discussion_r252037753

## File path: docs/_includes/generated/yarn_config_configuration.html ##

@@ -42,6 +47,11 @@
 (none)
 Maximum number of containers the system is going to reallocate in case of a failure.
+
+yarn.maximum-failed-containers-per-interval
+-1

Review comment:
  Please document what -1 and 0 mean.
suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers
URL: https://github.com/apache/flink/pull/7356#discussion_r251657123

## File path: flink-yarn/src/main/java/org/apache/flink/yarn/configuration/YarnConfigOptions.java ##

@@ -83,8 +83,27 @@
      */
     public static final ConfigOption MAX_FAILED_CONTAINERS =
         key("yarn.maximum-failed-containers")
-        .noDefaultValue()
-        .withDescription("Maximum number of containers the system is going to reallocate in case of a failure.");
+        .noDefaultValue()
+        .withDescription("Maximum number of containers the system is going to reallocate in case of a failure.");
+
+    /**
+     * The maximum number of failed YARN containers within an interval before entirely stopping
+     * the YARN session / job on YARN.
+     * By default, the value is -1
+     */
+    public static final ConfigOption MAX_FAILED_CONTAINERS_PER_INTERVAL =
+        key("yarn.maximum-failed-containers-per-interval")
+        .defaultValue(-1)
+        .withDescription("Maximum number of containers the system is going to reallocate in case of a failure in an interval.");
+
+    /**
+     * The interval for measuring failure rate of containers in second unit.
+     * By default, the value is 5 minutes.
+     **/
+    public static final ConfigOption CONTAINERS_FAILURE_RATE_INTERVAL =
+        key("yarn.containers-failure-rate-interval")
+        .defaultValue(300)
+        .withDeprecatedKeys("The interval for measuring failure rate of containers");

Review comment:
  Should be withDescription here, not withDeprecatedKeys.