[GitHub] suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers

2019-01-29 Thread GitBox
suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] 
Enforce maximum TMs failure rate in ResourceManagers
URL: https://github.com/apache/flink/pull/7356#discussion_r252039395
 
 

 ##
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java
 ##
 @@ -626,6 +686,44 @@ public void unRegisterInfoMessageListener(final String address) {
         }
     }
 
+    protected void rejectAllPendingSlotRequests(Exception e) {
+        slotManager.rejectAllPendingSlotRequests(e);
+    }
+
+    protected synchronized void recordFailure() {
+        if (!checkFailureRate) {
+            return;
+        }
+        if (isFailureTimestampFull()) {
+            taskExecutorFailureTimestamps.remove();
+        }
+        taskExecutorFailureTimestamps.add(System.currentTimeMillis());
+    }
+
+    protected boolean shouldRejectRequests() {
 
 Review comment:
   The rate calculation logic here shares a lot with FailureRateRestartStrategy.
   Can we refactor the rate calculation code into a common class?
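
   A minimal sketch of what such a common class could look like, assuming a
   sliding window over the most recent failure timestamps as in recordFailure()
   above. The class name FailureRateTracker, its constructor parameters, and the
   explicit nowMillis argument are hypothetical and are not taken from the pull
   request or from FailureRateRestartStrategy.

```java
import java.util.ArrayDeque;

/** Hypothetical shared helper; the name and API are assumptions, not Flink code. */
final class FailureRateTracker {
    private final int maxFailuresPerInterval;
    private final long intervalMillis;
    private final ArrayDeque<Long> failureTimestamps;

    FailureRateTracker(int maxFailuresPerInterval, long intervalMillis) {
        if (maxFailuresPerInterval <= 0 || intervalMillis <= 0) {
            throw new IllegalArgumentException("both limits must be positive");
        }
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
        this.failureTimestamps = new ArrayDeque<>(maxFailuresPerInterval);
    }

    /** Records a failure, keeping only the newest maxFailuresPerInterval timestamps. */
    synchronized void recordFailure(long nowMillis) {
        if (failureTimestamps.size() == maxFailuresPerInterval) {
            failureTimestamps.remove(); // drop the oldest timestamp
        }
        failureTimestamps.add(nowMillis);
    }

    /** True if the window is full and its oldest failure is at most intervalMillis old. */
    synchronized boolean isRateExceeded(long nowMillis) {
        if (failureTimestamps.size() < maxFailuresPerInterval) {
            return false;
        }
        return nowMillis - failureTimestamps.peek() <= intervalMillis;
    }
}
```

   Passing the current time in as an argument, rather than calling
   System.currentTimeMillis() inside the class, is a design choice of this sketch
   that makes the logic testable without sleeping.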


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers

2019-01-29 Thread GitBox
URL: https://github.com/apache/flink/pull/7356#discussion_r252038865
 
 

 ##
 File path: flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java
 ##
 @@ -192,7 +235,17 @@ public ResourceManager(
         this.jobManagerRegistrations = new HashMap<>(4);
         this.jmResourceIdRegistrations = new HashMap<>(4);
         this.taskExecutors = new HashMap<>(8);
-        infoMessageListeners = new ConcurrentHashMap<>(8);
+        this.infoMessageListeners = new ConcurrentHashMap<>(8);
+        this.failureInterval = failureInterval;
+        this.maximumFailureTaskExecutorPerInternal = maxFailurePerInterval;
+
+        if (maximumFailureTaskExecutorPerInternal > 0) {
+            this.taskExecutorFailureTimestamps = new ArrayDeque<>(maximumFailureTaskExecutorPerInternal);
 
 Review comment:
   What about the case where the value is 0?
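
   One possible way to make the treatment of 0 explicit is sketched below. It
   assumes that -1 means "check disabled" and that 0 and other non-positive
   values are rejected as invalid configuration. Both choices are assumptions
   of this sketch, and FailureRateSetting and isCheckEnabled are hypothetical
   names; the pull request may settle on different semantics.

```java
/** Hypothetical helper; the names and the chosen semantics are assumptions, not Flink code. */
final class FailureRateSetting {
    /** Sentinel assumed to mean "failure-rate check disabled". */
    static final int DISABLED = -1;

    /** Returns true if the check should run; rejects 0 and any negative value other than -1. */
    static boolean isCheckEnabled(int maxFailuresPerInterval) {
        if (maxFailuresPerInterval == DISABLED) {
            return false;
        }
        if (maxFailuresPerInterval <= 0) {
            throw new IllegalArgumentException(
                "maxFailuresPerInterval must be -1 (disabled) or positive, but was "
                    + maxFailuresPerInterval);
        }
        return true;
    }
}
```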




[GitHub] suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers

2019-01-29 Thread GitBox
URL: https://github.com/apache/flink/pull/7356#discussion_r252025408
 
 

 ##
 File path: flink-yarn/src/main/java/org/apache/flink/yarn/configuration/YarnConfigOptions.java
 ##
 @@ -83,8 +83,27 @@
      */
     public static final ConfigOption MAX_FAILED_CONTAINERS =
         key("yarn.maximum-failed-containers")
-        .noDefaultValue()
-        .withDescription("Maximum number of containers the system is going to reallocate in case of a failure.");
+            .noDefaultValue()
+            .withDescription("Maximum number of containers the system is going to reallocate in case of a failure.");
+
+    /**
+     * The maximum number of failed YARN containers within an interval before entirely stopping
+     * the YARN session / job on YARN.
+     * By default, the value is -1
+     */
+    public static final ConfigOption MAX_FAILED_CONTAINERS_PER_INTERVAL =
+        key("yarn.maximum-failed-containers-per-interval")
+            .defaultValue(-1)
+            .withDescription("Maximum number of containers the system is going to reallocate in case of a failure in an interval.");
 
 Review comment:
   Please document what -1 means.




[GitHub] suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers

2019-01-29 Thread GitBox
URL: https://github.com/apache/flink/pull/7356#discussion_r252025274
 
 

 ##
 File path: docs/_includes/generated/mesos_configuration.html
 ##
 @@ -27,6 +27,11 @@
 -1
 The maximum number of failed workers before the cluster fails. May be set to -1 to disable this feature. This option is ignored unless Flink is in legacy mode.
 
+
+mesos.maximum-failed-workers-per-interval
+-1
+Maximum number of workers the system is going to reallocate in case of a failure in an interval.
 
 Review comment:
   Please document what -1 and 0 mean.




[GitHub] suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers

2019-01-29 Thread GitBox
URL: https://github.com/apache/flink/pull/7356#discussion_r252024051
 
 

 ##
 File path: flink-mesos/src/main/java/org/apache/flink/mesos/configuration/MesosOptions.java
 ##
 @@ -99,6 +99,25 @@
             .withDescription("The config parameter defining the Mesos artifact server port to use. Setting the port to" +
                 " 0 will let the OS choose an available port.");
 
+    /**
+     * The maximum number of failed Mesos worker within an interval before entirely stopping
+     * the Mesos session / job on Mesos.
+     * By default, the value is -1
+     */
+    public static final ConfigOption MAX_FAILED_WORKERS_PER_INTERVAL =
+        key("mesos.maximum-failed-workers-per-interval")
+            .defaultValue(-1)
+            .withDescription("Maximum number of workers the system is going to reallocate in case of a failure in an interval.");
+
+    /**
+     * The interval for measuring failure rate of containers in second unit.
+     * By default, the value is 5 minutes.
+     **/
+    public static final ConfigOption WORKERS_FAILURE_RATE_INTERVAL =
+        key("mesos.workers-failure-rate-interval")
+            .defaultValue(300)
+            .withDeprecatedKeys("The interval for measuring failure rate of workers");
 
 Review comment:
   This should be withDescription here as well, not withDeprecatedKeys.




[GitHub] suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers

2019-01-29 Thread GitBox
URL: https://github.com/apache/flink/pull/7356#discussion_r252037753
 
 

 ##
 File path: docs/_includes/generated/yarn_config_configuration.html
 ##
 @@ -42,6 +47,11 @@
 (none)
 Maximum number of containers the system is going to reallocate in case of a failure.
 
+
+yarn.maximum-failed-containers-per-interval
+-1
 
 Review comment:
   Please document what -1 and 0 mean.




[GitHub] suez1224 commented on a change in pull request #7356: [FLINK-10868][flink-yarn] Enforce maximum TMs failure rate in ResourceManagers

2019-01-28 Thread GitBox
URL: https://github.com/apache/flink/pull/7356#discussion_r251657123
 
 

 ##
 File path: flink-yarn/src/main/java/org/apache/flink/yarn/configuration/YarnConfigOptions.java
 ##
 @@ -83,8 +83,27 @@
      */
     public static final ConfigOption MAX_FAILED_CONTAINERS =
         key("yarn.maximum-failed-containers")
-        .noDefaultValue()
-        .withDescription("Maximum number of containers the system is going to reallocate in case of a failure.");
+            .noDefaultValue()
+            .withDescription("Maximum number of containers the system is going to reallocate in case of a failure.");
+
+    /**
+     * The maximum number of failed YARN containers within an interval before entirely stopping
+     * the YARN session / job on YARN.
+     * By default, the value is -1
+     */
+    public static final ConfigOption MAX_FAILED_CONTAINERS_PER_INTERVAL =
+        key("yarn.maximum-failed-containers-per-interval")
+            .defaultValue(-1)
+            .withDescription("Maximum number of containers the system is going to reallocate in case of a failure in an interval.");
+
+    /**
+     * The interval for measuring failure rate of containers in second unit.
+     * By default, the value is 5 minutes.
+     **/
+    public static final ConfigOption CONTAINERS_FAILURE_RATE_INTERVAL =
+        key("yarn.containers-failure-rate-interval")
+            .defaultValue(300)
+            .withDeprecatedKeys("The interval for measuring failure rate of containers");
 
 Review comment:
   This should be withDescription, not withDeprecatedKeys.
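
   To illustrate why the mix-up matters, here is a small self-contained sketch.
   SketchOption is a simplified, hypothetical stand-in written for this example
   and is not Flink's ConfigOption class. It assumes only that
   withDeprecatedKeys stores its arguments as alternative keys and that
   withDescription stores the documentation text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for illustration only; this is not Flink's ConfigOption. */
final class SketchOption {
    final String key;
    final int defaultValue;
    final List<String> deprecatedKeys;
    final String description;

    private SketchOption(String key, int defaultValue, List<String> deprecatedKeys, String description) {
        this.key = key;
        this.defaultValue = defaultValue;
        this.deprecatedKeys = deprecatedKeys;
        this.description = description;
    }

    static SketchOption of(String key, int defaultValue) {
        return new SketchOption(key, defaultValue, new ArrayList<>(), "");
    }

    /** Registers old key names that should still be honoured when reading configuration. */
    SketchOption withDeprecatedKeys(String... keys) {
        return new SketchOption(key, defaultValue, Arrays.asList(keys), description);
    }

    /** Sets the human-readable text shown in generated documentation. */
    SketchOption withDescription(String text) {
        return new SketchOption(key, defaultValue, deprecatedKeys, text);
    }
}
```

   With this stand-in, passing the sentence to withDeprecatedKeys leaves the
   description empty and registers the whole sentence as a fallback key, while
   withDescription stores it as documentation and leaves the deprecated keys
   untouched.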

