HuangZhenQiu commented on a change in pull request #8952:
URL: https://github.com/apache/flink/pull/8952#discussion_r550421330



##########
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java
##########
@@ -213,14 +235,33 @@ public void 
onPreviousAttemptWorkersRecovered(Collection<WorkerType> recoveredWo
         }
     }
 
+    /**
+     * Record failure number of worker in ResourceManagers. Return whether 
maximum failure rate is
+     * detected.
+     *
+     * @return whether should acquire new container/worker after the a stop 
interval
+     */
+    public boolean recordWorkerFailure() {
+        failureRater.markEvent();
+
+        try {
+            failureRater.checkAgainstThreshold();
+        } catch (ThresholdExceedException e) {
+            log.warn(e.getMessage() + " in resource manager failure rater.");
+            return true;
+        }
+
+        return false;
+    }
+
     @Override
     public void onWorkerTerminated(ResourceID resourceId, String diagnostics) {
         if (clearStateForWorker(resourceId)) {
             log.info(
                     "Worker {} is terminated. Diagnostics: {}",
                     resourceId.getStringWithMetadata(),
                     diagnostics);
-            requestWorkerIfRequired();
+            recordWorkerFailureAndPauseWorkerCreationIfNeeded();

Review comment:
       Yes, we should record failure only for 
currentAttemptUnregisteredWorkers. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to