dmvk commented on a change in pull request #18689:
URL: https://github.com/apache/flink/pull/18689#discussion_r803659647



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java
##########
@@ -148,6 +162,33 @@ public Logger getLogger() {
         return logger;
     }
 
+    protected Throwable extractError(TaskExecutionStateTransition taskExecutionStateTransition) {
+        Throwable cause = taskExecutionStateTransition.getError(userCodeClassLoader);
+        if (cause == null) {
+            cause = new FlinkException("Unknown failure cause. Probably related to FLINK-21376.");
+        }
+        return cause;
+    }
+
+    protected Optional<ExecutionVertexID> extractExecutionVertexID(
+            TaskExecutionStateTransition taskExecutionStateTransition) {
+        return executionGraph.getExecutionVertexId(taskExecutionStateTransition.getID());

Review comment:
       I'd like to avoid adding new methods to the execution graph. `ExecutionGraph#getRegisteredExecutions` should be enough to cover this use case.
   
   Another thing: we don't really expect this lookup to ever fail, so we can throw an exception right away if we don't find an entry.
   
   DefaultScheduler does the same thing; even though it also uses an Optional, it checks that the Optional is not empty after a successful update to the execution graph.
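
   For illustration, a minimal sketch of the suggested lookup (the helper name and where it would live are hypothetical; `getRegisteredExecutions()`, `Execution#getVertex()` and `ExecutionVertex#getID()` are the accessors already used in this PR's diff):

```java
// Hypothetical helper, assuming the caller holds a reference to the ExecutionGraph.
private ExecutionVertexID lookUpExecutionVertexId(
        ExecutionGraph executionGraph, ExecutionAttemptID attemptId) {
    final Execution execution = executionGraph.getRegisteredExecutions().get(attemptId);
    if (execution == null) {
        // We never expect this to be missing after a successfully applied state transition.
        throw new IllegalStateException("No registered execution for attempt " + attemptId);
    }
    return execution.getVertex().getID();
}
```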

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/Executing.java
##########
@@ -80,49 +89,53 @@ public void cancel() {
                 getExecutionGraph(), getExecutionGraphHandler(), getOperatorCoordinatorHandler());
     }
 
+    private void handleFailure(Failure failure) {
+        failureCollection.add(failure);
+        FailureResult failureResult = context.howToHandleFailure(failure);
+        transitionOnFailure(failureResult);
+    }
+
     @Override
     public void handleGlobalFailure(Throwable cause) {
-        handleAnyFailure(cause);
+        handleFailure(Failure.createGlobal(cause));
     }
 
-    private void handleAnyFailure(Throwable cause) {
-        final FailureResult failureResult = context.howToHandleFailure(cause);
+    @Override
+    boolean updateTaskExecutionState(TaskExecutionStateTransition taskExecutionStateTransition) {
+        final boolean successfulUpdate =
+                getExecutionGraph().updateState(taskExecutionStateTransition);
+
+        if (successfulUpdate
+                && taskExecutionStateTransition.getExecutionState() == ExecutionState.FAILED) {
+            handleFailure(
+                    Failure.createLocal(
+                            extractError(taskExecutionStateTransition),
+                            extractExecutionVertexID(taskExecutionStateTransition)));
+        }
+
+        return successfulUpdate;

Review comment:
       This method is duplicated several times, basically in all states extending `StateWithExecutionGraph` apart from `Cancelling`.
   
   1) It feels like it could be moved up to the base class to avoid code duplication (see the sketch below).
   2) I think tasks can also fail while the job is cancelling (basically when the cancel call on the operator throws an exception). Is this correct? If yes, it would eliminate the need to treat the `Cancelling` state differently.
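
   A rough sketch of point 1, assuming a `handleFailure(Failure)` hook can be declared on `StateWithExecutionGraph` (the abstract hook and its placement are assumptions, not something this PR already contains):

```java
// Sketch: shared implementation in StateWithExecutionGraph; subclasses only decide
// how to react to the Failure via the (assumed) abstract hook below.
boolean updateTaskExecutionState(TaskExecutionStateTransition taskExecutionStateTransition) {
    final boolean successfulUpdate = getExecutionGraph().updateState(taskExecutionStateTransition);

    if (successfulUpdate
            && taskExecutionStateTransition.getExecutionState() == ExecutionState.FAILED) {
        handleFailure(
                Failure.createLocal(
                        extractError(taskExecutionStateTransition),
                        extractExecutionVertexID(taskExecutionStateTransition)));
    }
    return successfulUpdate;
}

// Each state defines what a task failure means for it (restart, fail, ignore, ...).
abstract void handleFailure(Failure failure);
```

   With something like that in place, `Cancelling` could implement `handleFailure` as a no-op (or just log), which would also cover point 2 without treating that state specially.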

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java
##########
@@ -148,6 +162,33 @@ public Logger getLogger() {
         return logger;
     }
 
+    protected Throwable extractError(TaskExecutionStateTransition taskExecutionStateTransition) {
+        Throwable cause = taskExecutionStateTransition.getError(userCodeClassLoader);
+        if (cause == null) {
+            cause = new FlinkException("Unknown failure cause. Probably related to FLINK-21376.");
+        }
+        return cause;
+    }
+
+    protected Optional<ExecutionVertexID> extractExecutionVertexID(
+            TaskExecutionStateTransition taskExecutionStateTransition) {
+        return executionGraph.getExecutionVertexId(taskExecutionStateTransition.getID());
+    }
+
+    protected static Optional<RootExceptionHistoryEntry> convertFailures(
+            Function<ExecutionVertexID, Optional<ExecutionVertex>> lookup,

Review comment:
       This has a weird signature; why can't we simply pass an execution graph here?
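
   Just to sketch what I mean (an illustrative overload that delegates to the existing one, relying on the `getExecutionVertex` accessor this PR adds to the graph):

```java
// Sketch: accept the graph itself and derive the vertex lookup from it.
protected static Optional<RootExceptionHistoryEntry> convertFailures(
        ExecutionGraph executionGraph, List<Failure> failureCollection) {
    return convertFailures(executionGraph::getExecutionVertex, failureCollection);
}
```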

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java
##########
@@ -148,6 +162,33 @@ public Logger getLogger() {
         return logger;
     }
 
+    protected Throwable extractError(TaskExecutionStateTransition taskExecutionStateTransition) {

Review comment:
       We can get rid of this method once we unify `updateTaskExecutionState`

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/failure/Failure.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.scheduler.adaptive.failure;
+
+import org.apache.flink.runtime.executiongraph.ExecutionVertex;
+import org.apache.flink.runtime.scheduler.exceptionhistory.ExceptionHistoryEntry;
+import org.apache.flink.runtime.scheduler.exceptionhistory.RootExceptionHistoryEntry;
+import org.apache.flink.runtime.scheduler.strategy.ExecutionVertexID;
+
+import java.util.Optional;
+import java.util.Set;
+import java.util.function.Function;
+
+/** Failure object. */
+public abstract class Failure {
+    private final Throwable cause;
+    private final long timestamp;
+
+    public Failure(Throwable cause) {
+        this.cause = cause;
+        this.timestamp = System.currentTimeMillis();

Review comment:
       Not really, the DefaultScheduler does the same thing. Unless the timestamp were reported by the TM, there is not much we can do.

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StopWithSavepoint.java
##########
@@ -134,30 +143,53 @@ public JobStatus getJobStatus() {
         return JobStatus.RUNNING;
     }
 
+    private void handleFailure(Failure failure) {
+        failureCollection.add(failure);
+        FailureResult failureResult = context.howToHandleFailure(failure);
+        transitionOnFailure(failureResult);
+    }
+
     @Override
     public void handleGlobalFailure(Throwable cause) {
-        handleAnyFailure(cause);
+        handleFailure(Failure.createGlobal(cause));
     }
 
     @Override
     boolean updateTaskExecutionState(TaskExecutionStateTransition taskExecutionStateTransition) {
         final boolean successfulUpdate =
                 getExecutionGraph().updateState(taskExecutionStateTransition);
 
-        if (successfulUpdate) {
-            if (taskExecutionStateTransition.getExecutionState() == ExecutionState.FAILED) {
-                Throwable cause = taskExecutionStateTransition.getError(userCodeClassLoader);
-                handleAnyFailure(
-                        cause == null
-                                ? new FlinkException(
-                                        "Unknown failure cause. Probably related to FLINK-21376.")
-                                : cause);
-            }
+        if (successfulUpdate
+                && taskExecutionStateTransition.getExecutionState() == ExecutionState.FAILED) {
+            handleFailure(
+                    Failure.createLocal(
+                            extractError(taskExecutionStateTransition),
+                            extractExecutionVertexID(taskExecutionStateTransition)));
         }
 
         return successfulUpdate;
     }
 
+    private void transitionOnFailure(FailureResult failureResult) {

Review comment:
       This is duplicated in the `Executing` state.

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/StateWithExecutionGraph.java
##########
@@ -148,6 +162,33 @@ public Logger getLogger() {
         return logger;
     }
 
+    protected Throwable extractError(TaskExecutionStateTransition taskExecutionStateTransition) {
+        Throwable cause = taskExecutionStateTransition.getError(userCodeClassLoader);
+        if (cause == null) {
+            cause = new FlinkException("Unknown failure cause. Probably related to FLINK-21376.");
+        }
+        return cause;
+    }
+
+    protected Optional<ExecutionVertexID> extractExecutionVertexID(
+            TaskExecutionStateTransition taskExecutionStateTransition) {
+        return executionGraph.getExecutionVertexId(taskExecutionStateTransition.getID());
+    }
+
+    protected static Optional<RootExceptionHistoryEntry> convertFailures(
+            Function<ExecutionVertexID, Optional<ExecutionVertex>> lookup,
+            List<Failure> failureCollection) {
+        if (failureCollection.isEmpty()) {
+            return Optional.empty();
+        }
+        Failure first = failureCollection.remove(0);
+        Set<ExceptionHistoryEntry> entries = new HashSet<>();
+        for (Failure failure : failureCollection) {
+            entries.add(failure.toExceptionHistoryEntry(lookup));

Review comment:
       Wouldn't simply implementing hashCode & equals for the Failure object do the trick?
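
   For reference, a minimal sketch of what that could look like on `Failure` (which fields should participate is an open question; `java.util.Objects` would need to be imported, and subclasses holding extra state, e.g. the vertex id of a local failure, would have to extend this accordingly):

```java
// Sketch: value semantics based on the fields the class already has (cause + timestamp).
// Throwable has no equals(), so this effectively compares the cause by identity.
@Override
public boolean equals(Object o) {
    if (this == o) {
        return true;
    }
    if (o == null || getClass() != o.getClass()) {
        return false;
    }
    final Failure that = (Failure) o;
    return timestamp == that.timestamp && Objects.equals(cause, that.cause);
}

@Override
public int hashCode() {
    return Objects.hash(cause, timestamp);
}
```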

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java
##########
@@ -1576,4 +1576,15 @@ public ExecutionDeploymentListener getExecutionDeploymentListener() {
     public boolean isDynamic() {
         return isDynamic;
     }
+
+    @Override
+    public Optional<ExecutionVertexID> getExecutionVertexId(ExecutionAttemptID id) {
+        Execution execution = this.getRegisteredExecutions().get(id);
+        return Optional.ofNullable(execution).map(Execution::getVertex).map(ExecutionVertex::getID);
+    }
+
+    @Override
+    public Optional<ExecutionVertex> getExecutionVertex(final ExecutionVertexID executionVertexId) {

Review comment:
       We already have `DefaultExecutionGraph#getExecutionVertexOrThrow` in place, so we should avoid adding a new method.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

