zentol commented on a change in pull request #14963:
URL: https://github.com/apache/flink/pull/14963#discussion_r579108209



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/declarative/DeclarativeScheduler.java
##########
@@ -907,20 +909,37 @@ public void runIfState(State expectedState, Runnable action, Duration delay) {
 
     // ----------------------------------------------------------------
 
+    /** Note: Do not call this method from a State constructor. */
     @VisibleForTesting
-    void transitionToState(State newState) {
-        if (state != newState) {
-            LOG.debug(
-                    "Transition from state {} to {}.",
-                    state.getClass().getSimpleName(),
-                    newState.getClass().getSimpleName());
-
-            State oldState = state;
-            oldState.onLeave(newState.getClass());
-
-            state = newState;
-            newState.onEnter();
-        }
+    <S extends State> void transitionToState(StateFactory<S> targetState) {
+        Preconditions.checkState(
+                state != null, "State transitions are not allowed while constructing a state.");
+        Preconditions.checkState(
+                state.getClass() != targetState.getStateClass(),
+                "Attempted to transition into the very state the scheduler is already in.");
+
+        LOG.debug(
+                "Transition from state {} to {}.",
+                state.getClass().getSimpleName(),
+                targetState.getStateClass().getSimpleName());
+
+        State oldState = state;
+        oldState.onLeave(targetState.getStateClass());
+
+        // Guard against state transitions while constructing state objects.
+        //
+        // Consider the following scenario:
+        // Scheduler is in state Restarting. Once the cancellation is complete, we enter the
+        // transitionToState(WaitingForResources) method.
+        // In the constructor of WaitingForResources, we call `notifyNewResourcesAvailable()`, which
+        // finds resources and enters transitionToState(Executing). We are in state Executing. Then
+        // we return from the methods and go back in our call stack to the
+        // transitionToState(WaitingForResources) call, where we overwrite Executing with
+        // WaitingForResources. And there we have it, a deployed execution graph, and a scheduler
+        // that is in WaitingForResources.
+        state = null;

Review comment:
       "The issues are fixed now; we don't need safeguards anymore" could just 
as well be used as an argument to keep the PR as is and even remove the state transition check. We fixed the one problematic case and could call it a day.
   
   Overall, my impression is that we should not allow immediate state transitions in the constructor, `onEnter`, or `onLeave` in any case, because all of these result in weird loops or interleavings of state transitions that can lead to subtle issues.
   IOW, `transitionToState` should be an atomic operation that fully completes before another transition can be triggered. Any attempt at triggering a state transition while one is still in progress should fail hard.
   
   Hence, whether `onEnter` exists or not is actually not relevant to this consideration. While it was the sole case where this issue occurred, the underlying issue is the unclear and unenforced contract as to what a State is allowed to do in which methods.
   
   As you have shown in the PR, it is pretty easy to safeguard against such occurrences; you'd just need to null the state before calling `onLeave`.
   
   Alternatively, this would also work, and would be a bit more consolidated:
   ```
   State oldState = state;
   state = null;
   oldState.onLeave(targetState.getStateClass());
   State newState = targetState.getState();
   newState.onEnter();
   // this will fail if any other state transition occurred in the meantime
   Preconditions.checkState(state == null);
   state = newState;
   ```
   
   And as far as I'm concerned, that's a pretty tiny cost compared to the risk of testing entirely theoretical scenarios or breaking the state machine.
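
To make the "fail hard" behavior concrete, here is a minimal, self-contained sketch of the guard. It is not the actual scheduler code: `GuardedStateMachine`, the `factory` helper, and the stripped-down `State` and `StateFactory` interfaces are hypothetical stand-ins, and a plain `IllegalStateException` takes the place of `Preconditions.checkState`. The `WaitingForResources` constructor deliberately misbehaves to replay the scenario from the code comment.

```java
import java.util.function.Supplier;

/** Hypothetical, simplified stand-in for the scheduler; not the actual Flink classes. */
class GuardedStateMachine {

    interface State {
        default void onLeave(Class<? extends State> next) {}
    }

    /** Defers construction of the target state until the transition is underway. */
    interface StateFactory<S extends State> {
        Class<S> getStateClass();

        S getState();
    }

    static final class Restarting implements State {}

    static final class Executing implements State {}

    /** Misbehaving state: triggers another transition from its constructor. */
    static final class WaitingForResources implements State {
        WaitingForResources(GuardedStateMachine machine) {
            machine.transitionToState(factory(Executing.class, Executing::new));
        }
    }

    private State state = new Restarting();

    State getState() {
        return state;
    }

    /** Helper (an assumption of this sketch) that builds a factory from a constructor. */
    static <S extends State> StateFactory<S> factory(Class<S> type, Supplier<S> ctor) {
        return new StateFactory<S>() {
            @Override
            public Class<S> getStateClass() {
                return type;
            }

            @Override
            public S getState() {
                return ctor.get();
            }
        };
    }

    <S extends State> void transitionToState(StateFactory<S> target) {
        // A null state marks a transition in progress; nested attempts fail here.
        if (state == null) {
            throw new IllegalStateException(
                    "State transitions are not allowed while another transition is in progress.");
        }
        State oldState = state;
        state = null;
        oldState.onLeave(target.getStateClass());
        State newState = target.getState();
        // Second line of defense: nothing may have set the state in the meantime.
        if (state != null) {
            throw new IllegalStateException("Another state transition occurred in the meantime.");
        }
        state = newState;
    }

    /** Runs one well-behaved transition and one nested transition, and reports both. */
    static String demo() {
        GuardedStateMachine ok = new GuardedStateMachine();
        ok.transitionToState(factory(Executing.class, Executing::new));
        String first = ok.getState().getClass().getSimpleName();

        GuardedStateMachine bad = new GuardedStateMachine();
        try {
            bad.transitionToState(
                    factory(WaitingForResources.class, () -> new WaitingForResources(bad)));
            return first + ", nested transition went through";
        } catch (IllegalStateException e) {
            return first + ", nested transition rejected";
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the nested call never gets the chance to overwrite anything: it is rejected at the top of `transitionToState`, so the "deployed execution graph, but scheduler in WaitingForResources" outcome cannot happen silently.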



