Liu created FLINK-24174:
---------------------------

             Summary: MiniClusterTestEnvironment's triggerTaskManagerFailover 
may get stuck in CommonTestUtils.waitForJobStatus()
                 Key: FLINK-24174
                 URL: https://issues.apache.org/jira/browse/FLINK-24174
             Project: Flink
          Issue Type: Improvement
          Components: Test Infrastructure
            Reporter: Liu


When writing TaskManager failover tests with the [unified testing framework for 
connectors|https://issues.apache.org/jira/browse/FLINK-19554], I found that 
triggerTaskManagerFailover may get stuck in CommonTestUtils.waitForJobStatus(). 
The sequence is as follows:
 # triggerTaskManagerFailover is called.
 # JobStatus switched from RUNNING to RESTARTING.
 # JobStatus switched from RESTARTING to RUNNING.
 # terminateTaskManager() returns.
 # Since the job status is already RUNNING again, 
CommonTestUtils.waitForJobStatus(), which waits for one of FAILING, FAILED or 
RESTARTING, never sees an expected status and will never exit.

A solution is to call terminateTaskManager() asynchronously and, while it is 
running, call CommonTestUtils.waitForJobStatus(). The pseudocode could look as 
follows:
{code:java}
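public void triggerTaskManagerFailover(JobClient jobClient, Runnable afterFailAction)
        throws Exception {
    // Proposed change: terminateTaskManager() returns a future instead of blocking.
    CompletableFuture<Void> completableFuture = terminateTaskManager();
    // Wait for the failover while the termination is still in progress.
    CommonTestUtils.waitForJobStatus(
            jobClient,
            Arrays.asList(JobStatus.FAILING, JobStatus.FAILED, JobStatus.RESTARTING),
            Deadline.fromNow(Duration.ofMinutes(5)));
    // Make sure the termination has finished before continuing.
    completableFuture.get();
    afterFailAction.run();
    startTaskManager();
}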
{code}
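
To illustrate the race outside of Flink, here is a minimal, self-contained sketch. It does not use any Flink classes: FakeJob, its timings, and the helper methods sequentialFailover() and concurrentFailover() are hypothetical stand-ins that only model a job whose status can be polled. Under these assumptions, waiting after a blocking termination misses the transient RESTARTING status, while waiting concurrently with an asynchronous termination should observe it.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

public class FailoverRaceSketch {

    enum Status { RUNNING, RESTARTING, FAILING, FAILED }

    // Statuses the waiting side accepts, mirroring the list in the issue.
    static final Set<Status> EXPECTED =
            EnumSet.of(Status.FAILING, Status.FAILED, Status.RESTARTING);

    /** Hypothetical stand-in for a job whose status a test can only poll. */
    static final class FakeJob {
        final AtomicReference<Status> status = new AtomicReference<>(Status.RUNNING);

        /** Blocks until the simulated job has restarted and is RUNNING again. */
        void terminateTaskManager() {
            status.set(Status.RESTARTING);
            pause(500); // assumed duration of the simulated restart
            status.set(Status.RUNNING);
        }
    }

    /** Polls until an expected status is seen; returns false on timeout. */
    static boolean waitForJobStatus(FakeJob job, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (EXPECTED.contains(job.status.get())) {
                return true;
            }
            pause(10);
        }
        return false;
    }

    /** Current behavior: terminate first, then wait. The restart is already over. */
    static boolean sequentialFailover() {
        FakeJob job = new FakeJob();
        job.terminateTaskManager();
        return waitForJobStatus(job, 300);
    }

    /** Proposed behavior: terminate asynchronously and wait at the same time. */
    static boolean concurrentFailover() {
        FakeJob job = new FakeJob();
        CompletableFuture<Void> termination =
                CompletableFuture.runAsync(job::terminateTaskManager);
        boolean seen = waitForJobStatus(job, 5_000);
        termination.join(); // make sure the termination has finished
        return seen;
    }

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("sequential saw failover: " + sequentialFailover());
        System.out.println("concurrent saw failover: " + concurrentFailover());
    }
}
```

Note that the concurrent variant still relies on polling, so in this sketch it only works because the simulated restart (500 ms) lasts much longer than the poll interval (10 ms). A very short restart could in principle still be missed.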



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
