XComp commented on code in PR #21916:
URL: https://github.com/apache/flink/pull/21916#discussion_r1104505322


##########
flink-clients/src/test/java/org/apache/flink/client/ClientHeartbeatTest.java:
##########
@@ -58,11 +59,9 @@ void testJobCancelledIfClientHeartbeatTimeout() throws 
Exception {
         JobClient jobClient = submitJob(createConfiguration(true));
 
         // The client doesn't report heartbeat to the dispatcher.
-
-        assertThat(jobClient.getJobExecutionResult())
-                .failsWithin(Duration.ofSeconds(1))
-                .withThrowableOfType(ExecutionException.class)
-                .withMessageContaining("Job was cancelled");
+        assertThatThrownBy(() -> jobClient.getJobExecutionResult().get())

Review Comment:
   I see your point, @reswqa . But I'm not 100% that this is what we want here. 
The heartbeat interval should have an upper limit as far as I can see. Waiting 
forever might hide some issues. What we have here is a heartbeat handshake that 
is configured through `Dispatcher.CLIENT_ALIVENESS_CHECK_DURATION` in 
[ClientHeartbeatTest:96](https://github.com/apache/flink/blob/294159b3b585b0f2e6daf2d243be42b7ec7dd90c/flink-clients/src/test/java/org/apache/flink/client/ClientHeartbeatTest.java#L96).
   
   The test has two configuration parameters: 
   * `ClientHeartbeatTest.clientHeartbeatInterval` which (based on the name) 
defines the interval in which the heartbeat expiration check is runs (see 
Dispatcher:416](https://github.com/apache/flink/blob/2ad9585f47fe7fa3dcad286194bdcc4dcd131712/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L416)).
 This is set to 50ms within the test.
   * `ClientHeartbeatTest.clientHeartbeatTimeout` in contrast, defines the 
timeout for a client heartbeat which triggers the job cancellation in 
[Dispatcher:1113](https://github.com/apache/flink/blob/2ad9585f47fe7fa3dcad286194bdcc4dcd131712/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L1113).
 
   
   The former one is defined for the cluster (see 
[ClientHeartbeatTest:96](https://github.com/apache/flink/blob/294159b3b585b0f2e6daf2d243be42b7ec7dd90c/flink-clients/src/test/java/org/apache/flink/client/ClientHeartbeatTest.java#L96))
 whereas the latter one is set within the JobGraph (see 
[ClientHeartbeatTest:119](https://github.com/apache/flink/blob/294159b3b585b0f2e6daf2d243be42b7ec7dd90c/flink-clients/src/test/java/org/apache/flink/client/ClientHeartbeatTest.java#L119)).
 You'll see that the test is actually setting `clientHeartbeatTimeout` (=500ms) 
in both cases which seems to be wrong resulting in a worst-case scenario of 
1000ms (i.e. 500ms + 500ms) until the timeout is detected in the job is 
cancelled.
   
   In contrast, what we might want to do is setting the 
`CLIENT_ALIVENESS_CHECK_DURATION` to 
`ClientHeartbeatTest.clientHeartbeatInterval` instead which leads to the job 
being cancelled after 550ms (500ms + 50ms) in the worst case which makes the 1 
second wait time within the test more reasonable again. WDYT?
   
   @xintongsong can you confirm my conclusion or am I missing something?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to