XComp commented on code in PR #21916: URL: https://github.com/apache/flink/pull/21916#discussion_r1104505322
########## flink-clients/src/test/java/org/apache/flink/client/ClientHeartbeatTest.java: ########## @@ -58,11 +59,9 @@ void testJobCancelledIfClientHeartbeatTimeout() throws Exception { JobClient jobClient = submitJob(createConfiguration(true)); // The client doesn't report heartbeat to the dispatcher. - - assertThat(jobClient.getJobExecutionResult()) - .failsWithin(Duration.ofSeconds(1)) - .withThrowableOfType(ExecutionException.class) - .withMessageContaining("Job was cancelled"); + assertThatThrownBy(() -> jobClient.getJobExecutionResult().get()) Review Comment: I see your point, @reswqa . But I'm not 100% that this is what we want here. The heartbeat interval should have an upper limit as far as I can see. Waiting forever might hide some issues. What we have here is a heartbeat handshake that is configured through `Dispatcher.CLIENT_ALIVENESS_CHECK_DURATION` in [ClientHeartbeatTest:96](https://github.com/apache/flink/blob/294159b3b585b0f2e6daf2d243be42b7ec7dd90c/flink-clients/src/test/java/org/apache/flink/client/ClientHeartbeatTest.java#L96). The test has two configuration parameters: * `ClientHeartbeatTest.clientHeartbeatInterval` which (based on the name) defines the interval in which the heartbeat expiration check is runs (see Dispatcher:416](https://github.com/apache/flink/blob/2ad9585f47fe7fa3dcad286194bdcc4dcd131712/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L416)). This is set to 50ms within the test. * `ClientHeartbeatTest.clientHeartbeatTimeout` in contrast, defines the timeout for a client heartbeat which triggers the job cancellation in [Dispatcher:1113](https://github.com/apache/flink/blob/2ad9585f47fe7fa3dcad286194bdcc4dcd131712/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L1113). The former one is defined for the cluster (see [ClientHeartbeatTest:96](https://github.com/apache/flink/blob/294159b3b585b0f2e6daf2d243be42b7ec7dd90c/flink-clients/src/test/java/org/apache/flink/client/ClientHeartbeatTest.java#L96)) whereas the latter one is set within the JobGraph (see [ClientHeartbeatTest:119](https://github.com/apache/flink/blob/294159b3b585b0f2e6daf2d243be42b7ec7dd90c/flink-clients/src/test/java/org/apache/flink/client/ClientHeartbeatTest.java#L119)). You'll see that the test is actually setting `clientHeartbeatTimeout` (=500ms) in both cases which seems to be wrong resulting in a worst-case scenario of 1000ms (i.e. 500ms + 500ms) until the timeout is detected in the job is cancelled. In contrast, what we might want to do is setting the `CLIENT_ALIVENESS_CHECK_DURATION` to `ClientHeartbeatTest.clientHeartbeatInterval` instead which leads to the job being cancelled after 550ms (500ms + 50ms) in the worst case which makes the 1 second wait time within the test more reasonable again. WDYT? @xintongsong can you confirm my conclusion or am I missing something? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org