1996fanrui commented on code in PR #24340: URL: https://github.com/apache/flink/pull/24340#discussion_r1495400609
########## flink-tests/src/test/java/org/apache/flink/test/checkpointing/AutoRescalingITCase.java: ########## @@ -341,7 +343,7 @@ public void testCheckpointRescalingNonPartitionedStateCausesException() throws E restClusterClient.updateJobResourceRequirements(jobID, builder.build()).join(); - waitForRunningTasks(restClusterClient, jobID, parallelism2); + waitForRunningTasks(restClusterClient, jobID, 2 * parallelism2); Review Comment: `parallelism2` is the task number for before rescale, and `2 * parallelism2` is the task number for after rescale. Ideally, we should wait for all tasks after rescale to become running. That's exactly what FLINK-34336 fix. > Why does it increase the hang possibility? When the rescale with cooldown=30s, the rescale action will be executed after 30s. When the `waitForRunningTasks` is called, the job is still running with old parallelism, so the `waitForRunningTasks` can be done. When we disable the scaling cooldown, the rescale may happen immediately. When it happens, the job will be running with `2 * parallelism2`, so `waitForRunningTasks` with old parallelism2 will be hang forever. > And why did we merge it into master? IIUC, we should wait for all tasks after rescale to become running. But these 2 callers wait for all tasks before rescale. These 2 wait are unexpected, so I think it's a separate bug regardless of we disable/enable cooldown. It means we should wait for all tasks after rescale even if we enable cooldown=30s. That's why I create a separate JIRA (FLINK-34336) to follow it. Why do I merge it into master? 1. I think `Disable the scaling cooldown to speed up the test` and FLINK-34336 are separate 2. I guess `Disable the scaling cooldown to speed up the test` may be cause AutoRescalingITCase hang, so I prepare https://github.com/apache/flink/pull/24248 in advanced. It's easy to be merged to fix the hang. > Why do we merge it then now? 1.19 will be released soon, I'm afraid to affect this release. So I merge it in master(1.20) first. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org