[ https://issues.apache.org/jira/browse/FLINK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836395#comment-17836395 ]
yuanfenghu edited comment on FLINK-35035 at 4/12/24 6:08 AM:
-------------------------------------------------------------

[~echauchot] Thank you for your reply.
I think you are looking at this scenario from the perspective of Reactive Mode, because Reactive Mode uses only the resources of the cluster as the criterion for job parallelism. I am not sure whether I understand that correctly, but my scenario above is in non-Reactive Mode. I only use the adaptive scheduler, and I increase the parallelism of the running job from 10 to 12. However, because min-parallelism-increase=5, the scale-up cannot be triggered immediately even once the cluster has enough slots for a parallelism of 12; it has to wait for scaling-interval.max before it can be triggered.
My goal is to trigger the scale-up as soon as the parallelism of 12 can be met, instead of having to wait for scaling-interval.max to elapse or for min-parallelism-increase to be satisfied.


was (Author: JIRAUSER296932):
[~echauchot] Thank you for your reply.
You should look at this issue from the perspective of Reactive Mode, because Reactive Mode only uses the resources of the cluster as a criterion for task parallelism. I don’t know if I understand it correctly. But my above scenario is in non-Reactive Mode. But I use the adaptive scheduler, which means that I increase the parallelism of the running task from 10 to 12. However, because min-parallelism-increase=5, I am satisfied in the cluster slot. When the condition of 12 is met, the expansion of the task cannot be triggered immediately, but it needs to wait for scaling-interval.max before the expansion can be triggered.
My purpose is to trigger the expansion when the parallelism of 12 is met, instead of having to after scaling-interval.max


> Reduce job pause time when cluster resources are expanded in adaptive mode
> --------------------------------------------------------------------------
>
>                 Key: FLINK-35035
>                 URL: https://issues.apache.org/jira/browse/FLINK-35035
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>    Affects Versions: 1.19.0
>            Reporter: yuanfenghu
>            Priority: Minor
>
> When 'jobmanager.scheduler = adaptive', job graph changes triggered by
> cluster expansion cause the job to stall for a long time. We should reduce
> this impact.
> As an example:
> I have a job graph: [v1 (maxp=10, minp=1)] -> [v2 (maxp=10, minp=1)]
> When my cluster has 5 slots, the job is executed as [v1 p5] -> [v2 p5].
> When I add slots, a job graph change is triggered by
> org.apache.flink.runtime.scheduler.adaptive.ResourceListener#onNewResourcesAvailable.
> However, the five new slots I added are not discovered at the same time (for
> convenience, I assume that each TaskManager has one slot), because in any
> environment we cannot guarantee that all new slots arrive at once. As a
> result, onNewResourcesAvailable is triggered repeatedly. If there is some
> interval between the arrival of each new slot, the job graph keeps changing
> throughout that period. What I hope for is a configurable stabilization
> time: the job graph change is triggered only after the number of cluster
> slots has been stable for a certain period, which avoids this situation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
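
The difference between the behavior described in the comment and the behavior it asks for can be sketched as a small decision function. This is a simplified model and not Flink's actual implementation: the class RescaleDecision, its method names, and the single "elapsed" duration are hypothetical, and the three fields merely stand in for min-parallelism-increase, scaling-interval.min, and scaling-interval.max. Keeping the scaling-interval.min cooldown in the proposed variant is also an assumption.

```java
import java.time.Duration;

/** Hypothetical, simplified model of the rescale decision discussed in FLINK-35035. */
final class RescaleDecision {
    // Stand-ins for the adaptive scheduler options named in the comment.
    private final int minParallelismIncrease;
    private final Duration scalingIntervalMin;
    private final Duration scalingIntervalMax;

    RescaleDecision(int minParallelismIncrease, Duration scalingIntervalMin, Duration scalingIntervalMax) {
        this.minParallelismIncrease = minParallelismIncrease;
        this.scalingIntervalMin = scalingIntervalMin;
        this.scalingIntervalMax = scalingIntervalMax;
    }

    /**
     * Behavior as described in the comment: a scale-up smaller than
     * min-parallelism-increase only happens once scaling-interval.max has elapsed.
     * "elapsed" is a simplification (time since the last rescale).
     */
    boolean currentBehavior(int currentParallelism, int availableParallelism, Duration elapsed) {
        if (availableParallelism <= currentParallelism) {
            return false; // nothing to gain
        }
        if (elapsed.compareTo(scalingIntervalMin) < 0) {
            return false; // still inside the cooldown
        }
        boolean bigEnough = availableParallelism - currentParallelism >= minParallelismIncrease;
        boolean forced = elapsed.compareTo(scalingIntervalMax) >= 0;
        return bigEnough || forced;
    }

    /**
     * Behavior requested in the comment: also rescale as soon as the desired
     * parallelism can be met, even if the increase is below the threshold.
     */
    boolean proposedBehavior(int currentParallelism, int availableParallelism,
                             int desiredParallelism, Duration elapsed) {
        if (elapsed.compareTo(scalingIntervalMin) < 0) {
            return false; // assumption: the cooldown still applies
        }
        boolean reachesDesired = desiredParallelism > currentParallelism
                && availableParallelism >= desiredParallelism;
        return reachesDesired || currentBehavior(currentParallelism, availableParallelism, elapsed);
    }
}
```

With min-parallelism-increase=5 and the scenario from the comment (running at 10, desired 12, slots for 12 available), currentBehavior returns false until scaling-interval.max has elapsed, while proposedBehavior returns true as soon as the cooldown is over.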