[ https://issues.apache.org/jira/browse/FLINK-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039875#comment-17039875 ]
YufeiLiu commented on FLINK-15959:
----------------------------------

[~xintongsong]
I'm worried about the uncertainty. Losing one TM won't make much difference, but the situation could get worse after several restarts. If the source operator's parallelism is lower than that of the other operators, the source tasks tend to converge on a few TMs after TMs have been lost several times.
I think combining the functions of {{SlotPool}} and {{ResourceManager}} is a good idea for the long term, and these configs would work in any case if the JobMaster didn't cache slots.

> Add min/max number of slots configuration to limit total number of slots
> ------------------------------------------------------------------------
>
>                 Key: FLINK-15959
>                 URL: https://issues.apache.org/jira/browse/FLINK-15959
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.0
>            Reporter: YufeiLiu
>            Priority: Major
>
> Flink removed the `-n` option after FLIP-6 and changed to having the
> ResourceManager start a new worker when required. But I think maintaining a
> certain number of slots is necessary. These workers would start immediately
> when the ResourceManager starts and would not be released even if all their
> slots are free.
> Here are some reasons:
> # Users usually know how many resources a single job needs; initializing
> all workers when the cluster starts can speed up the startup process.
> # Jobs are scheduled in topological order; the next operator isn't scheduled
> until the slots for the prior executions have been allocated. In some cases
> the TaskExecutors start in several batches, which can slow down startup.
> # Flink supports
> [FLINK-12122|https://issues.apache.org/jira/browse/FLINK-12122] [Spread out
> tasks evenly across all available registered TaskManagers], but it only
> takes effect if all TMs are registered. Starting all TMs at the beginning
> can solve this problem.
> *suggestion:*
> * Add configs "taskmanager.minimum.numberOfTotalSlots" and
> "taskmanager.maximum.numberOfTotalSlots"; the default behavior stays as
> before.
> * Start enough workers to satisfy the minimum number of slots when the
> ResourceManager accepts leadership (subtracting recovered workers).
> * Don't complete slot requests until the minimum number of slots is
> registered, and throw an exception when the maximum is exceeded.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
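
For illustration only, the three suggestions in the quoted issue could be sketched roughly as below. This is a minimal sketch under stated assumptions, not Flink code: the class {{SlotBounds}}, its method names, the fixed slots-per-worker count, and the "unbounded" defaults (minimum 0, no maximum) are all hypothetical and only mirror the proposed min/max options.

```java
/** Hypothetical helper illustrating the proposal; not part of Flink's API. */
final class SlotBounds {
    private final int minSlots;       // proposed "minimum numberOfTotalSlots"
    private final int maxSlots;       // proposed "maximum numberOfTotalSlots"
    private final int slotsPerWorker; // assumed identical for every worker

    SlotBounds(int minSlots, int maxSlots, int slotsPerWorker) {
        if (slotsPerWorker <= 0) {
            throw new IllegalArgumentException("slotsPerWorker must be positive");
        }
        if (minSlots < 0 || maxSlots < minSlots) {
            throw new IllegalArgumentException("require 0 <= minSlots <= maxSlots");
        }
        this.minSlots = minSlots;
        this.maxSlots = maxSlots;
        this.slotsPerWorker = slotsPerWorker;
    }

    /** Assumed defaults: no minimum and no maximum, i.e. purely on-demand workers. */
    static SlotBounds unbounded(int slotsPerWorker) {
        return new SlotBounds(0, Integer.MAX_VALUE, slotsPerWorker);
    }

    /** Workers to start on leadership grant, after subtracting recovered workers. */
    int workersToStart(int recoveredWorkers) {
        long missingSlots = (long) minSlots - (long) recoveredWorkers * slotsPerWorker;
        if (missingSlots <= 0) {
            return 0;
        }
        // Round up so that the started workers cover every missing slot.
        return (int) ((missingSlots + slotsPerWorker - 1) / slotsPerWorker);
    }

    /** Slot requests are held back until the minimum number of slots is registered. */
    boolean canCompleteSlotRequests(int registeredSlots) {
        return registeredSlots >= minSlots;
    }

    /** Throws if starting one more worker would push the total past the maximum. */
    void checkCanAddWorker(int currentTotalSlots) {
        if ((long) currentTotalSlots + slotsPerWorker > maxSlots) {
            throw new IllegalStateException(
                    "Starting another worker would exceed the maximum of " + maxSlots + " slots");
        }
    }
}
```

With a minimum of 10 slots, a maximum of 16, and 4 slots per worker, {{workersToStart(1)}} returns 2 (one recovered worker leaves 6 slots missing), and {{checkCanAddWorker(16)}} throws because a fifth worker would bring the total to 20.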