[ https://issues.apache.org/jira/browse/FLINK-26773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534616#comment-17534616 ]
Jonathan Lazarus commented on FLINK-26773:
------------------------------------------

I have a solution that I would like to contribute; it involves adding a global flag to the JobMaster. Could I please be assigned this issue?

> ResourceManager leader election can trigger a reconnect while shutting down the JobMaster
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-26773
>                 URL: https://issues.apache.org/jira/browse/FLINK-26773
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.14.4, 1.16.0
>            Reporter: Matthias Pohl
>            Priority: Major
>         Attachments: FLINK-26773.failure-during-shutdown.log
>
>
> There's a race condition in the {{ResourceManager}} leader election
> handling of the {{JobMaster}} while the {{JobMaster}} is shutting down.
> During its own shutdown, the {{JobMaster}} calls
> {{dissolveResourceManagerConnection}} to disconnect itself from the
> {{ResourceManager}} (see
> [JobMaster:1180|https://github.com/apache/flink/blob/fdb80108a3c0e4fb12dbc3f89ecb2327d97deebf/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1180]).
> This closes the RM connection to the {{JobMaster}} from the
> {{ResourceManager}}'s side (see
> [ResourceManager:979|https://github.com/apache/flink/blob/9055279d0286f4374694325250a45dc1c60301a7/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L979]).
> The {{JobMaster}} then tries to reconnect to the {{ResourceManager}} leader
> if there's still an address stored for that leader (which is the case during
> shutdown; see
> [JobMaster:790|https://github.com/apache/flink/blob/fdb80108a3c0e4fb12dbc3f89ecb2327d97deebf/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L790]).
> The {{JobMaster}} shouldn't try to reconnect after it has already freed its
> requirements as part of the shutdown.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
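
For illustration only, the flag-based guard mentioned in the comment could look roughly like the sketch below. This is not Flink code and not the proposed patch: the class {{JobMasterSketch}}, the field {{shuttingDown}}, and the {{events}} log are hypothetical names, and the real {{JobMaster}} methods have different signatures. The sketch also assumes all calls happen on a single thread, so a plain boolean is enough.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified stand-in for the JobMaster (not Flink code).
class JobMasterSketch {
    private String resourceManagerAddress; // last known RM leader address, may be null
    private boolean connected;
    private boolean shuttingDown;          // the guard flag; the name is an assumption
    final List<String> events = new ArrayList<>();

    // Leader election announces a (new) ResourceManager leader.
    void notifyOfNewResourceManagerLeader(String address) {
        resourceManagerAddress = address;
        reconnectToResourceManager();
    }

    // The ResourceManager closed the connection from its side.
    void disconnectResourceManager() {
        connected = false;
        events.add("disconnected");
        // Without the guard below, this would reconnect even during shutdown,
        // because the leader address is still stored.
        reconnectToResourceManager();
    }

    // Shutdown path: set the flag first, then dissolve the connection.
    void onStop() {
        shuttingDown = true;
        dissolveResourceManagerConnection();
    }

    private void dissolveResourceManagerConnection() {
        if (connected) {
            events.add("dissolve");
            // Simulates the ResourceManager answering by closing its side.
            disconnectResourceManager();
        }
    }

    private void reconnectToResourceManager() {
        if (shuttingDown) {
            events.add("reconnect skipped");
            return;
        }
        if (resourceManagerAddress != null) {
            connected = true;
            events.add("connect " + resourceManagerAddress);
        }
    }

    boolean isConnected() {
        return connected;
    }

    public static void main(String[] args) {
        JobMasterSketch jobMaster = new JobMasterSketch();
        jobMaster.notifyOfNewResourceManagerLeader("rm-1");
        jobMaster.onStop();
        System.out.println(jobMaster.events);
    }
}
```

The point of the sketch is the ordering in {{onStop}}: the flag is set before the connection is dissolved, so the disconnect callback that follows cannot start a new connection attempt.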