[ https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554582#comment-17554582 ]
Xintong Song commented on FLINK-27420:
--------------------------------------

[~danderson], This has already been fixed for:
- master (1.16.0): 525e3170c622da65becfab3e8afe97303b07b7db
- 1.15.1: fbc8e460cd0f80ecb4387855ab982d896e95af3b

The ticket is kept open for 1.13 & 1.14, where the current patch does not apply.

> Suspended SlotManager fail to reregister metrics when started again
> -------------------------------------------------------------------
>
>                 Key: FLINK-27420
>                 URL: https://issues.apache.org/jira/browse/FLINK-27420
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.13.5
>            Reporter: Ben Augarten
>            Assignee: Ben Augarten
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.16.0, 1.15.1
>
>
> The symptom is that SlotManager metrics are missing (taskslotsavailable and
> taskslotstotal) when a SlotManager is suspended and then restarted. We
> noticed this issue when running 1.13.5, but I believe this impacts 1.14.x,
> 1.15.x, and master.
>
> When a SlotManager is suspended, the [metrics group is
> closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L214].
> When the SlotManager is [started
> again|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L181],
> it attempts to [reregister
> metrics|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L199-L202],
> but that fails because the underlying metrics group [is still
> closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java#L393].
>
>
> I was able to trace through this issue by restarting ZooKeeper nodes in a
> staging environment and watching the JobManager (JM) with a debugger.
>
> A concise test, which currently fails, shows the expected behavior:
> [https://github.com/apache/flink/compare/master...baugarten:baugarten/slot-manager-missing-metrics?expand=1]
>
> I am happy to provide a PR to fix this issue, but first would like to verify
> that this is not intended.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
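
The failure mode described in the report can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: ToyMetricGroup and ToySlotManager are hypothetical stand-ins, not Flink's AbstractMetricGroup or DeclarativeSlotManager, and the "ignore registrations once closed" behavior is modeled from the description above rather than copied from Flink's code. The metric names are taken as written in the report.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class ClosedGroupSketch {

    /** Hypothetical stand-in for a metric group: once closed, registrations are ignored. */
    static class ToyMetricGroup {
        private final Map<String, Supplier<Long>> gauges = new HashMap<>();
        private boolean closed;

        void gauge(String name, Supplier<Long> gauge) {
            if (closed) {
                return; // assumption: a closed group drops new registrations
            }
            gauges.put(name, gauge);
        }

        void close() {
            closed = true;
            gauges.clear();
        }

        boolean has(String name) {
            return gauges.containsKey(name);
        }
    }

    /** Hypothetical stand-in for a slot manager that reuses one group across restarts. */
    static class ToySlotManager {
        private final ToyMetricGroup metricGroup;

        ToySlotManager(ToyMetricGroup metricGroup) {
            this.metricGroup = metricGroup;
        }

        void start() {
            // Registration is attempted on every start.
            metricGroup.gauge("taskslotsavailable", () -> 0L);
            metricGroup.gauge("taskslotstotal", () -> 0L);
        }

        void suspend() {
            // Closing the group here is what makes the next start() ineffective.
            metricGroup.close();
        }
    }

    public static void main(String[] args) {
        ToyMetricGroup group = new ToyMetricGroup();
        ToySlotManager manager = new ToySlotManager(group);

        manager.start();
        System.out.println("after first start: " + group.has("taskslotstotal")); // true

        manager.suspend();
        manager.start();
        System.out.println("after restart:     " + group.has("taskslotstotal")); // false
    }
}
```

In this toy model the second start() runs the same registration calls as the first, but they have no effect because the group was closed during suspend(), which matches the missing-metrics symptom in the report.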