[jira] [Commented] (FLINK-27420) Suspended SlotManager fail to reregister metrics when started again

2022-06-15 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554587#comment-17554587
 ] 

Xintong Song commented on FLINK-27420:
--

[~baugarten],

Are you still going to work on this? If you don't have the time, I can take it 
over and fix it for 1.14. I'll probably leave it as is for 1.13.

> Suspended SlotManager fail to reregister metrics when started again
> ---
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.13.5
>Reporter: Ben Augarten
>Assignee: Ben Augarten
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0, 1.15.1
>
>
> The symptom is that SlotManager metrics are missing (taskslotsavailable and 
> taskslotstotal) when a SlotManager is suspended and then restarted. We 
> noticed this issue when running 1.13.5, but I believe this impacts 1.14.x, 
> 1.15.x, and master.
>  
> When a SlotManager is suspended, the [metrics group is 
> closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L214].
>  When the SlotManager is [started 
> again|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L181],
>  it makes an attempt to [reregister 
> metrics|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/DeclarativeSlotManager.java#L199-L202],
>  but that fails because the underlying metrics group [is still 
> closed|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/AbstractMetricGroup.java#L393]
>  
>  
> I was able to trace through this issue by restarting ZooKeeper nodes in a 
> staging environment and watching the JM with a debugger. 
>  
> A concise test, which currently fails, shows the expected behavior – 
> [https://github.com/apache/flink/compare/master...baugarten:baugarten/slot-manager-missing-metrics?expand=1]
>  
> I am happy to provide a PR to fix this issue, but first would like to verify 
> that this is not intended.
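
The failure mode described in the report can be illustrated with a minimal sketch. The classes below are hypothetical, simplified stand-ins and not Flink's actual metric group implementation; the sketch only assumes, as the description states, that a closed group stays closed and drops later registrations without raising an error.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical illustration only; none of these types are Flink classes. */
public class ClosedGroupSketch {

    /** Simplified metric group: once closed, it silently ignores new registrations. */
    static final class SketchMetricGroup {
        private final Map<String, Supplier<Long>> gauges = new HashMap<>();
        private boolean closed;

        /** Registers a gauge unless the group has been closed. */
        synchronized void gauge(String name, Supplier<Long> value) {
            if (!closed) {
                gauges.put(name, value);
            }
        }

        /** Drops all registered metrics; the group cannot be reopened afterwards. */
        synchronized void close() {
            closed = true;
            gauges.clear();
        }

        synchronized boolean isRegistered(String name) {
            return gauges.containsKey(name);
        }
    }

    /** Mimics start, suspend, start on one shared group and reports whether the gauge exists. */
    static boolean gaugePresentAfterRestart() {
        SketchMetricGroup group = new SketchMetricGroup();
        group.gauge("taskslotstotal", () -> 4L); // first start registers the gauge
        group.close();                           // suspend closes the group
        group.gauge("taskslotstotal", () -> 4L); // second start: registration is dropped
        return group.isRegistered("taskslotstotal");
    }

    public static void main(String[] args) {
        System.out.println(gaugePresentAfterRestart());
    }
}
```

In this sketch `gaugePresentAfterRestart()` returns `false`, which corresponds to the missing metrics after the restart; no exception is thrown at any point, so the problem is easy to overlook.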



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27420) Suspended SlotManager fail to reregister metrics when started again

2022-06-15 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554582#comment-17554582
 ] 

Xintong Song commented on FLINK-27420:
--

[~danderson],

This has already been fixed for:
- master (1.16.0): 525e3170c622da65becfab3e8afe97303b07b7db
- 1.15.1: fbc8e460cd0f80ecb4387855ab982d896e95af3b

The ticket is kept open for 1.13 & 1.14, where the current patch does not apply.

> Suspended SlotManager fail to reregister metrics when started again
> ---
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.13.5
>Reporter: Ben Augarten
>Assignee: Ben Augarten
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0, 1.15.1


[jira] [Commented] (FLINK-27420) Suspended SlotManager fail to reregister metrics when started again

2022-06-15 Thread David Anderson (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554569#comment-17554569
 ] 

David Anderson commented on FLINK-27420:


[~xtsong] What's the status here? Do you want to finalize this for 1.15.1? I 
don't intend to block the release on it, but we are waiting on something else 
at the moment.

> Suspended SlotManager fail to reregister metrics when started again
> ---
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.13.5
>Reporter: Ben Augarten
>Assignee: Ben Augarten
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0, 1.15.1, 1.14.6


[jira] [Commented] (FLINK-27420) Suspended SlotManager fail to reregister metrics when started again

2022-05-04 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17532016#comment-17532016
 ] 

Xintong Song commented on FLINK-27420:
--

Thanks [~gaoyunhaii] for fixing the broken branches, and [~Nicolaus Weidner] 
for looking into this.

I noticed that [~baugarten] has already opened a new PR for the 1.15 & 1.14 
branches. I'll help with finalizing the PR.

> Suspended SlotManager fail to reregister metrics when started again
> ---
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.13.5
>Reporter: Ben Augarten
>Assignee: Ben Augarten
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0, 1.14.5, 1.15.1


[jira] [Commented] (FLINK-27420) Suspended SlotManager fail to reregister metrics when started again

2022-05-03 Thread Nicolaus Weidner (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531222#comment-17531222
 ] 

Nicolaus Weidner commented on FLINK-27420:
--

Looks fine on master, but in the backports, a test was backported without 
checking that the same variables are available. E.g. on release-1.15,  
delegationTokenManager is undefined: 
https://github.com/apache/flink/blob/e0c82d6d52871dbbea70c9b41384d2d33179bec0/flink-runtime/src/test/java/org/apache/flink/runtime/resourcemanager/ResourceManagerServiceImplTest.java#L520.
 It was added in this commit: 
https://github.com/apache/flink/commit/26aa543b3bbe2b606bbc6d332a2ef7c5b46d25eb

I didn't check for the specific issue on release-1.14.

> Suspended SlotManager fail to reregister metrics when started again
> ---
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.13.5
>Reporter: Ben Augarten
>Assignee: Ben Augarten
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0, 1.14.5, 1.15.1


[jira] [Commented] (FLINK-27420) Suspended SlotManager fail to reregister metrics when started again

2022-05-02 Thread Yun Gao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531039#comment-17531039
 ] 

Yun Gao commented on FLINK-27420:
-

Hi, I'll first revert the commits on 1.15 and 1.14 since they fail 
compilation. Let's double-check and then re-merge the commits. 

> Suspended SlotManager fail to reregister metrics when started again
> ---
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.13.5
>Reporter: Ben Augarten
>Assignee: Ben Augarten
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0, 1.14.5, 1.15.1


[jira] [Commented] (FLINK-27420) Suspended SlotManager fail to reregister metrics when started again

2022-04-29 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529866#comment-17529866
 ] 

Xintong Song commented on FLINK-27420:
--

Fixed via
- master (1.16): 525e3170c622da65becfab3e8afe97303b07b7db
- release-1.15: e0c82d6d52871dbbea70c9b41384d2d33179bec0
- release-1.14: 054b59bb97d14e453745e18f8d9cc90b109bb33a

Leaving the ticket open to fix this for 1.13.

> Suspended SlotManager fail to reregister metrics when started again
> ---
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.13.5
>Reporter: Ben Augarten
>Assignee: Ben Augarten
>Priority: Major
>  Labels: pull-request-available


[jira] [Commented] (FLINK-27420) Suspended SlotManager fail to reregister metrics when started again

2022-04-28 Thread Ben Augarten (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529741#comment-17529741
 ] 

Ben Augarten commented on FLINK-27420:
--

[~xtsong] – thanks, I posted a PR against master here: 
[https://github.com/apache/flink/pull/19607]. Let me know if you'd like any 
changes to the PR and I'm happy to make them. I can post PRs against 
release-1.14 and release-1.15 after.

It's good to know that there might be another bugfix release for 1.13.x. I'm 
going to work on a patch for 1.13 as well because most of our applications run 
on 1.13.

> Suspended SlotManager fail to reregister metrics when started again
> ---
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.13.5
>Reporter: Ben Augarten
>Assignee: Ben Augarten
>Priority: Major
>  Labels: pull-request-available


[jira] [Commented] (FLINK-27420) Suspended SlotManager fail to reregister metrics when started again

2022-04-27 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529153#comment-17529153
 ] 

Xintong Song commented on FLINK-27420:
--

[~baugarten],

Thanks for volunteering to fix this. I've assigned you to this ticket.

Regarding 1.13, I'm surprised that you have the JM process live through 
multiple leader sessions. IIRC, we tried this before making the changes in 
1.14, and the JM process was terminated after losing leadership. Unfortunately, 
I cannot recall more details about how it was terminated. Anyway, if that works 
for you, I'd be fine with also fixing this for 1.13.

According to the [Update Policy for old 
releases|https://flink.apache.org/downloads.html#update-policy-for-old-releases],
 the Flink community provides support for the latest 2 versions, but is also 
open to discussing bugfix releases for older versions. In fact, it is not rare 
for us to ship a bugfix release for the 3rd latest version. To sum up, although I 
don't know whether (or when) there will be another bugfix release for 1.13.x, I 
would not say that a patch for 1.13 is undesired.

> Suspended SlotManager fail to reregister metrics when started again
> ---
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.13.5
>Reporter: Ben Augarten
>Assignee: Ben Augarten
>Priority: Major


[jira] [Commented] (FLINK-27420) Suspended SlotManager fail to reregister metrics when started again

2022-04-27 Thread Ben Augarten (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529041#comment-17529041
 ] 

Ben Augarten commented on FLINK-27420:
--

Thanks for adding that context!

For 1.14 & 1.15, I looked through the implementation of 
`ResourceManagerServiceImpl`, which seems to be new in 1.14, and I follow what 
you're saying. I agree that the `resourceManagerMetricGroup` and 
`slotManagerMetricGroup` are the only metrics affected and that storing the 
`metricRegistry` (and `hostname`, which is required to create the metric group) 
and creating the metric group with each new leader session makes the most 
sense. I can start on this change and could post a PR tomorrow during US 
working hours.

 

I'm not sure I follow your point about 1.13 though. I currently run 1.13 in 
standalone, session mode and the JM process does seem to live through multiple 
leader sessions. Regardless, my understanding is that Flink only supports the 
latest two versions, which would be 1.14 and 1.15, so a patch for 1.13 is not 
desired – is that correct?

> Suspended SlotManager fail to reregister metrics when started again
> ---
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.13.5
>Reporter: Ben Augarten
>Priority: Major


[jira] [Commented] (FLINK-27420) Suspended SlotManager fail to reregister metrics when started again

2022-04-26 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528495#comment-17528495
 ] 

Xintong Song commented on FLINK-27420:
--

Thanks for reporting this, [~baugarten].

This is indeed a valid issue. I'd like to add a bit more clarification.
* For 1.13, I don't think it is supported for the JM process to live through 
multiple leader sessions, i.e. being revoked and re-granted leadership without 
failing the process. I know some of the code looks as if it is supported, but 
unfortunately it never really worked until FLINK-23240, which is fixed in 1.14.4.
* For 1.14 & 1.15, yes, the issue still exists. Since 1.14, we create a new 
ResourceManager instance for each leader session. However, some of the 
components and services are preserved in {{ResourceManagerProcessContext}} and 
are reused across multiple RM instances. If these components / services are 
closed, they need to be restarted properly. I've checked the current 
implementation, and it seems the only things affected are 
{{resourceManagerMetricGroup}} and {{slotManagerMetricGroup}}. I think the 
easiest way to fix this is probably to store {{metricRegistry}} rather than the 
metric groups in {{ResourceManagerProcessContext}}, so that we can create new 
metric group instances for each leader session.
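
A rough sketch of the proposal above, for illustration only. All types here (`SketchRegistry`, `SketchGroup`, `SketchProcessContext`) are hypothetical and are not Flink classes; the sketch assumes only what the comment states: the long-lived context keeps the registry, and a fresh metric group is created from it for every leader session.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration of the proposal; none of these types are Flink classes. */
public class PerSessionMetricGroupSketch {

    /** Long-lived registry that outlives individual leader sessions. */
    static final class SketchRegistry {
        private final Set<String> registered = new HashSet<>();

        synchronized void register(String name) { registered.add(name); }
        synchronized void unregister(String name) { registered.remove(name); }
        synchronized boolean contains(String name) { return registered.contains(name); }
    }

    /** Short-lived group bound to one leader session; once closed it stays closed. */
    static final class SketchGroup {
        private final SketchRegistry registry;
        private final Set<String> names = new HashSet<>();
        private boolean closed;

        SketchGroup(SketchRegistry registry) { this.registry = registry; }

        /** Registers a metric name unless the group has been closed. */
        synchronized void gauge(String name) {
            if (!closed) {
                names.add(name);
                registry.register(name);
            }
        }

        /** Unregisters every metric of this group from the shared registry. */
        synchronized void close() {
            closed = true;
            names.forEach(registry::unregister);
            names.clear();
        }
    }

    /** The context stores the registry instead of ready-made groups. */
    static final class SketchProcessContext {
        private final SketchRegistry registry;

        SketchProcessContext(SketchRegistry registry) { this.registry = registry; }

        SketchRegistry registry() { return registry; }

        /** Creates a new, open group for the next leader session. */
        SketchGroup newSlotManagerGroup() { return new SketchGroup(registry); }
    }

    /** Runs two leader sessions and reports whether the metric exists in the second one. */
    static boolean metricPresentInSecondSession() {
        SketchProcessContext context = new SketchProcessContext(new SketchRegistry());

        SketchGroup first = context.newSlotManagerGroup();
        first.gauge("taskslotsavailable");
        first.close(); // leadership revoked: the session's group is closed

        SketchGroup second = context.newSlotManagerGroup(); // leadership re-granted
        second.gauge("taskslotsavailable");
        return context.registry().contains("taskslotsavailable");
    }

    public static void main(String[] args) {
        System.out.println(metricPresentInSecondSession());
    }
}
```

Because each session registers against its own open group, `metricPresentInSecondSession()` returns `true` here. Keeping a closed group in the context and reusing it would instead drop the second registration.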

WDYT?

> Suspended SlotManager fail to reregister metrics when started again
> ---
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.13.5
>Reporter: Ben Augarten
>Priority: Major


[jira] [Commented] (FLINK-27420) Suspended SlotManager fail to reregister metrics when started again

2022-04-26 Thread Guowei Ma (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528478#comment-17528478
 ] 

Guowei Ma commented on FLINK-27420:
---

Thanks [~baugarten] for reporting this. cc [~xtsong] 

> Suspended SlotManager fail to reregister metrics when started again
> ---
>
> Key: FLINK-27420
> URL: https://issues.apache.org/jira/browse/FLINK-27420
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination, Runtime / Metrics
>Affects Versions: 1.13.5
>Reporter: Ben Augarten
>Priority: Major