[ https://issues.apache.org/jira/browse/YARN-11490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721370#comment-17721370 ]

Tamas Domok edited comment on YARN-11490 at 5/11/23 6:32 AM:
-------------------------------------------------------------

h2. RCA

YARN-11211 added a clear() call on the queue metrics cache after configuration 
validation. As a result, using the validation API two or more times registers 
new CSQueueMetrics objects in the MetricsSystem (DefaultMetricsSystem), and the 
QUEUE_METRICS map also ends up holding these new (supposedly temporary) 
CSQueueMetrics objects: because the cache was cleared by the previous 
validation call, the next one misses the cache and creates and registers fresh 
objects. (The validation API creates a new CapacityScheduler, which builds the 
same queue hierarchy plus the additional new queues.) Meanwhile the original 
queue hierarchy still holds the proper (in-use) CSQueueMetrics objects; that is 
why the /ws/v1/cluster/metrics endpoint still works while JMX reports the 
fresh, empty ones.
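
For reference, a condensed paraphrase of the relevant lookup path in 
QueueMetrics.forQueue() (simplified, not the verbatim source; the 
QUEUE_METRICS lookup is the only guard against creating and registering 
another source for the same queue name):

{code}
// Condensed paraphrase of QueueMetrics.forQueue() (simplified, not verbatim).
public synchronized static QueueMetrics forQueue(MetricsSystem ms,
    String queueName, Queue parent, boolean enableUserMetrics,
    Configuration conf) {
  QueueMetrics metrics = QUEUE_METRICS.get(queueName);
  if (metrics == null) {                    // cache miss
    metrics = new QueueMetrics(ms, queueName, parent, enableUserMetrics, conf);
    if (ms != null) {
      // Registers a source for this queue in the metrics system (and JMX).
      metrics = ms.register(sourceName(queueName).toString(),
          "Metrics for queue: " + queueName, metrics);
    }
    QUEUE_METRICS.put(queueName, metrics);  // entries are never removed
  }
  return metrics;
}
{code}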

 
The attached log - [^hadoop-tdomok-resourcemanager-tdomok-MBP16.log] - contains 
an example where the AbstractCSQueue and CSQueueMetrics object identity hash 
codes are visible, so one can follow the initial queue hierarchy; then there 
are two validation API calls, and on the second call new CSQueueMetrics 
objects are created and registered for the queues.

h3. YARN-11211 did not solve the leak

The QUEUE_METRICS map is cleared, but the MetricsSystem still stores the 
CSQueueMetrics objects, so the validated queues remain visible in the JMX 
response.
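
In other words, the clear only covers the static cache; nothing unregisters 
the per-queue sources. A hedged sketch of the difference (removeQueueMetrics 
is a hypothetical helper for illustration, not existing API):

{code}
// Roughly what clearQueueMetrics() does today: it empties only the cache;
// the per-queue sources stay registered in DefaultMetricsSystem (and JMX).
public synchronized static void clearQueueMetrics() {
  QUEUE_METRICS.clear();
}

// Hypothetical helper (sketch, not existing API): a complete cleanup would
// also have to unregister the source from the metrics system.
public synchronized static void removeQueueMetrics(String queueName) {
  QUEUE_METRICS.remove(queueName);
  DefaultMetricsSystem.instance().unregisterSource(
      sourceName(queueName).toString());
}
{code}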

h3. It's not just the validation API

Since the QUEUE_METRICS cache is currently add-only, the "leak" (caching 
forever) is also there when queues with unique queue names are normally added 
and removed, as illustrated below. A separate Jira can be filed for that.
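
A hedged illustration of that add-only behaviour (hypothetical driver loop, 
not real scheduler code):

{code}
// Hypothetical illustration (not real scheduler code): every unique queue
// name leaves a permanent entry in the static QUEUE_METRICS cache.
Configuration conf = new Configuration();
for (int i = 0; i < 10_000; i++) {
  String name = "root.tmp-" + i;
  QueueMetrics.forQueue(name, null, false, conf); // adds an entry
  // The queue may later be removed from the scheduler, but there is no
  // matching removal from QUEUE_METRICS, so the entry lives on forever.
}
{code}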


h2. Solution ideas
 # Revert YARN-11211; it's a nasty bug, and the "leak" only causes problems if 
the validation API is abused with unique queue names. Note that YARN-11211 did 
not solve the leak problem either; see the details above.
 # Do not manipulate the DefaultMetricsSystem during configuration validation. 
It is not yet clear how (perhaps introduce a dummy MetricsSystem and use that 
in the validator CapacityScheduler; a sketch follows this list).
 # Spawn a separate process for configuration validation with the proper 
config / state. It is not clear whether this is feasible, but it would be the 
cleanest solution.

Probably both 1. and 2. should be done.
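
For idea 2, a minimal sketch of the direction, assuming the validator 
CapacityScheduler could be handed a metrics system other than 
DefaultMetricsSystem (the wiring through CapacityScheduler is not shown and 
is the hard part):

{code}
// Sketch for idea 2 (assumption: the validator CapacityScheduler can be
// pointed at its own MetricsSystem instead of DefaultMetricsSystem).
// QueueMetrics already has a forQueue overload taking a MetricsSystem, so a
// throwaway instance would keep validation out of the real registry and JMX.
MetricsSystem validationMs = new MetricsSystemImpl("validation");
QueueMetrics metrics = QueueMetrics.forQueue(
    validationMs, "root.default", null, false, conf);
// ... run the validation, then drop the whole throwaway system:
validationMs.shutdown();
// Caveat: the static QUEUE_METRICS cache is still shared; it would need
// separate handling (see the hypothetical removeQueueMetrics above).
{code}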



> JMX QueueMetrics breaks after mutable config validation in CS
> -------------------------------------------------------------
>
>                 Key: YARN-11490
>                 URL: https://issues.apache.org/jira/browse/YARN-11490
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.4.0
>            Reporter: Tamas Domok
>            Assignee: Tamas Domok
>            Priority: Major
>         Attachments: hadoop-tdomok-resourcemanager-tdomok-MBP16.log
>
>
> Reproduction steps:
> 1. Submit a long-running job
> {code}
> hadoop-3.4.0-SNAPSHOT/bin/yarn jar 
> hadoop-3.4.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.4.0-SNAPSHOT-tests.jar
>  sleep -m 1 -r 1 -rt 1200000 -mt 20
> {code}
> 2. Verify that there is one running app
> {code}
> $ curl http://localhost:8088/ws/v1/cluster/metrics | jq
> {code}
> 3. Verify that the JMX endpoint reports 1 running app as well
> {code}
> $ curl http://localhost:8088/jmx | jq
> {code}
> 4. Validate the configuration (x2)
> {code}
> $ curl -X POST -H 'Content-Type: application/json' -d @defaultqueue.json 
> localhost:8088/ws/v1/cluster/scheduler-conf/validate
> $ cat defaultqueue.json
> {"update-queue":{"queue-name":"root.default","params":{"entry":{"key":"maximum-applications","value":"100"}}},"subClusterId":"","global":null,"global-updates":null}
> {code}
> 5. Check steps 2. and 3. again. The cluster metrics endpoint still works, 
> but the JMX endpoint will show 0 running apps; that's the bug.
> It is caused by YARN-11211; reverting that patch (or only removing the 
> _QueueMetrics.clearQueueMetrics();_ line) fixes the issue, but I think that 
> would re-introduce the memory leak.
> It looks like the QUEUE_METRICS hash map is "add-only"; clearQueueMetrics() 
> was only called from the ResourceManager.reinitialize() method 
> (transitionToActive/transitionToStandby) prior to YARN-11211. Constantly 
> adding and removing queues with unique names would cause a leak as well, 
> because there is no removal from QUEUE_METRICS, so it is not just the 
> validation API that has this problem.


