[ https://issues.apache.org/jira/browse/YARN-11490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721370#comment-17721370 ]
Tamas Domok commented on YARN-11490:
------------------------------------

h2. RCA

YARN-11211 added a clear() call on the queue metrics cache after configuration validation. As a result, using the validation API two or more times registers new CSQueueMetrics objects in the MetricsSystem (DefaultMetricsSystem), and the QUEUE_METRICS map will also contain these new (supposedly temporary) CSQueueMetrics objects. (The validation API creates a new CapacityScheduler, which rebuilds the same queue hierarchy plus the additional new queues.) Meanwhile the original queue hierarchy still holds the proper (in-use) CSQueueMetrics objects; that's why the /ws/v1/cluster/metrics endpoint still works.

The attached log - [^hadoop-tdomok-resourcemanager-tdomok-MBP16.log] - contains an example where the AbstractCSQueue and CSQueueMetrics object identity hash codes are visible, so one can follow the initial queue hierarchy. Then there are two validation API calls; on the second call, new CSQueueMetrics objects are created and registered for the queues.

Since the QUEUE_METRICS cache is currently add-only, the "leak" (caching forever) is also present when queues with unique queue names are added and removed in the normal way. A separate Jira can be filed for that.

h2. Solution ideas

# Revert YARN-11211. It's a nasty bug, and the "leak" only causes problems if the validation API is abused with unique queue names.
# Do not manipulate the DefaultMetricsSystem during configuration validation. It is not clear how to do this (maybe introduce a DummyMetricsSystem and use that in the validator CapacityScheduler).
# Spawn a separate process for configuration validation with the proper config / state. Not sure whether this is feasible, but it would be the cleanest.
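The interaction described above can be sketched with a minimal, self-contained model of the add-only cache. QueueMetricsCache and forQueue are illustrative names, not the actual CSQueueMetrics API; this only demonstrates why clearing the cache while the live scheduler still holds references leaves JMX looking at a different object:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the add-only QUEUE_METRICS cache (names are illustrative,
// not the real CSQueueMetrics/QueueMetrics API).
public class QueueMetricsCache {
    private static final Map<String, Object> QUEUE_METRICS = new ConcurrentHashMap<>();

    // Create-and-cache-on-first-lookup, like the real forQueue() behaviour.
    public static Object forQueue(String name) {
        return QUEUE_METRICS.computeIfAbsent(name, n -> new Object());
    }

    // Models the clearQueueMetrics() call that YARN-11211 added after validation.
    public static void clear() {
        QUEUE_METRICS.clear();
    }

    public static void main(String[] args) {
        Object inUse = forQueue("root.default"); // held by the live queue hierarchy
        clear();                                 // validation run wipes the cache
        Object fresh = forQueue("root.default"); // next lookup registers a NEW object
        // The live scheduler keeps updating 'inUse', but JMX now reads 'fresh',
        // so JMX reports stale/zeroed values while /ws/v1/cluster/metrics still works.
        System.out.println(inUse == fresh);      // prints "false"
    }
}
```

The same model also shows the add-only leak: entries for removed queues are never evicted, so unique queue names accumulate in the map forever.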
> JMX QueueMetrics breaks after mutable config validation in CS
> -------------------------------------------------------------
>
> Key: YARN-11490
> URL: https://issues.apache.org/jira/browse/YARN-11490
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 3.4.0
> Reporter: Tamas Domok
> Assignee: Tamas Domok
> Priority: Major
> Attachments: hadoop-tdomok-resourcemanager-tdomok-MBP16.log
>
> Reproduction steps:
> 1. Submit a long running job
> {code}
> hadoop-3.4.0-SNAPSHOT/bin/yarn jar hadoop-3.4.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.4.0-SNAPSHOT-tests.jar sleep -m 1 -r 1 -rt 1200000 -mt 20
> {code}
> 2. Verify that there is one running app
> {code}
> $ curl http://localhost:8088/ws/v1/cluster/metrics | jq
> {code}
> 3. Verify that the JMX endpoint reports 1 running app as well
> {code}
> $ curl http://localhost:8088/jmx | jq
> {code}
> 4. Validate the configuration (x2)
> {code}
> $ curl -X POST -H 'Content-Type: application/json' -d @defaultqueue.json localhost:8088/ws/v1/cluster/scheduler-conf/validate
> $ cat defaultqueue.json
> {"update-queue":{"queue-name":"root.default","params":{"entry":{"key":"maximum-applications","value":"100"}}},"subClusterId":"","global":null,"global-updates":null}
> {code}
> 5. Check 2. and 3. again. The cluster metrics should still work, but the JMX endpoint will show 0 running apps; that's the bug.
> It is caused by YARN-11211; reverting that patch (or only removing the _QueueMetrics.clearQueueMetrics();_ line) fixes the issue. But I think that would re-introduce the memory leak.
> It looks like the QUEUE_METRICS hash map is "add-only"; clearQueueMetrics() was only called from the ResourceManager.reinitialize() method (transitionToActive/transitionToStandby) prior to YARN-11211.
> Constantly adding and removing queues with unique names would cause a leak as well, because there is no remove from QUEUE_METRICS, so it is not just the validation API that has this problem.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)