[ https://issues.apache.org/jira/browse/YARN-11490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721946#comment-17721946 ]
Szilard Nemeth commented on YARN-11490:
---------------------------------------

Hi [~tdomok],

Nice finding. I agree with your statements.

1. The memory leak
{quote}
Revert YARN-11211, it's a nasty bug and the "leak" only causes problems if the validation API is abused with unique queue names. Note that YARN-11211 did not solve the leak problem either, details above.
{quote}
Good that you characterized the nature of the leak; I think it's okay to revert YARN-11211 in this case. Please file a separate bug ticket for the leak.

3. Validate separately:
{quote}
Spawn a separate process for configuration validation with the proper config / state. Not sure if this is feasible or not, but it would be the cleanest.
{quote}
I agree that this would be the cleanest approach, but given the current state of the codebase I doubt it would be easy to implement.

> JMX QueueMetrics breaks after mutable config validation in CS
> -------------------------------------------------------------
>
>                 Key: YARN-11490
>                 URL: https://issues.apache.org/jira/browse/YARN-11490
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.4.0
>            Reporter: Tamas Domok
>            Assignee: Tamas Domok
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: addqueue.xml, defaultqueue.json, hadoop-tdomok-resourcemanager-tdomok-MBP16.log, removequeue.xml, stopqueue.json
>
> Reproduction steps:
>
> 1. Submit a long-running job:
> {code}
> hadoop-3.4.0-SNAPSHOT/bin/yarn jar hadoop-3.4.0-SNAPSHOT/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.4.0-SNAPSHOT-tests.jar sleep -m 1 -r 1 -rt 1200000 -mt 20
> {code}
> 2. Verify that there is one running app:
> {code}
> $ curl http://localhost:8088/ws/v1/cluster/metrics | jq
> {code}
> 3. Verify that the JMX endpoint reports 1 running app as well:
> {code}
> $ curl http://localhost:8088/jmx | jq
> {code}
> 4. Validate the configuration (x2):
> {code}
> $ curl -X POST -H 'Content-Type: application/json' -d @defaultqueue.json localhost:8088/ws/v1/cluster/scheduler-conf/validate
> $ cat defaultqueue.json
> {"update-queue":{"queue-name":"root.default","params":{"entry":{"key":"maximum-applications","value":"100"}}},"subClusterId":"","global":null,"global-updates":null}
> {code}
> 5. Check steps 2. and 3. again. The cluster metrics still work, but the JMX endpoint now shows 0 running apps; that is the bug.
>
> It is caused by YARN-11211; reverting that patch (or only removing the _QueueMetrics.clearQueueMetrics();_ line) fixes the issue, but I think that would re-introduce the memory leak.
>
> It looks like the QUEUE_METRICS hash map is "add-only": prior to YARN-11211, clearQueueMetrics() was only called from the ResourceManager.reinitialize() method (transitionToActive/transitionToStandby). Constantly adding and removing queues with unique names would cause a leak as well, because entries are never removed from QUEUE_METRICS, so it is not just the validation API that has this problem.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
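[Editor's note] To illustrate the "add-only" cache pattern the description refers to, here is a minimal, hypothetical Java sketch. It is not the actual org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetrics code; the class, its `Object` values, and the `removeQueueMetrics` method are assumptions modeled on the names in the report. It shows why unique queue names leak entries when there is no per-key removal, and why a per-key remove is a gentler alternative to the global clear() that wiped live queues' JMX metrics.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, hypothetical stand-in for the QueueMetrics cache described
// in the report; NOT the real Hadoop YARN implementation.
public class QueueMetricsCacheSketch {

    // "Add-only" static cache: entries are inserted on first use but,
    // prior to YARN-11211, removed only all at once by clearQueueMetrics().
    private static final Map<String, Object> QUEUE_METRICS = new ConcurrentHashMap<>();

    // Models register-on-first-use: a new entry per previously unseen queue name.
    public static Object forQueue(String queueName) {
        return QUEUE_METRICS.computeIfAbsent(queueName, n -> new Object());
    }

    // What YARN-11211 invoked after validation: wipes live queues' entries
    // too, which is why JMX then reported 0 running apps.
    public static void clearQueueMetrics() {
        QUEUE_METRICS.clear();
    }

    // Hedged alternative (an assumption, not the committed fix): remove only
    // the entry for a queue that is actually gone, leaving live queues intact.
    public static void removeQueueMetrics(String queueName) {
        QUEUE_METRICS.remove(queueName);
    }

    public static int cacheSize() {
        return QUEUE_METRICS.size();
    }

    public static void main(String[] args) {
        forQueue("root.default");          // a live queue
        for (int i = 0; i < 1000; i++) {
            forQueue("root.tmp-" + i);     // unique names: each call leaks one entry
        }
        System.out.println(cacheSize());   // 1001: the leak from unique names

        // clearQueueMetrics() would shrink this to 0, but would also drop
        // root.default's live metrics -- the JMX bug in this ticket.

        // Per-key removal drops only the transient entries:
        for (int i = 0; i < 1000; i++) {
            removeQueueMetrics("root.tmp-" + i);
        }
        System.out.println(cacheSize());   // 1: root.default survives
    }
}
```

The sketch only demonstrates the retention pattern; the real fix would need to hook removal into actual queue-removal and validation paths.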