Peter Szucs created YARN-11733:
----------------------------------

    Summary: Fix the order of updating CPU controls with cgroup v1
    Key: YARN-11733
    URL: https://issues.apache.org/jira/browse/YARN-11733
    Project: Hadoop YARN
    Issue Type: Bug
    Components: yarn
    Reporter: Peter Szucs
    Assignee: Peter Szucs

After YARN-11674 (Update CpuResourceHandler implementation for cgroup v2 support), the order in which the cpu.cfs_period_us and cpu.cfs_quota_us controls are updated has changed. This can cause the error below when launching containers with CPU limits on cgroup v1:

{code:java}
PrintWriter unable to write to /var/cgroupv1/cpu/hadoop-yarn/container_e02_1727079571170_0040_02_000001/cpu.cfs_quota_us with value: 112500
{code}

*Reproduction:*

I set the following CPU limits for cgroups in yarn-site.xml:

{code:java}
yarn.nodemanager.resource.percentage-physical-cpu-limit: 90
yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage: true
{code}

After that, the limits were applied on the hadoop-yarn root hierarchy:

{code:java}
[root@pszucs-test-2 hadoop-yarn]# cat cpu.cfs_period_us
1000000
[root@pszucs-test-2 hadoop-yarn]# cat cpu.cfs_quota_us
900000
{code}

When I tried to launch a container, it failed with the following error:

{code:java}
PrintWriter unable to write to /var/cgroupv1/cpu/hadoop-yarn/container_e02_1727079571170_0040_02_000001/cpu.cfs_quota_us with value: 112500
{code}

This happens because the value 112 500 for cfs_quota_us makes the container exceed the limit defined at the parent level. The parent allows a quota-to-period ratio of 900 000 / 1 000 000 = 0.9, so a cgroup that still has a period of 100 000 can have a quota of at most 90 000. If I create a test cgroup manually and update this control, the write also succeeds only up to a value of 90 000:

{code:java}
[root@pszucs-test-2 hadoop-yarn]# cat test/cpu.cfs_period_us
100000
[root@pszucs-test-2 hadoop-yarn]# echo "90001" > test/cpu.cfs_quota_us
-bash: echo: write error: Invalid argument
[root@pszucs-test-2 hadoop-yarn]# echo "90000" > test/cpu.cfs_quota_us
{code}

*Solution:*

The cause of this issue is that the cfs_period_us control gets the default value of 100 000 when a new cgroup is created, but YARN uses 1 000 000 when it calculates the limit. Because of this, we need to update cpu.cfs_period_us before cpu.cfs_quota_us, so that the ratio between the two values is preserved and the limit defined at the parent level is not exceeded.

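The effect of the write order can be sketched with the numbers from this report. The class below is a minimal illustration, not the CpuResourceHandler code and not the kernel's actual validation. It assumes a simplified rule: a child's quota-to-period ratio must not exceed its parent's, and an unset quota (-1) always passes. The class and method names (CfsOrderSketch, fits, quotaThenPeriod, periodThenQuota) are hypothetical.

```java
public class CfsOrderSketch {

    // Simplified model (assumption): a child setting is accepted when
    // childQuota / childPeriod <= parentQuota / parentPeriod.
    // A negative child quota stands for "no limit set yet" and is always accepted here.
    static boolean fits(long parentPeriodUs, long parentQuotaUs,
                        long childPeriodUs, long childQuotaUs) {
        if (childQuotaUs < 0) {
            return true;
        }
        // Cross-multiply to compare the two ratios without floating point.
        return childQuotaUs * parentPeriodUs <= parentQuotaUs * childPeriodUs;
    }

    // Order that fails in this report: the quota is written while the new
    // cgroup still has its default period.
    static boolean quotaThenPeriod(long parentPeriodUs, long parentQuotaUs,
                                   long defaultPeriodUs, long targetPeriodUs,
                                   long targetQuotaUs) {
        if (!fits(parentPeriodUs, parentQuotaUs, defaultPeriodUs, targetQuotaUs)) {
            return false; // the quota write is rejected
        }
        return fits(parentPeriodUs, parentQuotaUs, targetPeriodUs, targetQuotaUs);
    }

    // Proposed order: the period is written first (quota still unset),
    // then the quota is checked against the updated period.
    static boolean periodThenQuota(long parentPeriodUs, long parentQuotaUs,
                                   long targetPeriodUs, long targetQuotaUs) {
        if (!fits(parentPeriodUs, parentQuotaUs, targetPeriodUs, -1L)) {
            return false;
        }
        return fits(parentPeriodUs, parentQuotaUs, targetPeriodUs, targetQuotaUs);
    }

    public static void main(String[] args) {
        long parentPeriodUs = 1_000_000L;  // hadoop-yarn/cpu.cfs_period_us
        long parentQuotaUs = 900_000L;     // hadoop-yarn/cpu.cfs_quota_us
        long defaultPeriodUs = 100_000L;   // period of a freshly created cgroup
        long targetPeriodUs = 1_000_000L;  // period YARN uses in its calculation
        long targetQuotaUs = 112_500L;     // quota from the error message

        System.out.println("quota first:  " + quotaThenPeriod(
            parentPeriodUs, parentQuotaUs, defaultPeriodUs, targetPeriodUs, targetQuotaUs));
        System.out.println("period first: " + periodThenQuota(
            parentPeriodUs, parentQuotaUs, targetPeriodUs, targetQuotaUs));
    }
}
```

Under this model, writing the quota first is rejected because 112 500 / 100 000 = 1.125 is above the parent's 0.9, while writing the period first is accepted because 112 500 / 1 000 000 = 0.1125 is below it. The same check reproduces the manual test: with a period of 100 000, a quota of 90 000 fits and 90 001 does not.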