laysfire opened a new pull request, #7430:
URL: https://github.com/apache/hadoop/pull/7430

   <!--
     Thanks for sending a pull request!
       1. If this is your first time, please read our contributor guidelines: 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
       2. Make sure your PR title starts with JIRA issue id, e.g., 
'HADOOP-17799. Your PR title ...'.
   -->
   
   ### Description of PR
   Currently, CGroupElasticMemoryController's implementation is based on CGroup 
V1. This PR updates CGroupElasticMemoryController to support CGroup V2.
   
   The current CGroupElasticMemoryController implementation works as follows 
(a minimal sketch of the key steps is given after the list):
   1. Disable the OOM killer by writing 1 to the memory.oom_control file
   2. Update the memory limits: memory.memsw.limit_in_bytes for virtual memory 
control and memory.limit_in_bytes for physical memory control
   3. Launch the OOM-Listener subprocess to listen for OOM events through 
cgroup.event_control
   4. When an OOM occurs, OOM-Listener notifies the NM
   5. The NM calls DefaultOOMHandler to resolve the OOM
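   
   Steps 1, 3 and 4 boil down to the eventfd plumbing below. This is a minimal 
C sketch, not the controller's actual code: the cgroup path is an assumption 
and error handling is omitted for brevity.

```c
/* Minimal sketch of the CGroup V1 mechanism: disable the OOM killer and
 * register an eventfd so the kernel wakes us on OOM. The cgroup path is an
 * assumption; the real controller derives it from NM configuration. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void) {
    const char *base = "/sys/fs/cgroup/memory/hadoop-yarn";  /* assumed */
    char path[256], buf[64];

    /* Step 1: disable the OOM killer for the hierarchy. */
    snprintf(path, sizeof(path), "%s/memory.oom_control", base);
    int oomfd = open(path, O_WRONLY);
    write(oomfd, "1", 1);
    close(oomfd);

    /* Step 3: register "<eventfd> <fd of memory.oom_control>" with
     * cgroup.event_control to subscribe to OOM events. */
    int efd = eventfd(0, 0);
    int ctlfd = open(path, O_RDONLY);      /* memory.oom_control again */
    int len = snprintf(buf, sizeof(buf), "%d %d", efd, ctlfd);
    snprintf(path, sizeof(path), "%s/cgroup.event_control", base);
    int evfd = open(path, O_WRONLY);
    write(evfd, buf, len);

    /* Step 4: block until the kernel posts an OOM event. */
    uint64_t count;
    read(efd, &count, sizeof(count));
    puts("OOM event received; notify the NM");
    return 0;
}
```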
   
   In CGroup V2, however, there is no way to disable the OOM killer. Once 
memory usage exceeds the limit, container processes are killed by the kernel 
and the NM cannot do anything about it.
   But CGroup V2 provides a throttling mechanism: memory.high is the memory 
usage throttle limit. If a cgroup's memory usage goes over this high boundary, 
the cgroup's processes are throttled and put under heavy reclaim pressure (see 
https://facebookmicrosites.github.io/cgroup2/docs/memory-controller.html). A 
sketch of setting these limits follows.
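 
   For illustration, setting a 40GB throttle boundary with 5GB of hard-limit 
headroom could look like this; the path and the concrete values are 
assumptions, not part of this PR.

```c
/* Illustrative sketch of setting the V2 limits discussed above.
 * Error handling omitted for brevity. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static void write_file(const char *path, const char *val) {
    int fd = open(path, O_WRONLY);
    write(fd, val, strlen(val));
    close(fd);
}

int main(void) {
    /* Throttling starts at 40GB; the hard kill limit sits 5GB above it. */
    write_file("/sys/fs/cgroup/hadoop-yarn/memory.high", "42949672960");
    write_file("/sys/fs/cgroup/hadoop-yarn/memory.max",  "48318382080");
    return 0;
}
```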
 
   CGroup V2 also provides PSI (Pressure Stall Information): we can get 
notified when the cgroup stalls on memory by writing a trigger to the 
memory.pressure file (see https://docs.kernel.org/accounting/psi.html).
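   
   Arming such a trigger follows the kernel PSI docs: write the trigger 
string, then poll the file for POLLPRI. A one-shot C sketch, where the cgroup 
path is again an assumption:

```c
/* Arm a PSI trigger on a V2 cgroup's memory.pressure, per the kernel PSI
 * docs: "some 150000 1000000" requests a wakeup when tasks are stalled on
 * memory for at least 150ms within any 1s window. */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/sys/fs/cgroup/hadoop-yarn/memory.pressure",
                  O_RDWR | O_NONBLOCK);
    const char *trigger = "some 150000 1000000";
    write(fd, trigger, strlen(trigger) + 1);

    /* The kernel signals the trigger via POLLPRI. */
    struct pollfd pfd = { .fd = fd, .events = POLLPRI };
    poll(&pfd, 1, -1);               /* block until pressure spikes */
    if (pfd.revents & POLLPRI)
        puts("memory pressure threshold crossed");
    return 0;
}
```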
   
   So the implementation based on CGroup V2 can be as follows (a sketch of 
the resulting listener loop is given after the list):
   1. Update the memory limits: memory.swap.max for virtual memory control, 
and memory.high & memory.max for physical memory control (memory.max should 
be set a little higher than memory.high, e.g. by 5GB)
   2. Launch the OOM-Listener subprocess to monitor memory pressure by 
writing a trigger such as "some 150000 1000000" to the memory.pressure file
   3. Once memory usage goes over memory.high, the processes are throttled, 
which means some tasks will stall
   4. OOM-Listener is woken up and notifies the NM
   5. The NM calls DefaultOOMHandler to resolve the OOM
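   
   Putting steps 2-4 together, the V2 OOM-Listener subprocess could look 
roughly like the sketch below. It assumes the listener reports events to the 
NM over stdout, as the V1 oom-listener does; the path and thresholds are 
illustrative.

```c
/* Sketch of the proposed V2 OOM-Listener loop: arm a PSI trigger once,
 * then poll repeatedly and report each pressure event to the NM on stdout.
 * Error handling omitted for brevity. */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/sys/fs/cgroup/hadoop-yarn/memory.pressure",
                  O_RDWR | O_NONBLOCK);                 /* path assumed */
    const char *trigger = "some 150000 1000000";
    write(fd, trigger, strlen(trigger) + 1);            /* step 2 */

    struct pollfd pfd = { .fd = fd, .events = POLLPRI };
    for (;;) {
        poll(&pfd, 1, -1);                              /* step 3: tasks stall */
        if (pfd.revents & POLLERR)
            break;                                      /* cgroup went away */
        if (pfd.revents & POLLPRI) {
            puts("oom");                                /* step 4: notify NM */
            fflush(stdout);
        }
    }
    return 0;
}
```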
   
   
   
   ### How was this patch tested?
   Unit tests and manual testing.
   
   ### For code changes:
   
   - [ ] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   

