On Wed, 31 Jan 2024 15:33:01 GMT, Jan Kratochvil <jkratoch...@openjdk.org> 
wrote:

>> It looks like this patch needs a bit more discussion. Generally, it would be 
>> good to have the functionality, but I'm unsure about the proposed 
>> implementation.
>> 
>> Walking the directory hierarchy (**and** interface files) on every call of 
>> `physical_memory()` (or `active_processor_count()` for CPU) seems 
>> concerning. Since it seems unlikely for cgroup interface file paths changing 
>> during runtime, walking the hierarchy for every look-up seems overkill. Note 
>> that the prime users of this feature are in-container use, where most 
>> runtimes have the limit settings at the leaf. A few more comments below:
>> 
>> - After this patch, AFAIUI for a cgroup path like `/foo/bar/baz/memory.max` 
>> the following files are being looked at for the memory limit (for example 
>> for the interface file `memory.max`):
>>   ```
>>    /foo/bar/baz/memory.max
>>    /foo/bar/memory.max
>>    /foo/memory.max
>>   ```
>>   This used to be just one file at the leaf, `/foo/bar/baz/memory.max` 
>> (prior this patch). So this change has an effect on file IO. `N` files per 
>> metric to look at where it used to be one file (i.e. constant). Note that 
>> switches like `UseDynamicNumberOfCompilerThreads` make this particularly 
>> worrying.  I wonder if it would be sufficient to "lock" into the cgroup path 
>> where the lowest  limits are set in the hierarchy and only use the interface 
>> files at that path of a specific controller. This would mean adjusting 
>> `CgroupsV2Subsystem` similar to `CgroupsV1Subsystem` to have unique 
>> controller paths (i.e. `cpu` and `memory` controller paths could potentially 
>> be different). But the code would already be mostly there for this. The idea 
>> would be to do the initial walk of the tree at JVM startup, and from then 
>> on, only look at the path with a limit set (if any) for subsequent calls. 
>> That is `controller->subsystem_path()` would return the correct path at 
>> runtime when initialization is done
 . Thoughts?
>> - There seems to be an inconsistency between cgroups v1 (no recursive 
>> look-up) and cgroups v2  (recursive look-up). Why? I think we should employ 
>> the same semantics to both  versions (and drop the hierarchical work-around 
>> - [JDK-8217338](https://bugs.openjdk.java.net/browse/JDK-8217338) - for cg 
>> v1).
>> - There also seems to be an inconsistency between metrics being iterated. 
>> `memory.max` and `memory.swap.max`  and `memory.swap.current` are being 
>> iterated, others, like CPU quotas (`cpu.max` or
>>   `cpu.weight`) not. Why? Both limits could potentially be one path up the 
>> hierarchy, meaning that cp...
>
>> Walking the directory hierarchy (**and** interface files) on every call of 
>> `physical_memory()` (or `active_processor_count()` for CPU) seems concerning.
> 
> I have therefore implemented a hierarchical `memory.stat` extension for 
> cgroup2 which already exists for cgroup1 - if the patch is agreed upon I will 
> submit it to Linux kernel: 
> [cgroup2-hierarchical_memory_limit.patch.txt](https://github.com/openjdk/jdk/files/14113452/cgroup2-hierarchical_memory_limit.patch.txt)
> 
>> Since it seems unlikely for cgroup interface file paths changing during 
>> runtime, walking the hierarchy for every look-up seems overkill. Note that 
>> the prime users of this feature are in-container use, where most runtimes 
>> have the limit settings at the leaf.
> 
> While I understand the most common change of the limits is at the leaf 
> technically any parent group limit can change. Which is also what this patch 
> is implementing. For the performance issue I have implemented the Linux 
> kernel extension above.
> 
>> `N` files per metric to look at where it used to be one file (i.e. constant).
> 
> The problem is any of the files can change. But to fix the performance I have 
> implemented the Linux kernel patch above.
> 
>> I wonder if it would be sufficient to "lock" into the cgroup path where the 
>> lowest  limits are set in the hierarchy and only use the interface files at 
>> that path of a specific controller.
> 
> That mostly disables the functionality of this patch.
> 
> 
>>     * There seems to be an inconsistency between cgroups v1 (no recursive 
>> look-up) and cgroups v2  (recursive look-up). Why? I think we should employ 
>> the same semantics to both  versions (and drop the hierarchical work-around 
>> - [JDK-8217338](https://bugs.openjdk.java.net/browse/JDK-8217338) - for cg 
>> v1).
> 
> I did not much expect anyone still uses cgroup1. Plus customer was also 
> interested in cgroup2. AFAIK a last OS using cgroup1 was RHEL-7 which will 
> have EOL in a few months. But then I tried to implement cgroup1 but there it 
> sort of already worked thanks for your 
> [JDK-8217338](https://bugs.openjdk.java.net/browse/JDK-8217338). I just fixed 
> there to prefer the hierarchical limit over the leaf limit (and the testcase 
> does test it now) as that is what is used by Linux kernel.
> 
> And then my Linux kernel patch implements the same `memory.stat` way even for 
> cgroup2.
> 
>>     * There also seems to be an inconsistency between metrics being 
>> iterated. `memory.max` and `memory.swap.max`  and `memory.swap.current` are 
>> being iterated, others, like CPU quotas (`cpu.max` or...

@jankratochvil The "use hierarchical limit" was a work-around that bites us 
now. So instead we should attempt to unify cgv1 and cgv2 by walking the 
hierarchy and find the place in the hierarchy where the interface files - with 
actual limits imposed - are to be found. Since this patch attempts to solve a 
similar goal to what the old work-around solved for cg v1, we should re-think 
how to solve it properly. My suggestion would be to go with walking the 
hierarchy for both: cg v1 and cg v2.

As to the walking the hierarchy on every invocation problem, I don't think 
fixing it by relying on a kernel patch is the way to go. It'll take time for 
the ecosystem to pick those up and existing cg v2 systems won't be fixed 
(consider RHEL 8 or RHEL 9 systems and their derivatives). Therefore, the idea 
would be to do the iteration **once** when the `OSContainer` code initializes. 
Specifically, in `OSContainer::init()` or `CgroupSubsystemFactory::create()`. 
There is the off-chance that somebody "migrating" a JVM process from one cgroup 
hierarchy to another, but I'm not sure this is something we need to support. 
Restarting the JVM seems OK. After all, the GC and other subsystems aren't 
really ready for dynamic updates anyway. We will address that problem when it 
actually becomes one.

Once `CgroupSubsystemFactory::create()` is done, the individual controllers 
would return the correct path to the interface file by asking 
`CgroupController::subsystem_path()` where it is. This also has  a nice 
symmetry with `memory` vs. `cpu` and other controllers. All of them could have 
their own `subsystem_path()`.

This would shift the iterating-over-paths to init time and would then keep them 
fixed, addressing the performance concern.

Overall this would reduce complexity IMO and would also fit well with a 
solution I'm working on for 
[JDK-8261242](https://bugs.openjdk.org/browse/JDK-8261242) which we need to 
solve for other reasons.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17198#issuecomment-1929296749

Reply via email to