On Thu, 19 May 2022 19:59:18 GMT, Ioi Lam <ik...@openjdk.org> wrote:

>>> What happens if cgroup_path is "/foo/bar" and _root is "/fo"?
>> 
>> With a mount path of `/bar` this ends up being `/bar/o/bar`. Pretty strange, 
>> but then again it's a bit of a contrived example as those paths come from 
>> `/proc` parsing. Anyway, this is code that got added with 
>> [JDK-8146115](https://bugs.openjdk.java.net/browse/JDK-8146115). It's not 
>> something I've written and to be honest, I'm not sure this branch is needed, 
>> but I didn't want to change the existing behaviour with this patch. I have 
>> no more insight than you in terms of why that approach has been taken.
>> 
>>> Maybe this block should be combined with the new `else` block you're adding?
>> 
>> Maybe, but I'm not sure if it would break something.
>> 
>>> However, the behavior seems opposite between these two blocks of code:
>>> 
>>> The upper block: _root is a prefix of cgroup_path, we take the **tail** of 
>>> cgroup_path
>>> 
>>> ```
>>>   TestCase substring_match = {
>>>     "/sys/fs/cgroup/memory",                           // mount_path
>>>     "/user.slice/user-1000.slice",                     // root_path
>>>     "/user.slice/user-1000.slice/user@1001.service",   // cgroup_path
>>>     "/sys/fs/cgroup/memory/user@1001.service"          // expected_path
>>>   };
>>> ```
>> 
>> Yes. Though, I cannot comment on why that has been chosen. It's been there 
>> since day one :-/
>> 
>>> The lower block: The beginning of _root is a prefix of cgroup_path, we take 
>>> the **head** of cgroup_path
>>> 
>>> ```
>>>  TestCase prefix_matched_cg = {
>>>     "/sys/fs/cgroup/memory",                           // mount_path
>>>     "/user.slice/user-1000.slice/session-50.scope",    // root_path
>>>     "/user.slice/user-1000.slice/session-3.scope",     // cgroup_path
>>>     "/sys/fs/cgroup/memory/user.slice/user-1000.slice" // expected_path
>>>   };
>>> ```
>>> 
>>> I find the behavior hard to understand. I think the rules (and reasons) 
>>> should be added to the comment block above the function.
>> 
>> The reason why I've gone down the path of adding the head of cgroup_path is 
>> because of this document (in conjunction to what the user was observing on 
>> an affected system):
>> https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
>> 
>> The user was observing paths as listed in the test:
>> 
>> "/user.slice/user-1000.slice/session-50.scope",    // root_path
>> "/user.slice/user-1000.slice/session-3.scope",     // cgroup_path
>> 
>> This very much looks like systemd managed. Given that and knowing that 
>> systemd adds processes into `scopes` or `services` and groups them via 
>> `slices` and also knowing that cgroups are hierarchical (i.e. limits of 
>> `/foo/bar` also apply to `/foo/bar/baz`), it seems likely that if there are 
>> any limits, then it'll be on `/user.slice/user-1000.slice` within the 
>> mounted controller. Unfortunately, I'm not able to reproduce this myself.
>
> I am wondering if the problem is this:
> 
> We have systemd running on the host, and a different copy of systemd that 
> runs inside the container.
> 
> - They both set up `/user.slice/user-1000.slice/session-??.scope` within 
> their own file systems
> - For some reason, when you're looking inside the container, 
> `/proc/self/cgroup` might use a path in the containerized file system whereas 
> `/proc/self/mountinfo` uses a path in the host file system. These two paths 
> may look alike but they have absolutely no relation to each other.
> 
> I have asked the reporter for more information:
> 
> https://gist.github.com/gaol/4d96eace8290e6549635fdc0ea41d0b4?permalink_comment_id=4172593#gistcomment-4172593
> 
> Meanwhile, I think the current method of finding "which directory under 
> /sys/fs/cgroup/memory controls my memory usage" is broken. As mentioned 
> about, the path you get from  `/proc/self/cgroup`  and `/proc/self/mountinfo` 
> have no relation to each other, but we use them anyway to get our answer, 
> with many ad-hoc methods that are not documented in the code.
> 
> Maybe we should do this instead?
> 
> - Read /proc/self/cgroup
> - Find the `10:memory:<path>` line
> - If `/sys/fs/cgroup/memory/<path>/tasks` contains my PID, this is the path
> - Otherwise, scan all `tasks` files under  `/sys/fs/cgroup/memory/`. Exactly 
> one of them contains my PID.
> 
> For example, here's a test with docker:
> 
> 
> INSIDE CONTAINER
> # cat /proc/self/cgroup | grep memory
> 10:memory:/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050
> # cat /proc/self/mountinfo | grep memory
> 801 791 0:42 
> /docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050 
> /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime master:23 - cgroup 
> cgroup rw,memory
> # cat 
> /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
> cat: 
> /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks:
>  No such file or directory
> # cat /sys/fs/cgroup/memory/tasks | grep $$
> 1
> 
> ON HOST
> # cat 
> /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
> 37494
> # cat /proc/37494/status | grep NSpid
> NSpid:        37494   1

Also, I think the current PR could produce the wrong answer, if systemd is 
indeed running inside the container, and we have:


"/user.slice/user-1000.slice/session-50.scope",    // root_path
"/user.slice/user-1000.slice/session-3.scope",     // cgroup_path


The PR gives /sys/fs/cgroup/memory/user.slice/user-1000.slice/, which specifies 
the overall memory limit for user-1000. However, the correct answer may be 
/sys/fs/cgroup/memory/user.slice/user-1000.slice/session-3.scope, which may 
have a smaller memory limit, and the JVM may end up allocating a larger heap 
than allowed.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8629

Reply via email to