On Thu, 19 May 2022 19:59:18 GMT, Ioi Lam <ik...@openjdk.org> wrote: >>> What happens if cgroup_path is "/foo/bar" and _root is "/fo"? >> >> With a mount path of `/bar` this ends up being `/bar/o/bar`. Pretty strange, >> but then again it's a bit of a contrived example as those paths come from >> `/proc` parsing. Anyway, this is code that got added with >> [JDK-8146115](https://bugs.openjdk.java.net/browse/JDK-8146115). It's not >> something I've written and to be honest, I'm not sure this branch is needed, >> but I didn't want to change the existing behaviour with this patch. I have >> no more insight than you in terms of why that approach has been taken. >> >>> Maybe this block should be combined with the new `else` block you're adding? >> >> Maybe, but I'm not sure if it would break something. >> >>> However, the behavior seems opposite between these two blocks of code: >>> >>> The upper block: _root is a prefix of cgroup_path, we take the **tail** of >>> cgroup_path >>> >>> ``` >>> TestCase substring_match = { >>> "/sys/fs/cgroup/memory", // mount_path >>> "/user.slice/user-1000.slice", // root_path >>> "/user.slice/user-1000.slice/user@1001.service", // cgroup_path >>> "/sys/fs/cgroup/memory/user@1001.service" // expected_path >>> }; >>> ``` >> >> Yes. Though, I cannot comment on why that has been chosen. It's been there >> since day one :-/ >> >>> The lower block: The beginning of _root is a prefix of cgroup_path, we take >>> the **head** of cgroup_path >>> >>> ``` >>> TestCase prefix_matched_cg = { >>> "/sys/fs/cgroup/memory", // mount_path >>> "/user.slice/user-1000.slice/session-50.scope", // root_path >>> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path >>> "/sys/fs/cgroup/memory/user.slice/user-1000.slice" // expected_path >>> }; >>> ``` >>> >>> I find the behavior hard to understand. I think the rules (and reasons) >>> should be added to the comment block above the function. >> >> The reason why I've gone down the path of adding the head of cgroup_path is >> because of this document (in conjunction to what the user was observing on >> an affected system): >> https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ >> >> The user was observing paths as listed in the test: >> >> "/user.slice/user-1000.slice/session-50.scope", // root_path >> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path >> >> This very much looks like systemd managed. Given that and knowing that >> systemd adds processes into `scopes` or `services` and groups them via >> `slices` and also knowing that cgroups are hierarchical (i.e. limits of >> `/foo/bar` also apply to `/foo/bar/baz`), it seems likely that if there are >> any limits, then it'll be on `/user.slice/user-1000.slice` within the >> mounted controller. Unfortunately, I'm not able to reproduce this myself. > > I am wondering if the problem is this: > > We have systemd running on the host, and a different copy of systemd that > runs inside the container. > > - They both set up `/user.slice/user-1000.slice/session-??.scope` within > their own file systems > - For some reason, when you're looking inside the container, > `/proc/self/cgroup` might use a path in the containerized file system whereas > `/proc/self/mountinfo` uses a path in the host file system. These two paths > may look alike but they have absolutely no relation to each other. > > I have asked the reporter for more information: > > https://gist.github.com/gaol/4d96eace8290e6549635fdc0ea41d0b4?permalink_comment_id=4172593#gistcomment-4172593 > > Meanwhile, I think the current method of finding "which directory under > /sys/fs/cgroup/memory controls my memory usage" is broken. As mentioned > about, the path you get from `/proc/self/cgroup` and `/proc/self/mountinfo` > have no relation to each other, but we use them anyway to get our answer, > with many ad-hoc methods that are not documented in the code. > > Maybe we should do this instead? > > - Read /proc/self/cgroup > - Find the `10:memory:<path>` line > - If `/sys/fs/cgroup/memory/<path>/tasks` contains my PID, this is the path > - Otherwise, scan all `tasks` files under `/sys/fs/cgroup/memory/`. Exactly > one of them contains my PID. > > For example, here's a test with docker: > > > INSIDE CONTAINER > # cat /proc/self/cgroup | grep memory > 10:memory:/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050 > # cat /proc/self/mountinfo | grep memory > 801 791 0:42 > /docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050 > /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime master:23 - cgroup > cgroup rw,memory > # cat > /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks > cat: > /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks: > No such file or directory > # cat /sys/fs/cgroup/memory/tasks | grep $$ > 1 > > ON HOST > # cat > /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks > 37494 > # cat /proc/37494/status | grep NSpid > NSpid: 37494 1
Also, I think the current PR could produce the wrong answer, if systemd is indeed running inside the container, and we have: "/user.slice/user-1000.slice/session-50.scope", // root_path "/user.slice/user-1000.slice/session-3.scope", // cgroup_path The PR gives /sys/fs/cgroup/memory/user.slice/user-1000.slice/, which specifies the overall memory limit for user-1000. However, the correct answer may be /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-3.scope, which may have a smaller memory limit, and the JVM may end up allocating a larger heap than allowed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8629