On Mon, 23 May 2022 09:24:19 GMT, Severin Gehwolf <sgehw...@openjdk.org> wrote:
>> Also, I think the current PR could produce the wrong answer, if systemd is
>> indeed running inside the container, and we have:
>>
>> "/user.slice/user-1000.slice/session-50.scope",    // root_path
>> "/user.slice/user-1000.slice/session-3.scope",     // cgroup_path
>>
>> The PR gives /sys/fs/cgroup/memory/user.slice/user-1000.slice/, which
>> specifies the overall memory limit for user-1000. However, the correct
>> answer may be
>> /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-3.scope, which may
>> have a smaller memory limit, and the JVM may end up allocating a larger heap
>> than allowed.
>
> Yes, if we can decide which one the right file is. This is largely
> undocumented territory. The correct fix is a) find the correct path to the
> namespace hierarchy the process is a part of. b) starting at the leaf node,
> walk up the hierarchy and find the **lowest** limits. Doing this would be
> very expensive!
>
> Aside: Current container detection in the JVM/JDK is notoriously imprecise.
> It's largely based on common setups (containers like docker). The heuristics
> assume that memory limits are reported inside the container at the leaf node.
> If, however, that's not the case, the detected limits will be wrong (it will
> detect it as unlimited, even though it's - for example - memory constrained
> at the parent). This can for example be reproduced on a cgroups v2 system
> with a systemd slice using memory limits. We've worked around this in OpenJDK
> for cgroups v1 by https://bugs.openjdk.java.net/browse/JDK-8217338

> Maybe we should do this instead?
>
> * Read /proc/self/cgroup
>
> * Find the `10:memory:<path>` line
>
> * If `/sys/fs/cgroup/memory/<path>/tasks` contains my PID, this is the
>   path
>
> * Otherwise, scan all `tasks` files under `/sys/fs/cgroup/memory/`.
>   Exactly one of them contains my PID.

Something like that seems most promising, but it would have to be
`cgroup.procs`, not `tasks`, as `tasks` lists task IDs (i.e. Linux threads),
not process IDs. We could keep the two common cases as short circuits, i.e.
the host and docker cases in the test.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8629
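
For illustration only, here is a rough Python sketch of the lookup discussed above, using `cgroup.procs` rather than `tasks`. It is not what the PR or the JVM does. The function names, the `mount` parameter and the default mount point `/sys/fs/cgroup/memory` are assumptions made for the example, and it only considers the cgroups v1 memory controller.

```python
import os

def memory_cgroup_path(proc_cgroup_text):
    """Return the path of the line whose controller list includes 'memory'."""
    for line in proc_cgroup_text.splitlines():
        # cgroups v1 line format: hierarchy-ID:controller-list:path
        parts = line.split(":", 2)
        if len(parts) == 3 and "memory" in parts[1].split(","):
            return parts[2]
    return None

def procs_file_has_pid(procs_file, pid):
    """True if the given cgroup.procs file lists pid; False if unreadable."""
    try:
        with open(procs_file) as f:
            return str(pid) in f.read().split()
    except OSError:
        return False

def find_memory_cgroup_dir(proc_cgroup_text, pid, mount="/sys/fs/cgroup/memory"):
    """Hypothetical helper: locate the memory cgroup directory holding pid."""
    # Short circuit: trust /proc/self/cgroup if that directory lists the PID.
    path = memory_cgroup_path(proc_cgroup_text)
    if path is not None:
        candidate = os.path.normpath(os.path.join(mount, path.lstrip("/")))
        if procs_file_has_pid(os.path.join(candidate, "cgroup.procs"), pid):
            return candidate
    # Fallback: scan every cgroup.procs under the mount point.
    for dirpath, _dirs, files in os.walk(mount):
        if "cgroup.procs" in files and procs_file_has_pid(
                os.path.join(dirpath, "cgroup.procs"), pid):
            return dirpath
    return None
```

On a real system this would be called with the contents of `/proc/self/cgroup` and `os.getpid()`. The fallback walk visits the whole hierarchy, so keeping the short circuit for the common host and docker cases matters for cost.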