On Thu, 19 May 2022 20:18:50 GMT, Ioi Lam <ik...@openjdk.org> wrote:

>> I am wondering if the problem is this:
>> 
>> We have systemd running on the host, and a different copy of systemd that 
>> runs inside the container.
>> 
>> - They both set up `/user.slice/user-1000.slice/session-??.scope` within 
>> their own file systems
>> - For some reason, when you're looking inside the container, 
>> `/proc/self/cgroup` might use a path in the containerized file system 
>> whereas `/proc/self/mountinfo` uses a path in the host file system. These 
>> two paths may look alike but they have absolutely no relation to each other.
>> 
>> I have asked the reporter for more information:
>> 
>> https://gist.github.com/gaol/4d96eace8290e6549635fdc0ea41d0b4?permalink_comment_id=4172593#gistcomment-4172593
>> 
>> Meanwhile, I think the current method of finding "which directory under 
>> /sys/fs/cgroup/memory controls my memory usage" is broken. As mentioned 
>> above, the paths you get from `/proc/self/cgroup` and 
>> `/proc/self/mountinfo` may have no relation to each other, yet we use them 
>> anyway to get our answer, with many ad-hoc methods that are not documented 
>> in the code.
>> 
>> Maybe we should do this instead?
>> 
>> - Read /proc/self/cgroup
>> - Find the `10:memory:<path>` line
>> - If `/sys/fs/cgroup/memory/<path>/tasks` contains my PID, this is the path
>> - Otherwise, scan all `tasks` files under  `/sys/fs/cgroup/memory/`. Exactly 
>> one of them contains my PID.
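
[Editor's note: the fallback scan proposed above could be sketched roughly as below. This is a hypothetical helper, not the JDK's actual code; the method names, the `root` parameter, and the cgroups v1 `tasks`-file layout are all assumptions for illustration.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.stream.Stream;

public class CgroupPathFinder {
    // Does this cgroup directory's "tasks" file list the given PID?
    static boolean containsPid(Path tasksFile, String pid) {
        try {
            return Files.exists(tasksFile)
                && Files.readAllLines(tasksFile).contains(pid);
        } catch (IOException e) {
            return false;  // unreadable entries are treated as "not ours"
        }
    }

    // Fallback: scan every "tasks" file under the controller root.
    // Exactly one of them should contain our PID.
    static Optional<Path> findByTasksScan(Path root, long pid) throws IOException {
        String needle = Long.toString(pid);
        try (Stream<Path> dirs = Files.walk(root)) {
            return dirs.filter(Files::isDirectory)
                       .filter(d -> containsPid(d.resolve("tasks"), needle))
                       .findFirst();
        }
    }

    // First trust the <path> from /proc/self/cgroup; if its "tasks" file
    // does not list our PID (as in the docker example below), fall back
    // to scanning the whole hierarchy.
    static Optional<Path> memoryCgroupDir(Path root, String cgroupPath, long pid)
            throws IOException {
        Path candidate = root.resolve(cgroupPath.replaceFirst("^/", ""));
        if (containsPid(candidate.resolve("tasks"), Long.toString(pid))) {
            return Optional.of(candidate);
        }
        return findByTasksScan(root, pid);
    }
}
```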
>> 
>> For example, here's a test with docker:
>> 
>> 
>> INSIDE CONTAINER
>> # cat /proc/self/cgroup | grep memory
>> 10:memory:/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050
>> # cat /proc/self/mountinfo | grep memory
>> 801 791 0:42 
>> /docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050 
>> /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime master:23 - cgroup 
>> cgroup rw,memory
>> # cat 
>> /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
>> cat: 
>> /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks:
>>  No such file or directory
>> # cat /sys/fs/cgroup/memory/tasks | grep $$
>> 1
>> 
>> ON HOST
>> # cat 
>> /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
>> 37494
>> # cat /proc/37494/status | grep NSpid
>> NSpid:       37494   1
>
> Also, I think the current PR could produce the wrong answer, if systemd is 
> indeed running inside the container, and we have:
> 
> 
> "/user.slice/user-1000.slice/session-50.scope",    // root_path
> "/user.slice/user-1000.slice/session-3.scope",     // cgroup_path
> 
> 
> The PR gives /sys/fs/cgroup/memory/user.slice/user-1000.slice/, which 
> specifies the overall memory limit for user-1000. However, the correct answer 
> may be /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-3.scope, 
> which may have a smaller memory limit, and the JVM may end up allocating a 
> larger heap than allowed.

Yes, if we can decide which file is the right one. This is largely undocumented 
territory. The correct fix is to a) find the correct path to the namespace 
hierarchy the process is a part of, and b) starting at the leaf node, walk up 
the hierarchy and find the **lowest** limits. Doing this would be very expensive!
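Step b) could be sketched as below (a hypothetical helper for cgroups v1, not a proposed implementation; the `root`/`leaf` parameters are assumptions, and "unlimited" is modeled simply as Long.MAX_VALUE rather than the page-rounded value the kernel actually reports):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalLong;

public class CgroupLimitWalker {
    // Starting at the leaf cgroup directory, walk up to (and including)
    // the controller root, and return the smallest memory limit seen.
    // Each level costs an extra stat + read, which is part of why doing
    // this on every lookup would be expensive.
    static OptionalLong lowestLimit(Path root, Path leaf) throws IOException {
        long lowest = Long.MAX_VALUE;
        for (Path dir = leaf; dir != null && dir.startsWith(root); dir = dir.getParent()) {
            Path limitFile = dir.resolve("memory.limit_in_bytes");
            if (Files.exists(limitFile)) {
                long limit = Long.parseLong(Files.readString(limitFile).trim());
                lowest = Math.min(lowest, limit);
            }
        }
        return lowest == Long.MAX_VALUE ? OptionalLong.empty() : OptionalLong.of(lowest);
    }
}
```

In the session-scope example above, this would pick up the smaller limit on `session-3.scope` even when the parent `user-1000.slice` carries a larger one.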

Aside: Current container detection in the JVM/JDK is notoriously imprecise. 
It's largely based on common setups (containers like Docker). The heuristics 
assume that memory limits are reported inside the container at the leaf node. 
If that's not the case, the detected limits will be wrong: the JVM will report 
the memory as unlimited even though it is, for example, constrained at a parent 
node. This can be reproduced on a cgroups v2 system with a systemd slice that 
sets memory limits. We've worked around this in OpenJDK for cgroups v1 with 
https://bugs.openjdk.java.net/browse/JDK-8217338

-------------

PR: https://git.openjdk.java.net/jdk/pull/8629
