Re: RFR: 8286212: Cgroup v1 initialization causes NPE on some systems [v3]

2022-05-18 Thread Severin Gehwolf
> Please review this change to the cgroup v1 subsystem which makes it more 
> resilient on some of the stranger systems. Unfortunately, I wasn't able to 
> re-create a similar system as the reporter. The idea of using the longest 
> substring match for the cgroupv1 file paths was based on the fact that on 
> systemd systems processes run in separate scopes and the maven forked test 
> runner might exhibit this property. For that it makes sense to use the common 
> ancestor path. Nothing changes in the common cases where the `cgroup_path` 
> matches `_root` and when the `_root` is `/` (container the former, host 
> system the latter).
> 
> In addition, the code paths which are susceptible to throw NPE have been 
> hardened to catch those situations. Should it happen in the future it makes 
> more sense (to me) to not have accurate container detection in favor of 
> continuing to keep running.
> 
> Finally, with the added unit-tests a bug was uncovered on the "substring" 
> match case of cgroup paths in hotspot. `p` returned from `strstr` can never 
> point to `_root` as it's used as the "needle" to find in "haystack" 
> `cgroup_path` (not the other way round).
> 
> Testing:
> - [x] Added unit tests
> - [x] GHA
> - [x] Container tests on cgroups v1 Linux. Continue to pass

Severin Gehwolf has updated the pull request incrementally with two additional 
commits since the last revision:

 - Refactor hotspot gtest
 - Separate into function. Fix comment.

-

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8629/files
  - new: https://git.openjdk.java.net/jdk/pull/8629/files/66241aa5..4b8e92fa

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8629&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8629&range=01-02

  Stats: 93 lines in 3 files changed: 56 ins; 25 del; 12 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8629.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8629/head:pull/8629

PR: https://git.openjdk.java.net/jdk/pull/8629


Re: RFR: 8286212: Cgroup v1 initialization causes NPE on some systems [v3]

2022-05-18 Thread Ioi Lam
On Wed, 18 May 2022 18:14:52 GMT, Severin Gehwolf  wrote:

>> Please review this change to the cgroup v1 subsystem which makes it more 
>> resilient on some of the stranger systems. Unfortunately, I wasn't able to 
>> re-create a similar system as the reporter. The idea of using the longest 
>> substring match for the cgroupv1 file paths was based on the fact that on 
>> systemd systems processes run in separate scopes and the maven forked test 
>> runner might exhibit this property. For that it makes sense to use the 
>> common ancestor path. Nothing changes in the common cases where the 
>> `cgroup_path` matches `_root` and when the `_root` is `/` (container the 
>> former, host system the latter).
>> 
>> In addition, the code paths which are susceptible to throw NPE have been 
>> hardened to catch those situations. Should it happen in the future it makes 
>> more sense (to me) to not have accurate container detection in favor of 
>> continuing to keep running.
>> 
>> Finally, with the added unit-tests a bug was uncovered on the "substring" 
>> match case of cgroup paths in hotspot. `p` returned from `strstr` can never 
>> point to `_root` as it's used as the "needle" to find in "haystack" 
>> `cgroup_path` (not the other way round).
>> 
>> Testing:
>> - [x] Added unit tests
>> - [x] GHA
>> - [x] Container tests on cgroups v1 Linux. Continue to pass
>
> Severin Gehwolf has updated the pull request incrementally with two 
> additional commits since the last revision:
> 
>  - Refactor hotspot gtest
>  - Separate into function. Fix comment.

Changes requested by iklam (Reviewer).

src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 60:

> 58: strncpy(buf, _mount_point, MAXPATHLEN);
> 59: buf[MAXPATHLEN-1] = '\0';
> 60: _path = os::strdup(buf);

Comment on the old code: why this cannot be simply


_path = os::strdup(_mount_point);


Also, all the strncat calls in this function seem problematic, and should be 
converted to stringStream for consistency.

src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 63:

> 61:   } else {
> 62: char *p = strstr(cgroup_path, _root);
> 63: if (p != NULL && p == cgroup_path) {

What happens if cgroup_path is "/foo/bar" and _root is "/fo"?

Maybe this block should be combined with the new `else` block you're adding? 
However, the behavior seems opposite between these two blocks of code:

The upper block: _root is a prefix of cgroup_path, we take the **tail** of 
cgroup_path


  TestCase substring_match = {
"/sys/fs/cgroup/memory",   // mount_path
"/user.slice/user-1000.slice", // root_path
"/user.slice/user-1000.slice/user@1001.service",   // cgroup_path
"/sys/fs/cgroup/memory/user@1001.service"  // expected_path
  };


The lower block: The beginning of _root is a prefix of cgroup_path, we take the 
**head** of cgroup_path


 TestCase prefix_matched_cg = {
"/sys/fs/cgroup/memory",   // mount_path
"/user.slice/user-1000.slice/session-50.scope",// root_path
"/user.slice/user-1000.slice/session-3.scope", // cgroup_path
"/sys/fs/cgroup/memory/user.slice/user-1000.slice" // expected_path
  };


I find the behavior hard to understand. I think the rules (and reasons) should 
be added to the comment block above the function.

src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 86:

> 84:   const char* cgroup_p = cgroup_path;
> 85:   int last_slash = find_last_slash_pos(root_p, cgroup_p);
> 86:   assert(last_slash >= 0, "not an absolute path?");

Are root_p and cgroup_p directly read from the /proc/xxx files. If so, do we 
validate the input to make sure they are absolute paths?

It seems like our code cannot handle trailing '/' in the input. If so, we 
should clear all trailing '/' from the input string. Then, in functions that 
process them, we should assert that they don't end with slash. See my comment 
in find_last_slash_pos().

src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 102:

> 100:   for (int i = 0; *s1 == *s2 && *s1 != 0; i++) {
> 101: if (*s1 == '/') {
> 102:   last_matching_slash_pos = i;

I found the behavior a little hard to understand. Is it intentional?


"/cat/cow", "/cat/dog"-> "/cat/"
"/cat/","/cat/dog"-> "/cat/"
"/cat", "/cat/dog"-> "/"


The name `find_last_slash_pos` doesn't properly describe the behavior. I 
thought about `find_common_path_prefix`, but that doesn't match the current 
behavior (for the 3rd case, the common path prefix is `"/cat"`).

-

PR: https://git.openjdk.java.net/jdk/pull/8629


Re: RFR: 8286212: Cgroup v1 initialization causes NPE on some systems [v3]

2022-05-19 Thread Severin Gehwolf
On Thu, 19 May 2022 05:53:19 GMT, Ioi Lam  wrote:

>> Severin Gehwolf has updated the pull request incrementally with two 
>> additional commits since the last revision:
>> 
>>  - Refactor hotspot gtest
>>  - Separate into function. Fix comment.
>
> src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 60:
> 
>> 58: strncpy(buf, _mount_point, MAXPATHLEN);
>> 59: buf[MAXPATHLEN-1] = '\0';
>> 60: _path = os::strdup(buf);
> 
> Comment on the old code: why this cannot be simply
> 
> 
> _path = os::strdup(_mount_point);
> 
> 
> Also, all the strncat calls in this function seem problematic, and should be 
> converted to stringStream for consistency.

Agreed. I've filed https://bugs.openjdk.java.net/browse/JDK-8287007 to track 
this. I'd like to limit the changes of this patch to a minimum. It seems 
orthogonal.

> What happens if cgroup_path is "/foo/bar" and _root is "/fo"?

With a mount path of `/bar` this ends up being `/bar/o/bar`. Pretty strange, 
but then again it's a bit of a contrived example as those paths come from 
`/proc` parsing. Anyway, this is code that got added with 
[JDK-8146115](https://bugs.openjdk.java.net/browse/JDK-8146115). It's not 
something I've written and to be honest, I'm not sure this branch is needed, 
but I didn't want to change the existing behaviour with this patch. I have no 
more insight than you in terms of why that approach has been taken.

> Maybe this block should be combined with the new `else` block you're adding?

Maybe, but I'm not sure if it would break something.

> However, the behavior seems opposite between these two blocks of code:
> 
> The upper block: _root is a prefix of cgroup_path, we take the **tail** of 
> cgroup_path
> 
> ```
>   TestCase substring_match = {
> "/sys/fs/cgroup/memory",   // mount_path
> "/user.slice/user-1000.slice", // root_path
> "/user.slice/user-1000.slice/user@1001.service",   // cgroup_path
> "/sys/fs/cgroup/memory/user@1001.service"  // expected_path
>   };
> ```

Yes. Though, I cannot comment on why that has been chosen. It's been there 
since day one :-/

> The lower block: The beginning of _root is a prefix of cgroup_path, we take 
> the **head** of cgroup_path
> 
> ```
>  TestCase prefix_matched_cg = {
> "/sys/fs/cgroup/memory",   // mount_path
> "/user.slice/user-1000.slice/session-50.scope",// root_path
> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path
> "/sys/fs/cgroup/memory/user.slice/user-1000.slice" // expected_path
>   };
> ```
> 
> I find the behavior hard to understand. I think the rules (and reasons) 
> should be added to the comment block above the function.

The reason why I've gone down the path of adding the head of cgroup_path is 
because of this document (in conjunction to what the user was observing on an 
affected system):
https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/

The user was observing paths as listed in the test:

"/user.slice/user-1000.slice/session-50.scope",// root_path
"/user.slice/user-1000.slice/session-3.scope", // cgroup_path

This very much looks like systemd managed. Given that and knowing that systemd 
adds processes into `scopes` or `services` and groups them via `slices` and 
also knowing that cgroups are hierarchical (i.e. limits of `/foo/bar` also 
apply to `/foo/bar/baz`), it seems likely that if there are any limits, then 
it'll be on `/user.slice/user-1000.slice` within the mounted controller. 
Unfortunately, I'm not able to reproduce this myself.

-

PR: https://git.openjdk.java.net/jdk/pull/8629


Re: RFR: 8286212: Cgroup v1 initialization causes NPE on some systems [v3]

2022-05-19 Thread Severin Gehwolf
On Thu, 19 May 2022 06:00:06 GMT, Ioi Lam  wrote:

>> Severin Gehwolf has updated the pull request incrementally with two 
>> additional commits since the last revision:
>> 
>>  - Refactor hotspot gtest
>>  - Separate into function. Fix comment.
>
> src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 86:
> 
>> 84:   const char* cgroup_p = cgroup_path;
>> 85:   int last_slash = find_last_slash_pos(root_p, cgroup_p);
>> 86:   assert(last_slash >= 0, "not an absolute path?");
> 
> Are root_p and cgroup_p directly read from the /proc/xxx files. If so, do we 
> validate the input to make sure they are absolute paths?
> 
> It seems like our code cannot handle trailing '/' in the input. If so, we 
> should clear all trailing '/' from the input string. Then, in functions that 
> process them, we should assert that they don't end with slash. See my comment 
> in find_last_slash_pos().

Yes, those values come from `/proc/self/mountinfo` and `/proc/self/cgroup`. 
There is no validation being done. Then again, we only end up in this branch if 
the root path is not a substring of the cgroup path. In that case trailing 
slashes don't matter, since there would not be a character by character match 
earlier.

I'll add handling of trailing slashes and appropriate asserts where it makes 
sense.

-

PR: https://git.openjdk.java.net/jdk/pull/8629


Re: RFR: 8286212: Cgroup v1 initialization causes NPE on some systems [v3]

2022-05-19 Thread Severin Gehwolf
On Thu, 19 May 2022 05:58:31 GMT, Ioi Lam  wrote:

>> Severin Gehwolf has updated the pull request incrementally with two 
>> additional commits since the last revision:
>> 
>>  - Refactor hotspot gtest
>>  - Separate into function. Fix comment.
>
> src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 102:
> 
>> 100:   for (int i = 0; *s1 == *s2 && *s1 != 0; i++) {
>> 101: if (*s1 == '/') {
>> 102:   last_matching_slash_pos = i;
> 
> I found the behavior a little hard to understand. Is it intentional?
> 
> 
> "/cat/cow", "/cat/dog"-> "/cat/"
> "/cat/","/cat/dog"-> "/cat/"
> "/cat", "/cat/dog"-> "/"
> 
> 
> The name `find_last_slash_pos` doesn't properly describe the behavior. I 
> thought about `find_common_path_prefix`, but that doesn't match the current 
> behavior (for the 3rd case, the common path prefix is `"/cat"`).

It's supposed to find the common path prefix as you say. I'll fix it. Open to 
suggestions to make it easier to understand.

-

PR: https://git.openjdk.java.net/jdk/pull/8629


Re: RFR: 8286212: Cgroup v1 initialization causes NPE on some systems [v3]

2022-05-19 Thread Ioi Lam
On Thu, 19 May 2022 09:06:18 GMT, Severin Gehwolf  wrote:

>> src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 63:
>> 
>>> 61:   } else {
>>> 62: char *p = strstr(cgroup_path, _root);
>>> 63: if (p != NULL && p == cgroup_path) {
>> 
>> What happens if cgroup_path is "/foo/bar" and _root is "/fo"?
>> 
>> Maybe this block should be combined with the new `else` block you're adding? 
>> However, the behavior seems opposite between these two blocks of code:
>> 
>> The upper block: _root is a prefix of cgroup_path, we take the **tail** of 
>> cgroup_path
>> 
>> 
>>   TestCase substring_match = {
>> "/sys/fs/cgroup/memory",   // mount_path
>> "/user.slice/user-1000.slice", // root_path
>> "/user.slice/user-1000.slice/user@1001.service",   // cgroup_path
>> "/sys/fs/cgroup/memory/user@1001.service"  // expected_path
>>   };
>> 
>> 
>> The lower block: The beginning of _root is a prefix of cgroup_path, we take 
>> the **head** of cgroup_path
>> 
>> 
>>  TestCase prefix_matched_cg = {
>> "/sys/fs/cgroup/memory",   // mount_path
>> "/user.slice/user-1000.slice/session-50.scope",// root_path
>> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path
>> "/sys/fs/cgroup/memory/user.slice/user-1000.slice" // expected_path
>>   };
>> 
>> 
>> I find the behavior hard to understand. I think the rules (and reasons) 
>> should be added to the comment block above the function.
>
>> What happens if cgroup_path is "/foo/bar" and _root is "/fo"?
> 
> With a mount path of `/bar` this ends up being `/bar/o/bar`. Pretty strange, 
> but then again it's a bit of a contrived example as those paths come from 
> `/proc` parsing. Anyway, this is code that got added with 
> [JDK-8146115](https://bugs.openjdk.java.net/browse/JDK-8146115). It's not 
> something I've written and to be honest, I'm not sure this branch is needed, 
> but I didn't want to change the existing behaviour with this patch. I have no 
> more insight than you in terms of why that approach has been taken.
> 
>> Maybe this block should be combined with the new `else` block you're adding?
> 
> Maybe, but I'm not sure if it would break something.
> 
>> However, the behavior seems opposite between these two blocks of code:
>> 
>> The upper block: _root is a prefix of cgroup_path, we take the **tail** of 
>> cgroup_path
>> 
>> ```
>>   TestCase substring_match = {
>> "/sys/fs/cgroup/memory",   // mount_path
>> "/user.slice/user-1000.slice", // root_path
>> "/user.slice/user-1000.slice/user@1001.service",   // cgroup_path
>> "/sys/fs/cgroup/memory/user@1001.service"  // expected_path
>>   };
>> ```
> 
> Yes. Though, I cannot comment on why that has been chosen. It's been there 
> since day one :-/
> 
>> The lower block: The beginning of _root is a prefix of cgroup_path, we take 
>> the **head** of cgroup_path
>> 
>> ```
>>  TestCase prefix_matched_cg = {
>> "/sys/fs/cgroup/memory",   // mount_path
>> "/user.slice/user-1000.slice/session-50.scope",// root_path
>> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path
>> "/sys/fs/cgroup/memory/user.slice/user-1000.slice" // expected_path
>>   };
>> ```
>> 
>> I find the behavior hard to understand. I think the rules (and reasons) 
>> should be added to the comment block above the function.
> 
> The reason why I've gone down the path of adding the head of cgroup_path is 
> because of this document (in conjunction to what the user was observing on an 
> affected system):
> https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
> 
> The user was observing paths as listed in the test:
> 
> "/user.slice/user-1000.slice/session-50.scope",// root_path
> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path
> 
> This very much looks like systemd managed. Given that and knowing that 
> systemd adds processes into `scopes` or `services` and groups them via 
> `slices` and also knowing that cgroups are hierarchical (i.e. limits of 
> `/foo/bar` also apply to `/foo/bar/baz`), it seems likely that if there are 
> any limits, then it'll be on `/user.slice/user-1000.slice` within the mounted 
> controller. Unfortunately, I'm not able to reproduce this myself.

I am wondering if the problem is this:

We have systemd running on the host, and a different copy of systemd that runs 
inside the container.

- They both set up `/user.slice/user-1000.slice/session-??.scope` within their 
own file systems
- For some reason, when you're looking inside the container, 
`/proc/self/cgroup` might use a path in the containerized file system whereas 
`/proc/self/mountinfo` uses a path in the host file system. These two paths may 
look alike but they have absolutely no relation to each other.

I have asked the reporter for more information:

https://gist.github.c

Re: RFR: 8286212: Cgroup v1 initialization causes NPE on some systems [v3]

2022-05-19 Thread Ioi Lam
On Thu, 19 May 2022 19:59:18 GMT, Ioi Lam  wrote:

>>> What happens if cgroup_path is "/foo/bar" and _root is "/fo"?
>> 
>> With a mount path of `/bar` this ends up being `/bar/o/bar`. Pretty strange, 
>> but then again it's a bit of a contrived example as those paths come from 
>> `/proc` parsing. Anyway, this is code that got added with 
>> [JDK-8146115](https://bugs.openjdk.java.net/browse/JDK-8146115). It's not 
>> something I've written and to be honest, I'm not sure this branch is needed, 
>> but I didn't want to change the existing behaviour with this patch. I have 
>> no more insight than you in terms of why that approach has been taken.
>> 
>>> Maybe this block should be combined with the new `else` block you're adding?
>> 
>> Maybe, but I'm not sure if it would break something.
>> 
>>> However, the behavior seems opposite between these two blocks of code:
>>> 
>>> The upper block: _root is a prefix of cgroup_path, we take the **tail** of 
>>> cgroup_path
>>> 
>>> ```
>>>   TestCase substring_match = {
>>> "/sys/fs/cgroup/memory",   // mount_path
>>> "/user.slice/user-1000.slice", // root_path
>>> "/user.slice/user-1000.slice/user@1001.service",   // cgroup_path
>>> "/sys/fs/cgroup/memory/user@1001.service"  // expected_path
>>>   };
>>> ```
>> 
>> Yes. Though, I cannot comment on why that has been chosen. It's been there 
>> since day one :-/
>> 
>>> The lower block: The beginning of _root is a prefix of cgroup_path, we take 
>>> the **head** of cgroup_path
>>> 
>>> ```
>>>  TestCase prefix_matched_cg = {
>>> "/sys/fs/cgroup/memory",   // mount_path
>>> "/user.slice/user-1000.slice/session-50.scope",// root_path
>>> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path
>>> "/sys/fs/cgroup/memory/user.slice/user-1000.slice" // expected_path
>>>   };
>>> ```
>>> 
>>> I find the behavior hard to understand. I think the rules (and reasons) 
>>> should be added to the comment block above the function.
>> 
>> The reason why I've gone down the path of adding the head of cgroup_path is 
>> because of this document (in conjunction to what the user was observing on 
>> an affected system):
>> https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
>> 
>> The user was observing paths as listed in the test:
>> 
>> "/user.slice/user-1000.slice/session-50.scope",// root_path
>> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path
>> 
>> This very much looks like systemd managed. Given that and knowing that 
>> systemd adds processes into `scopes` or `services` and groups them via 
>> `slices` and also knowing that cgroups are hierarchical (i.e. limits of 
>> `/foo/bar` also apply to `/foo/bar/baz`), it seems likely that if there are 
>> any limits, then it'll be on `/user.slice/user-1000.slice` within the 
>> mounted controller. Unfortunately, I'm not able to reproduce this myself.
>
> I am wondering if the problem is this:
> 
> We have systemd running on the host, and a different copy of systemd that 
> runs inside the container.
> 
> - They both set up `/user.slice/user-1000.slice/session-??.scope` within 
> their own file systems
> - For some reason, when you're looking inside the container, 
> `/proc/self/cgroup` might use a path in the containerized file system whereas 
> `/proc/self/mountinfo` uses a path in the host file system. These two paths 
> may look alike but they have absolutely no relation to each other.
> 
> I have asked the reporter for more information:
> 
> https://gist.github.com/gaol/4d96eace8290e6549635fdc0ea41d0b4?permalink_comment_id=4172593#gistcomment-4172593
> 
> Meanwhile, I think the current method of finding "which directory under 
> /sys/fs/cgroup/memory controls my memory usage" is broken. As mentioned 
> about, the path you get from  `/proc/self/cgroup`  and `/proc/self/mountinfo` 
> have no relation to each other, but we use them anyway to get our answer, 
> with many ad-hoc methods that are not documented in the code.
> 
> Maybe we should do this instead?
> 
> - Read /proc/self/cgroup
> - Find the `10:memory:` line
> - If `/sys/fs/cgroup/memory//tasks` contains my PID, this is the path
> - Otherwise, scan all `tasks` files under  `/sys/fs/cgroup/memory/`. Exactly 
> one of them contains my PID.
> 
> For example, here's a test with docker:
> 
> 
> INSIDE CONTAINER
> # cat /proc/self/cgroup | grep memory
> 10:memory:/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050
> # cat /proc/self/mountinfo | grep memory
> 801 791 0:42 
> /docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050 
> /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime master:23 - cgroup 
> cgroup rw,memory
> # cat 
> /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
> cat: 
> /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e30

Re: RFR: 8286212: Cgroup v1 initialization causes NPE on some systems [v3]

2022-05-23 Thread Severin Gehwolf
On Thu, 19 May 2022 20:18:50 GMT, Ioi Lam  wrote:

>> I am wondering if the problem is this:
>> 
>> We have systemd running on the host, and a different copy of systemd that 
>> runs inside the container.
>> 
>> - They both set up `/user.slice/user-1000.slice/session-??.scope` within 
>> their own file systems
>> - For some reason, when you're looking inside the container, 
>> `/proc/self/cgroup` might use a path in the containerized file system 
>> whereas `/proc/self/mountinfo` uses a path in the host file system. These 
>> two paths may look alike but they have absolutely no relation to each other.
>> 
>> I have asked the reporter for more information:
>> 
>> https://gist.github.com/gaol/4d96eace8290e6549635fdc0ea41d0b4?permalink_comment_id=4172593#gistcomment-4172593
>> 
>> Meanwhile, I think the current method of finding "which directory under 
>> /sys/fs/cgroup/memory controls my memory usage" is broken. As mentioned 
>> about, the path you get from  `/proc/self/cgroup`  and 
>> `/proc/self/mountinfo` have no relation to each other, but we use them 
>> anyway to get our answer, with many ad-hoc methods that are not documented 
>> in the code.
>> 
>> Maybe we should do this instead?
>> 
>> - Read /proc/self/cgroup
>> - Find the `10:memory:` line
>> - If `/sys/fs/cgroup/memory//tasks` contains my PID, this is the path
>> - Otherwise, scan all `tasks` files under  `/sys/fs/cgroup/memory/`. Exactly 
>> one of them contains my PID.
>> 
>> For example, here's a test with docker:
>> 
>> 
>> INSIDE CONTAINER
>> # cat /proc/self/cgroup | grep memory
>> 10:memory:/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050
>> # cat /proc/self/mountinfo | grep memory
>> 801 791 0:42 
>> /docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050 
>> /sys/fs/cgroup/memory ro,nosuid,nodev,noexec,relatime master:23 - cgroup 
>> cgroup rw,memory
>> # cat 
>> /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
>> cat: 
>> /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks:
>>  No such file or directory
>> # cat /sys/fs/cgroup/memory/tasks | grep $$
>> 1
>> 
>> ON HOST
>> # cat 
>> /sys/fs/cgroup/memory/docker/40ea0ab8eaa0469d8d852b7f1d264b6a451a1c2fe20924cd2de874da5f2e3050/tasks
>> 37494
>> # cat /proc/37494/status | grep NSpid
>> NSpid:   37494   1
>
> Also, I think the current PR could produce the wrong answer, if systemd is 
> indeed running inside the container, and we have:
> 
> 
> "/user.slice/user-1000.slice/session-50.scope",// root_path
> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path
> 
> 
> The PR gives /sys/fs/cgroup/memory/user.slice/user-1000.slice/, which 
> specifies the overall memory limit for user-1000. However, the correct answer 
> may be /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-3.scope, 
> which may have a smaller memory limit, and the JVM may end up allocating a 
> larger heap than allowed.

Yes, if we can decide which one the right file is. This is largely undocumented 
territory. The correct fix is a) find the correct path to the namespace 
hierarchy the process is a part of. b) starting at the leaf node, walk up the 
hierarchy and find the **lowest** limits. Doing this would be very expensive!

Aside: Current container detection in the JVM/JDK is notoriously imprecise. 
It's largely based on common setups (containers like docker). The heuristics 
assume that memory limits are reported inside the container at the leaf node. 
If, however, that's not the case, the detected limits will be wrong (it will 
detect it as unlimited, even though it's - for example - memory constrained at 
the parent). This can for example be reproduced on a cgroups v2 system with a 
systemd slice using memory limits. We've worked-around this in OpenJDK for 
cgroups v1 by https://bugs.openjdk.java.net/browse/JDK-8217338

-

PR: https://git.openjdk.java.net/jdk/pull/8629


Re: RFR: 8286212: Cgroup v1 initialization causes NPE on some systems [v3]

2022-05-23 Thread Severin Gehwolf
On Mon, 23 May 2022 09:24:19 GMT, Severin Gehwolf  wrote:

>> Also, I think the current PR could produce the wrong answer, if systemd is 
>> indeed running inside the container, and we have:
>> 
>> 
>> "/user.slice/user-1000.slice/session-50.scope",// root_path
>> "/user.slice/user-1000.slice/session-3.scope", // cgroup_path
>> 
>> 
>> The PR gives /sys/fs/cgroup/memory/user.slice/user-1000.slice/, which 
>> specifies the overall memory limit for user-1000. However, the correct 
>> answer may be 
>> /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-3.scope, which may 
>> have a smaller memory limit, and the JVM may end up allocating a larger heap 
>> than allowed.
>
> Yes, if we can decide which one the right file is. This is largely 
> undocumented territory. The correct fix is a) find the correct path to the 
> namespace hierarchy the process is a part of. b) starting at the leaf node, 
> walk up the hierarchy and find the **lowest** limits. Doing this would be 
> very expensive!
> 
> Aside: Current container detection in the JVM/JDK is notoriously imprecise. 
> It's largely based on common setups (containers like docker). The heuristics 
> assume that memory limits are reported inside the container at the leaf node. 
> If, however, that's not the case, the detected limits will be wrong (it will 
> detect it as unlimited, even though it's - for example - memory constrained 
> at the parent). This can for example be reproduced on a cgroups v2 system 
> with a systemd slice using memory limits. We've worked-around this in OpenJDK 
> for cgroups v1 by https://bugs.openjdk.java.net/browse/JDK-8217338

> Maybe we should do this instead?
> 
> * Read /proc/self/cgroup
> 
> * Find the `10:memory:` line
> 
> * If `/sys/fs/cgroup/memory//tasks` contains my PID, this is the 
> path
> 
> * Otherwise, scan all `tasks` files under  `/sys/fs/cgroup/memory/`. 
> Exactly one of them contains my PID.

Something like that seems most promising, but it would have to be 
`cgroup.procs` not `tasks` as `tasks` is the task id (i.e. Linux's thread), not 
the process. We could keep the two common cases as short circuiting. I.e. host 
and docker cases in the test.

-

PR: https://git.openjdk.java.net/jdk/pull/8629