[systemd-devel] Questions around cgroups, systemd, containers

2022-05-20 Thread Lewis Gaul
Hi all,

I've been trying to get a deeper understanding of Linux cgroups and their
use with containers/systemd over the last few months. I have a few
questions, but given the amount of context around the questions I've
written up my understanding in a blog post at
https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/ and the
questions in another blog post at
https://www.lewisgaul.co.uk/blog/coding/rough/2022/05/20/cgroups-questions/.

If anyone has any thoughts/input/answers that would be much appreciated!
I'm planning on cross-posting in a few places such as the
podman/docker/kernel mailing lists/communities, but any input specific to
the systemd-oriented questions would be particularly appreciated.

To summarize the questions (taken from the second post linked above):
- Why are private cgroups mounted read-only in non-privileged containers?
- Is it sound to override Docker’s mounting of the private container
cgroups under v1?
  - What are the concerns around the approach of passing '-v
/sys/fs/cgroup:/sys/fs/cgroup' in terms of the container’s view of its
cgroups?
  - Is modifying/replacing the cgroup mounts set up by the container engine
a reasonable workaround, or could this be fragile?
- When is it valid to manually manipulate container cgroups?
  - Do container managers such as Docker and Podman correctly delegate
cgroups on hosts running systemd?
  - Are these container managers happy for the container to take ownership
of the container’s cgroup?
- Why are the container’s cgroup limits not set on a parent cgroup under
Docker/Podman?
  - Why doesn’t Docker use another layer of indirection in the cgroup
hierarchy such that the limit is applied in the parent cgroup to the
container?
- What happens if you have two of the same cgroup mount?
  - Are there any gotchas/concerns around manipulating cgroups via multiple
mount points?
- What’s the correct way to check which controllers are enabled?
  - What is it that determines which controllers are enabled? Is it kernel
configuration applied at boot?
  - Is it possible to have some controllers enabled for v1 at the same time
as others are enabled for v2?

Thanks in advance,
Lewis


Re: [systemd-devel] Questions around cgroups, systemd, containers

2022-05-21 Thread Lewis Gaul
ges required to modify (or unmount and recreate)
the cgroup mounts.

> > - When is it valid to manually manipulate container cgroups?
>
> When you asked for your own delegated subtree first, see docs:
> https://systemd.io/CGROUP_DELEGATION

Yep, I've read that multiple times; the following questions elaborate on
whether container managers consider the container cgroups 'delegated'
from their perspective and whether they're correctly using systemd
delegation. I realise this is probably more of a question for
docker/podman.

> >   - Do container managers such as Docker and Podman correctly delegate
cgroups on hosts running Systemd?
>
> podman probably does this correctly. docker didn't do, not sure if that
changed.

My guess is that this might relate to the container's 'cgroup
manager/driver' corresponding to podman's '--cgroup-manager=systemd' arg,
discussed at
https://www.lewisgaul.co.uk/blog/coding/2022/05/13/cgroups-intro/#cgroup-driver-options.
If so, I believe docker has switched back to 'systemd' being the default
under cgroups v2.

>   - Are these container managers happy for the container to take
ownership of the container’s cgroup?
>
> I am not sure I grok this question, but a correctly implemented container
manager should be able to safely run cgroups-using payloads inside the
container. In that model, a host systemd manages the root of the tree, the
container manager a cgroup further down, and the payload of the container
(for example another systemd run inside the container) the stuff below.

You've answered my question at least from the theoretical side, thanks;
this was what I had expected.

> > - Why are the container’s cgroup limits not set on a parent cgroup
under Docker/Podman?
>
> I don't grok the question?
>
> >   - Why doesn’t Docker use another layer of indirection in the
cgroup hierarchy such that the limit is applied in the parent cgroup to
the container?
>
> I don't understand the question. And I can't answer docker questions.

This is explained at
https://www.lewisgaul.co.uk/blog/coding/rough/2022/05/20/cgroups-questions/#why-are-the-containers-cgroup-limits-not-set-on-a-parent-cgroup-under-dockerpodman.
I'm basically questioning why a cgroup limit applied by e.g. 'docker run
--memory=2000' is applied in a cgroup that is made available
in/delegated to the container, such that the container is able to modify
its own limit (if it has write access). It feels like there's a missing
cgroup layer in this setup. If others agree with this assessment then I
would be happy to bring it up on the docker/podman issue trackers.

> > - What happens if you have two of the same cgroup mount?
>
> what do you mean by a "cgroup mount"? A cgroupfs controller mount? If
they are within the same cgroup namespace they will be effectively bind
mounts of each other, i.e. show the exact same contents.

Yes that's what I meant, and this confirms what I believed to be the case,
thanks.

> >   - Are there any gotchas/concerns around manipulating cgroups via
multiple mount points?
>
> Why would you do that though?

I'm not sure; I'm just trying to better understand how cgroups work and
what's going on when creating/manipulating cgroup mounts.

> > - What’s the correct way to check which controllers are enabled?
>
> enabled *in* *what*? in the kernel? /proc/cgroups. Mounted? "mount"
maybe? in your container mgr? depends on that.
>
> >   - What is it that determines which controllers are enabled? Is it
kernel configuration applied at boot?
>
> Enabled where?

I meant in the kernel, i.e. which controllers it's possible to create
mounts for and use.
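
For example, a small sketch of answering that question by parsing
/proc/cgroups (whose fourth column is the kernel's 'enabled' flag); on a
v2 host the active controllers are instead listed in
/sys/fs/cgroup/cgroup.controllers:

```shell
#!/bin/sh
# Sketch: list the cgroup v1 controllers the kernel has enabled, by parsing
# /proc/cgroups (columns: subsys_name, hierarchy, num_cgroups, enabled).
# The header line (starting with '#') is skipped.
enabled_v1_controllers() {
    awk 'NR > 1 && $4 == 1 { print $1 }'
}

# usage: enabled_v1_controllers < /proc/cgroups
```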

> >   - Is it possible to have some controllers enabled for v1 at the same
time as others are enabled for v2?
>
> Yes.

Ah OK, that's interesting. So it's not always possible to say "the host's
active cgroup version is {1,2}"; it has to be stated per controller, e.g.
"the cgroup memory controller is enabled on version {1,2}". In practice,
is this likely to be encountered in the wild [on a host running systemd]?
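
For anyone else following along, my understanding is that the split is
chosen at boot via kernel (and systemd) command-line parameters. An
illustrative example of the mix, not a recommendation:

```
cgroup_no_v1=memory                  # withhold only the memory controller from v1 (usable on v2)
cgroup_no_v1=all                     # alternatively: no v1 controllers at all
systemd.unified_cgroup_hierarchy=1   # have systemd mount the pure unified (v2) hierarchy
```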

Thanks,
Lewis

On Sat, 21 May 2022 at 08:48, Lennart Poettering 
wrote:

> On Fr, 20.05.22 17:12, Lewis Gaul (lewis.g...@gmail.com) wrote:
>
> > To summarize the questions (taken from the second post linked above):
> > - Why are private cgroups mounted read-only in non-privileged
> > containers?
>
> "private cgroups"? What do you mean by that? The controllers?
>
> Controller delegation on cgroupsv1 is simply not safe, that's all. You
> can provide invalid configuration to the kernel, and DoS the machine
> through it. cgroups 

[systemd-devel] Unable to check 'effective' cgroup limits

2022-06-09 Thread Lewis Gaul
Hi everyone,

[Disclaimer: cross posting from
https://github.com/containers/podman/discussions/14538]

Apologies that this is more of a Linux cgroup question than specific to
systemd, but I was wondering if someone here might be able to enlighten
me...

Two questions:

   - Why on cgroups v1 do the cpuset controller's
   cpuset.effective_{cpus,mems} seem to simply not work?
   - Is there any way to check effective cgroup memory or hugetlb limits?
   (cgroups v1 or v2)

Cpuset effective limits

root@ubuntu:~# podman run --rm -it --privileged -w /sys/fs/cgroup fedora
[root@7b9b67c7e1d4 cgroup]# mkdir cpuset/my-group
[root@7b9b67c7e1d4 cgroup]# cat cpuset/cpuset.cpus
0-5
[root@7b9b67c7e1d4 cgroup]# cat cpuset/my-group/cpuset.cpus

[root@7b9b67c7e1d4 cgroup]# cat cpuset/my-group/cpuset.effective_cpus

Expected cpuset/my-group/cpuset.effective_cpus to give 0-5 as set in the
parent cgroup. Works as expected on cgroups v2.

[root@7b9b67c7e1d4 cgroup]# echo 0-5 > cpuset/my-group/cpuset.cpus
[root@7b9b67c7e1d4 cgroup]# cat cpuset/my-group/cpuset.{effective_,}cpus
0-5
0-5
[root@7b9b67c7e1d4 cgroup]# echo 0-4 > cpuset/cpuset.cpus
bash: echo: write error: Device or resource busy

Didn't expect this to fail - shouldn't it automatically impose a stricter
limit on any child cgroups? Do I need to manually update all child cgroups
first?

[root@7b9b67c7e1d4 cgroup]# echo 0-4 > cpuset/my-group/cpuset.cpus
[root@7b9b67c7e1d4 cgroup]# cat cpuset/my-group/cpuset.{effective_,}cpus
0-4
0-4

Can impose a stricter limit on child cgroups, as expected.
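
The child-first update order can be captured in a small helper (a
hypothetical sketch for the cgroup v1 cpuset controller, where writing a
narrower range to the parent fails with EBUSY while any child still
includes the CPUs being removed):

```shell
#!/bin/sh
# Sketch (cgroup v1 cpuset): shrink the cpuset of all immediate children
# before the parent. Assumes the new range is a subset of the parent's
# current range, so the child writes are valid.
shrink_cpuset() {
    parent="$1"; cpus="$2"
    for child in "$parent"/*/; do
        if [ -f "${child}cpuset.cpus" ]; then
            echo "$cpus" > "${child}cpuset.cpus"
        fi
    done
    echo "$cpus" > "$parent/cpuset.cpus"
}

# usage: shrink_cpuset /sys/fs/cgroup/cpuset 0-4
```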

[root@7b9b67c7e1d4 cgroup]# echo 0-4 > cpuset/cpuset.cpus
[root@7b9b67c7e1d4 cgroup]# echo 0-5 > cpuset/my-group/cpuset.cpus
bash: echo: write error: Permission denied

But can't relax the child's cgroup restriction (i.e. need awareness of CPU
restrictions already imposed above - how are you supposed to check this in
a private cgroup namespace?).
Memory/Hugetlb effective limits

On cgroups v1:

[root@7b9b67c7e1d4 cgroup]# ls memory/
cgroup.clone_children           memory.kmem.tcp.failcnt             memory.oom_control
cgroup.event_control            memory.kmem.tcp.limit_in_bytes      memory.pressure_level
cgroup.procs                    memory.kmem.tcp.max_usage_in_bytes  memory.soft_limit_in_bytes
memory.failcnt                  memory.kmem.tcp.usage_in_bytes      memory.stat
memory.force_empty              memory.kmem.usage_in_bytes          memory.swappiness
memory.kmem.failcnt             memory.limit_in_bytes               memory.usage_in_bytes
memory.kmem.limit_in_bytes      memory.max_usage_in_bytes           memory.use_hierarchy
memory.kmem.max_usage_in_bytes  memory.move_charge_at_immigrate     notify_on_release
memory.kmem.slabinfo            memory.numa_stat                    tasks

There is a memory.limit_in_bytes file, but no
memory.effective_limit_in_bytes to reflect parent cgroup restrictions.

Similarly on cgroups v2:

[root@0c0d71230663 cgroup]# ls memory.*
memory.current       memory.max        memory.stat
memory.events        memory.min        memory.swap.current
memory.events.local  memory.numa_stat  memory.swap.events
memory.high          memory.oom.group  memory.swap.high
memory.low           memory.pressure   memory.swap.max

There is a memory.max file, but not memory.max.effective (corresponding to
cpuset.cpus.effective).

I guess you could traverse up the cgroup hierarchy to find the smallest
limit being imposed... But this isn't possible inside a private cgroup
namespace. Is there any way to find the actual cgroup limit imposed?
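
For illustration, that traversal approach sketched for cgroups v2 (a
hypothetical helper; it only works where the ancestor cgroup directories
are visible, so not from inside a private cgroup namespace, which is
exactly the problem described above):

```shell
#!/bin/sh
# Sketch: walk up the cgroup v2 hierarchy from a cgroup directory and report
# the smallest memory.max on the path, i.e. the effective limit.
# "max" means unlimited at a given level.
effective_memory_max() {
    dir="$1"   # e.g. /sys/fs/cgroup/machine.slice/libpod-<id>.scope
    root="$2"  # the cgroup2 mount point, e.g. /sys/fs/cgroup
    lowest="max"
    while [ -n "$dir" ] && [ "$dir" != "$root" ] && [ "$dir" != "/" ]; do
        if [ -r "$dir/memory.max" ]; then
            val=$(cat "$dir/memory.max")
            if [ "$val" != "max" ]; then
                if [ "$lowest" = "max" ] || [ "$val" -lt "$lowest" ]; then
                    lowest="$val"
                fi
            fi
        fi
        dir=${dir%/*}   # move to the parent directory
    done
    echo "$lowest"
}
```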



Any insights welcome!


Thanks,

Lewis


[systemd-devel] Container restart issue: Failed to attach 1 to compat systemd cgroup

2023-01-09 Thread Lewis Gaul
Hi all,

I've come across an issue when restarting a systemd container, which I'm
seeing on a CentOS 8.2 VM but not able to reproduce on an Ubuntu 20.04 VM
(both cgroups v1).

The failure looks as follows, hitting the warning condition at
https://github.com/systemd/systemd/blob/v245/src/shared/cgroup-setup.c#L279:

[root@localhost ubuntu-systemd]# podman run -it --privileged --name ubuntu
--detach ubuntu-systemd
5e4ab2a36681c092f4ef937cf03b25a8d3d7b2fa530559bf4dac4079c84d0313

[root@localhost ubuntu-systemd]# podman restart ubuntu
5e4ab2a36681c092f4ef937cf03b25a8d3d7b2fa530559bf4dac4079c84d0313

[root@localhost ubuntu-systemd]# podman logs ubuntu | grep -B6 -A2 'Set
hostname'
systemd 245.4-4ubuntu3.19 running in system mode. (+PAM +AUDIT +SELINUX
+IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL
+XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2
default-hierarchy=hybrid)
Detected virtualization podman.
Detected architecture x86-64.

Welcome to Ubuntu 20.04.5 LTS!

Set hostname to <5e4ab2a36681>.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Created slice system-modprobe.slice.
--
systemd 245.4-4ubuntu3.19 running in system mode. (+PAM +AUDIT +SELINUX
+IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL
+XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2
default-hierarchy=hybrid)
Detected virtualization podman.
Detected architecture x86-64.

Welcome to Ubuntu 20.04.5 LTS!

Set hostname to <5e4ab2a36681>.

Failed to attach 1 to compat systemd cgroup
/machine.slice/libpod-5e4ab2a36681c092f4ef937cf03b25a8d3d7b2fa530559bf4dac4079c84d0313.scope/init.scope:
No such file or directory
[  OK  ] Created slice system-getty.slice.


If using docker instead of podman (still on CentOS 8.2) the container
actually exits after restart (when hitting the code at
https://github.com/systemd/systemd/blob/v245/src/core/cgroup.c#L2972):

[root@localhost ubuntu-systemd]# docker logs ubuntu | grep -C5 'Set
hostname'
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Ubuntu 20.04.5 LTS!

Set hostname to <523caa1f03e9>.
[  OK  ] Created slice system-getty.slice.
[  OK  ] Created slice system-modprobe.slice.
[  OK  ] Created slice User and Session Slice.
[  OK  ] Started Dispatch Password Requests to Console Directory Watch.
[  OK  ] Started Forward Password Requests to Wall Directory Watch.
--
Detected virtualization docker.
Detected architecture x86-64.

Welcome to Ubuntu 20.04.5 LTS!

Set hostname to <523caa1f03e9>.




Failed to attach 1 to compat systemd cgroup
/system.slice/docker-523caa1f03e9c96a6a12a55fb07df995c6e4b3a27e18585cbeda869b943ae728.scope/init.scope:
No such file or directory
Failed to open pin file: No such file or directory
Failed to allocate manager object: No such file or directory
[!!] Failed to allocate manager object.
Exiting PID 1...


Does anyone know what might be causing this? Is it a systemd bug? I can
copy the info into a GitHub issue if that's helpful.

Thanks,
Lewis


Re: [systemd-devel] Container restart issue: Failed to attach 1 to compat systemd cgroup

2023-01-10 Thread Lewis Gaul
Following 'setenforce 0' I still see the same issue (I was also suspecting
SELinux!).

A few additional data points:
- this was not seen when using systemd v230 inside the container
- this is also seen on CentOS 8.4
- this is seen under docker even if the container's cgroup driver is
changed from 'cgroupfs' to 'systemd'

Thanks,
Lewis

On Tue, 10 Jan 2023 at 11:12, Lennart Poettering 
wrote:

> On Mo, 09.01.23 19:45, Lewis Gaul (lewis.g...@gmail.com) wrote:
>
> > Hi all,
> >
> > I've come across an issue when restarting a systemd container, which I'm
> > seeing on a CentOS 8.2 VM but not able to reproduce on an Ubuntu 20.04 VM
> > (both cgroups v1).
>
> selinux?
>
> Lennart
>
> --
> Lennart Poettering, Berlin
>


Re: [systemd-devel] Container restart issue: Failed to attach 1 to compat systemd cgroup

2023-01-10 Thread Lewis Gaul
I'm aware of the higher level of collaboration between podman and systemd
compared to docker, hence primarily raising this issue from a podman angle.

In privileged mode all mounts are read-write, so yes the container has
write access to the cgroup filesystem. (Podman also ensures write access to
the systemd cgroup subsystem mount in non-privileged mode by default).

On first boot PID 1 can be found in
/sys/fs/cgroup/systemd/machine.slice/libpod-.scope/init.scope/cgroup.procs,
whereas when the container restarts the 'init.scope/' directory does not
exist and PID 1 is instead found in the parent (container root) cgroup
/sys/fs/cgroup/systemd/machine.slice/libpod-.scope/cgroup.procs
(also reflected by /proc/1/cgroup). This is strange because systemd must
be the one creating this cgroup dir on the initial boot, so I'm not sure
why it wouldn't on a subsequent boot.
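
For reference, the PID 1 location check above can be scripted; a sketch
parsing the 'id:controllers:path' lines of /proc/1/cgroup:

```shell
#!/bin/sh
# Sketch: print the path of PID 1 in the name=systemd v1 hierarchy, given
# /proc/1/cgroup-style input (lines of the form "id:controllers:path").
# On a healthy first boot this ends in .../init.scope; after the restart
# described above it is the bare container-scope path.
systemd_cgroup_path() {
    awk -F: '$2 == "name=systemd" { print $3 }'
}

# usage: systemd_cgroup_path < /proc/1/cgroup
```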

I can confirm that the container has permissions since executing a 'mkdir'
in /sys/fs/cgroup/systemd/machine.slice/libpod-.scope/ inside the
container succeeds after the restart, so I have no idea why systemd is not
creating the 'init.scope/' dir. I notice that inside the container's
systemd cgroup mount 'system.slice/' does exist, but 'user.slice/' (like
'init.scope/') is missing; both exist on a normal boot. Is there any way
I can find systemd logs that might indicate why the cgroup dir creation
is failing?

One final datapoint: the same is seen when using a private cgroup namespace
(via 'podman run --cgroupns=private'), although the error is then, as
expected, "Failed to attach 1 to compat systemd cgroup /init.scope: No such
file or directory".

I could raise this with the podman team, but it seems more in the systemd
area given it's a systemd warning and I would expect systemd to be creating
this cgroup dir?

Thanks,
Lewis

On Tue, 10 Jan 2023 at 14:48, Lennart Poettering 
wrote:

> On Di, 10.01.23 13:18, Lewis Gaul (lewis.g...@gmail.com) wrote:
>
> > Following 'setenforce 0' I still see the same issue (I was also
> suspecting
> > SELinux!).
> >
> > A few additional data points:
> > - this was not seen when using systemd v230 inside the container
> > - this is also seen on CentOS 8.4
> > - this is seen under docker even if the container's cgroup driver is
> > changed from 'cgroupfs' to 'systemd'
>
> docker is garbage. They are hostile towards running systemd inside
> containers.
>
> podman upstream is a lot friendly, and apparently what everyone in OCI
> is going towards these days.
>
> I have not much experience with podman though, and in particular not
> old versions. Next step would probably be to look at what precisely
> causes the permission issue, via strace.
>
> but did you make sure your container actually gets write access to the
> cgroup trees?
>
> anyway, i'd recommend asking the podman community for help about this.
>
> Lennart
>
> --
> Lennart Poettering, Berlin
>


Re: [systemd-devel] Container restart issue: Failed to attach 1 to compat systemd cgroup

2023-01-10 Thread Lewis Gaul
I omitted one piece of information about running with --cgroupns=private,
thinking it was unrelated, but it now appears it may be related (and
perhaps highlights a variant of the issue that is seen on first boot, not
only on container restart). Again (and what makes me think it's related),
I can reproduce this on a CentOS host but not on Ubuntu (still with
SELinux in 'permissive' mode).

[root@localhost ~]# podman run -it --name ubuntu --privileged --cgroupns
private ubuntu-systemd
systemd 245.4-4ubuntu3.19 running in system mode. (+PAM +AUDIT +SELINUX
+IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL
+XZ +LZ4 +SECCOMP +BLKID +ELFUTI)
Detected virtualization podman.
Detected architecture x86-64.

Welcome to Ubuntu 20.04.5 LTS!

Set hostname to .


Couldn't move remaining userspace processes, ignoring: Input/output error
Failed to create compat systemd cgroup /system.slice: No such file or directory
Failed to create compat systemd cgroup /system.slice/system-getty.slice: No such file or directory
[  OK  ] Created slice system-getty.slice.

Failed to create compat systemd cgroup /system.slice/system-modprobe.slice: No such file or directory
[  OK  ] Created slice system-modprobe.slice.

Failed to create compat systemd cgroup /user.slice: No such file or directory
[  OK  ] Created slice User and Session Slice.
[  OK  ] Started Dispatch Password Requests to Console Directory Watch.
[  OK  ] Started Forward Password Requests to Wall Directory Watch.

This first warning is coming from one of the same areas of code I linked in
my first email:
https://github.com/systemd/systemd/blob/v245/src/core/cgroup.c#L2967.

I see the same thing with '--cap-add sys_admin' instead of '--privileged',
and again seen with both docker and podman.

Thanks,
Lewis

On Tue, 10 Jan 2023 at 15:28, Lewis Gaul  wrote:

> I'm aware of the higher level of collaboration between podman and systemd
> compared to docker, hence primarily raising this issue from a podman angle.
>
> In privileged mode all mounts are read-write, so yes the container has
> write access to the cgroup filesystem. (Podman also ensures write access to
> the systemd cgroup subsystem mount in non-privileged mode by default).
>
> On first boot PID 1 can be found in
> /sys/fs/cgroup/systemd/machine.slice/libpod-.scope/init.scope/cgroup.procs,
> whereas when the container restarts the 'init.scope/' directory does not
> exist and PID 1 is instead found in the parent (container root) cgroup
> /sys/fs/cgroup/systemd/machine.slice/libpod-.scope/cgroup.procs
> (also reflected by /proc/1/cgroup). This is strange because systemd must be
> the one to create this cgroup dir in the initial boot, so I'm not sure why
> it wouldn't on subsequent boot?
>
> I can confirm that the container has permissions since executing a 'mkdir'
> in /sys/fs/cgroup/systemd/machine.slice/libpod-.scope/ inside the
> container succeeds after the restart, so I have no idea why systemd is not
> creating the 'init.scope/' dir. I notice that inside the container's
> systemd cgroup mount 'system.slice/' does exist, but 'user.slice/' also
> does not (both exist on normal boot). Is there any way I can find systemd
> logs that might indicate why the cgroup dir creation is failing?
>
> One final datapoint: the same is seen when using a private cgroup
> namespace (via 'podman run --cgroupns=private'), although then the error is
> then, as expected, "Failed to attach 1 to compat systemd cgroup
> /init.scope: No such file or directory".
>
> I could raise this with the podman team, but it seems more in the systemd
> area given it's a systemd warning and I would expect systemd to be creating
> this cgroup dir?
>
> Thanks,
> Lewis
>
> On Tue, 10 Jan 2023 at 14:48, Lennart Poettering 
> wrote:
>
>> On Di, 10.01.23 13:18, Lewis Gaul (lewis.g...@gmail.com) wrote:
>>
>> > Following 'setenforce 0' I still see the same issue (I was also
>> suspecting
>> > SELinux!).
>> >
>> > A few additional data points:
>> > - this was not seen when using systemd v230 inside the container
>> > - this is also seen on CentOS 8.4
>> > - this is seen under docker even if the container's cgroup driver is
>> > changed from 'cgroupfs' to 'systemd'
>>
>> docker is garbage. They are hostile towards running systemd inside
>> containers.
>>
>> podman upstream is a lot friendly, and apparently what everyone in OCI
>> is going towards these days.
>>
>> I have not much experience with podman though, and in particular not
>> old versions. Next step would probably be to look at what precisely
>> causes the permission issue, via strace.
>>
>> but did you make sure your container actually gets write access to the
>> cgroup trees?
>>
>> anyway, i'd recommend asking the podman community for help about this.
>>
>> Lennart
>>
>> --
>> Lennart Poettering, Berlin
>>
>


Re: [systemd-devel] Container restart issue: Failed to attach 1 to compat systemd cgroup

2023-01-12 Thread Lewis Gaul
Hey Michal,

Thanks for the reply.

> I'd suggest looking at debug level logs from the hosts systemd around the
time of the container restart.

Could you suggest commands to run to do this?

> What is the host's systemd version and cgroup mode
(legacy,hybrid,unified)? (I'm not sure what the distros in your original
message referred to.)

The issue has been seen with CentOS 8.2 and 8.4 as the host distro, but
not on Ubuntu 20.04. The former has systemd v239 and appears to be in
'legacy' cgroup mode (no /sys/fs/cgroup/unified cgroup2 mount), whereas
the latter has systemd v245 and is in what I believe you'd refer to as
'hybrid' mode (with the /sys/fs/cgroup/unified cgroup2 mount).

Should we be suspicious of the host systemd version and/or the fact that
the host is in 'legacy' mode while the container (based on the systemd
version being higher) is in 'hybrid' mode? Maybe we should try telling the
container systemd to run in 'legacy' mode somehow?

Thanks,
Lewis

On Thu, 12 Jan 2023 at 13:12, Michal Koutný  wrote:

> Hello.
>
> On Tue, Jan 10, 2023 at 03:28:04PM +, Lewis Gaul 
> wrote:
> > I can confirm that the container has permissions since executing a
> 'mkdir'
> > in /sys/fs/cgroup/systemd/machine.slice/libpod-.scope/ inside the
> > container succeeds after the restart, so I have no idea why systemd is
> not
> > creating the 'init.scope/' dir.
>
> It looks like it could also be a race/deferred impact from host's systemd.
>
> > I notice that inside the container's systemd cgroup mount
> > 'system.slice/' does exist, but 'user.slice/' also does not (both
> > exist on normal boot). Is there any way I can find systemd logs that
> > might indicate why the cgroup dir creation is failing?
>
> I'd suggest looking at debug level logs from the hosts systemd around
> the time of the container restart.
>
>
> > I could raise this with the podman team, but it seems more in the systemd
> > area given it's a systemd warning and I would expect systemd to be
> creating
> > this cgroup dir?
>
> What is the host's systemd version and cgroup mode
> (legacy,hybrid,unified)? (I'm not sure what the distros in your original
> message referred to.)
>
>
> Thanks,
> Michal
>


Re: [systemd-devel] Container restart issue: Failed to attach 1 to compat systemd cgroup

2023-01-12 Thread Lewis Gaul
Another data point: I can reproduce this on an Ubuntu 18.04 host, which
has systemd v237 in *hybrid* cgroup mode (assuming I've understood the
definition of hybrid, as per my previous email). So it's looking like it
might be an interoperation issue between the host and container systemd,
introduced somewhere between v239 and v245 for the host systemd when the
container is running v245 (also seen with v244 and v249).

Thanks,
Lewis

On Thu, 12 Jan 2023 at 15:31, Lewis Gaul  wrote:

> Hey Michal,
>
> Thanks for the reply.
>
> > I'd suggest looking at debug level logs from the hosts systemd around
> the time of the container restart.
>
> Could you suggest commands to run to do this?
>
> > What is the host's systemd version and cgroup mode
> (legacy,hybrid,unified)? (I'm not sure what the distros in your original
> message referred to.)
>
> The issue has been seen on Centos 8.2 and 8.4 host distro, but not seen on
> Ubuntu 20.04. The former has systemd v239 and appears to be in 'legacy'
> cgroup mode (no /sys/fs/cgroup/unified cgroup2 mount), whereas the latter
> has systemd v245 and is in what I believe you'd refer to as 'hybrid' mode
> (with the /sys/fs/cgroup/unified cgroup2 mount).
>
> Should we be suspicious of the host systemd version and/or the fact that
> the host is in 'legacy' mode while the container (based on the systemd
> version being higher) is in 'hybrid' mode? Maybe we should try telling the
> container systemd to run in 'legacy' mode somehow?
>
> Thanks,
> Lewis
>
> On Thu, 12 Jan 2023 at 13:12, Michal Koutný  wrote:
>
>> Hello.
>>
>> On Tue, Jan 10, 2023 at 03:28:04PM +, Lewis Gaul <
>> lewis.g...@gmail.com> wrote:
>> > I can confirm that the container has permissions since executing a
>> 'mkdir'
>> > in /sys/fs/cgroup/systemd/machine.slice/libpod-.scope/ inside
>> the
>> > container succeeds after the restart, so I have no idea why systemd is
>> not
>> > creating the 'init.scope/' dir.
>>
>> It looks like it could also be a race/deferred impact from host's systemd.
>>
>> > I notice that inside the container's systemd cgroup mount
>> > 'system.slice/' does exist, but 'user.slice/' also does not (both
>> > exist on normal boot). Is there any way I can find systemd logs that
>> > might indicate why the cgroup dir creation is failing?
>>
>> I'd suggest looking at debug level logs from the hosts systemd around
>> the time of the container restart.
>>
>>
>> > I could raise this with the podman team, but it seems more in the
>> systemd
>> > area given it's a systemd warning and I would expect systemd to be
>> creating
>> > this cgroup dir?
>>
>> What is the host's systemd version and cgroup mode
>> (legacy,hybrid,unified)? (I'm not sure what the distros in your original
>> message referred to.)
>>
>>
>> Thanks,
>> Michal
>>
>


Re: [systemd-devel] Feedback sought: can we drop cgroupv1 support soon?

2023-07-19 Thread Lewis Gaul
Hi Lennart, all,

TL;DR: A container making use of cgroup controllers must use the same
cgroup version as the host. For a systemd container running on an
arbitrary host, a lack of cgroup v1 support in systemd would therefore
place a cgroup v2 requirement on the host, which is an undesirable
property for a container.

I can totally understand the desire to simplify the codebase/support
matrix, and appreciate this response is coming quite late (almost a year
since cgroups v1 was noted as a future deprecation in systemd). However, I
wanted to share a use-case/argument for keeping cgroups v1 support a little
longer in case it may impact the decision at all.

At my $work we provide a container image to customers, where the container
runs using systemd as the init system. The end-user has some freedom on
how/where to run this container, e.g. using docker/podman on a host of
their choice, or in Kubernetes (e.g. EKS in AWS).

Of course there are bounds on what we officially support, but generally we
would like to support recent LTS releases of major distros, currently
including Ubuntu 20.04, Ubuntu 22.04, RHEL 8, RHEL 9, Amazon Linux 2 (EKS
doesn’t yet support Amazon Linux 2023). Of these, only Ubuntu 22.04 and
RHEL 9 have switched to using cgroups v2 by default, and we are not in a
position to require the end-user to reconfigure their host to enable
running our container. What’s more, since we make use of cgroup controllers
inside the container, we cannot have cgroup v1 controllers enabled on the
host while attempting to use cgroups v2 inside the container.
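
As an illustration, the check for "which setup is this host running"
amounts to classifying the filesystem type mounted at /sys/fs/cgroup (a
sketch; the labels are my own shorthand):

```shell
#!/bin/sh
# Sketch: classify the host's cgroup setup from the filesystem type mounted
# at /sys/fs/cgroup. On v2-only hosts this is cgroup2fs; on legacy/hybrid
# hosts systemd mounts a tmpfs there holding the per-controller mounts.
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "unified" ;;
        tmpfs)     echo "legacy-or-hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# usage: cgroup_mode "$(stat -fc %T /sys/fs/cgroup)"
```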

> Because of that I see no reason why old systemd cgroupv1 payloads
> shouldn#t just work on cgroupv2 hosts: as long as you give them a
> pre-set-up cgroupv1 environemnt, and nothing stops you from doing
> that. In fact, this is something we even documented somewhere: what to
> do if the host only does a subset of the cgroup stuff you want, and
> what you have to do to set up the other stuff (i.e. if host doesn't
> manage your hierarchy of choice, but only others, just follow the same
> structure in the other hierarchy, and clean up after yourself). This
> is what nspawn does: if host is cgroupv2 only it will set up
> name=systemd hierarchy in cgroupv1 itself, and pass that to the
> container.

I don't think this works for us since we need the full cgroup
(v1/v2) filesystem available in the container, with controllers enabled.

This means that we must, for now, continue to support cgroups v1 in our
container image. If systemd were to drop support for cgroups v1 then we may
find ourselves in an awkward position of not being able to upgrade to this
new systemd version, or be forced to pass this restriction on to end-users.
The reason we’re uncomfortable about insisting on the use of cgroups v2 is
that as a container app we ideally wouldn’t place such requirements on the
host.

So, while it's true that the container ecosystem does now largely support
cgroups v2, there is still an aspect of caring about what the host is
running, which from our perspective should be assumed to be the default
configuration for the chosen distro. With this in mind, we'd ideally like
systemd to support cgroups v1 a little longer than the end of this year.

Does this make sense as a use-case and motivation for wanting new systemd
versions to continue supporting cgroups v1? Of course not forever, but
until there are fewer hosts out there using cgroups v1.

Best wishes,
Lewis

On Fri, 22 Jul 2022 at 11:15, Lennart Poettering 
wrote:

> On Do, 21.07.22 16:24, Stéphane Graber (stgra...@ubuntu.com) wrote:
>
> > Hey there,
> >
> > I believe Christian may have relayed some of this already but on my
> > side, as much as I can sympathize with the annoyance of having to
> > support both cgroup1 and cgroup2 side by side, I feel that we're sadly
> > nowhere near the cut off point.
> >
> > From what I can gather from various stats we have, over 90% of LXD
> > users are still on distributions relying on CGroup1.
> > That's because most of them are using LTS releases of server
> > distributions and those only somewhat recently made the jump to
> > cgroup2:
> >  - RHEL 9 in May 2022
> >  - Ubuntu 22.04 LTS in April 2022
> >  - Debian 11 in August 2021
> >
> > OpenSUSE is still on cgroup1 by default in 15.4 for some reason.
> > All this is also excluding our two largest users, Chromebooks and QNAP
> > NASes, neither of them made the switch yet.
>
> At some point I feel no sympathy there. If google/qnap/suse still are
> stuck in cgroupv1 land, then that's on them, we shouldn't allow
> ourselves to be held hostage by that.
>
> I mean, that Google isn't forward looking in these things is well
> known, but I am a bit surprised SUSE is still so far back.
>
> > I honestly wouldn't be holding deprecating cgroup1 on waiting for
> > those few to wake up and transition.
> > Both ChromeOS and QNAP can very quickly roll it out to all their users
> > should they want to.
> > It's a bit trickie

Re: [systemd-devel] Feedback sought: can we drop cgroupv1 support soon?

2023-07-19 Thread Lewis Gaul
Hi Luca,

> All the distributions you quoted above support cgroupv2 to the best of
> my knowledge, it simply has to be enabled at boot. Why isn't that
> sufficient?

As I said in my previous email:

> in the case of it being a systemd container on an arbitrary host then a
lack of cgroup v1 support from systemd would place a cgroup v2 requirement
on the host, which is an undesirable property of a container.

and

> we are not in a position to require the end-user to reconfigure their
host to enable running our container.

Regards,
Lewis

On Wed, 19 Jul 2023 at 11:35, Luca Boccassi  wrote:

> On Wed, 19 Jul 2023 at 11:30, Lewis Gaul  wrote:
> >
> > Hi Lennart, all,
> >
> > TL;DR: A container making use of cgroup controllers must use the same
> cgroup version as the host, and in the case of it being a systemd container
> on an arbitrary host then a lack of cgroup v1 support from systemd would
> place a cgroup v2 requirement on the host, which is an undesirable property
> of a container.
> >
> > I can totally understand the desire to simplify the codebase/support
> matrix, and appreciate this response is coming quite late (almost a year
> since cgroups v1 was noted as a future deprecation in systemd). However, I
> wanted to share a use-case/argument for keeping cgroups v1 support a little
> longer in case it may impact the decision at all.
> >
> > At my $work we provide a container image to customers, where the
> container runs using systemd as the init system. The end-user has some
> freedom on how/where to run this container, e.g. using docker/podman on a
> host of their choice, or in Kubernetes (e.g. EKS in AWS).
> >
> > Of course there are bounds on what we officially support, but generally
> we would like to support recent LTS releases of major distros, currently
> including Ubuntu 20.04, Ubuntu 22.04, RHEL 8, RHEL 9, Amazon Linux 2 (EKS
> doesn’t yet support Amazon Linux 2023). Of these, only Ubuntu 22.04 and
> RHEL 9 have switched to using cgroups v2 by default, and we are not in a
> position to require the end-user to reconfigure their host to enable
> running our container. What’s more, since we make use of cgroup controllers
> inside the container, we cannot have cgroup v1 controllers enabled on the
> host while attempting to use cgroups v2 inside the container.
> >
> > > Because of that I see no reason why old systemd cgroupv1 payloads
> > > shouldn't just work on cgroupv2 hosts: as long as you give them a
> > > pre-set-up cgroupv1 environment, and nothing stops you from doing
> > > that. In fact, this is something we even documented somewhere: what to
> > > do if the host only does a subset of the cgroup stuff you want, and
> > > what you have to do to set up the other stuff (i.e. if host doesn't
> > > manage your hierarchy of choice, but only others, just follow the same
> > > structure in the other hierarchy, and clean up after yourself). This
> > > is what nspawn does: if host is cgroupv2 only it will set up
> > > name=systemd hierarchy in cgroupv1 itself, and pass that to the
> > > container.
> >
> > I don't think this works for us since we need the full cgroup (v1/v2)
> filesystem available in the container, with controllers enabled.
> >
> > This means that we must, for now, continue to support cgroups v1 in our
> container image. If systemd were to drop support for cgroups v1 then we may
> find ourselves in an awkward position of not being able to upgrade to this
> new systemd version, or be forced to pass this restriction on to end-users.
> The reason we’re uncomfortable about insisting on the use of cgroups v2 is
> that as a container app we ideally wouldn’t place such requirements on the
> host.
> >
> > So, while it's true that the container ecosystem does now largely
> support cgroups v2, there is still an aspect of caring about what the host
> is running, which from our perspective should be assumed to be the
> default configuration for the chosen distro. With this in mind, we’d
> ideally like to have systemd support cgroups v1 a little longer than the
> end of this year.
> >
> > Does this make sense as a use-case and motivation for wanting new
> systemd versions to continue supporting cgroups v1? Of course not forever,
> but until there are fewer hosts out there using cgroups v1.
>
> All the distributions you quoted above support cgroupv2 to the best of
> my knowledge, it simply has to be enabled at boot. Why isn't that
> sufficient?
>
> Kind regards,
> Luca Boccassi
>


Re: [systemd-devel] Feedback sought: can we drop cgroupv1 support soon?

2023-08-18 Thread Lewis Gaul
> What's stopping you from mounting a private "named" cgroup v1
> hierarchy to such containers (i.e. no controllers). systemd will then
> use that when taking over and not bother with mounting anything on its
> own, such as a cgroupv2 tree.

We specifically want to be able to make use of cgroup controllers within
the container. One example of this would be to use "MemoryLimit" (cgroupv1)
for a systemd unit (I understand this is deprecated in the latest versions
of systemd, but as far as I can see we wouldn't be able to use the cgroupv2
"MemoryMax" config in this scenario anyway).

> You are doing something half broken and
> outside of the intended model already, I am not sure we need to go the
> extra mile to support this for longer.

I'm slightly surprised and disheartened by this viewpoint. I have paid
close attention to https://systemd.io/CONTAINER_INTERFACE/ and
https://systemd.io/CGROUP_DELEGATION/, and I'd interpreted those documents
as saying that running systemd in a container should be fully supported
(not only on cgroups v2, at least using recent-but-not-latest systemd
versions).

In particular, the following:

"Note that it is our intention to make systemd systems work flawlessly and
out-of-the-box in containers. In fact, we are interested to ensure that the
same OS image can be booted on a bare system, in a VM and in a container,
and behave correctly each time. If you notice that some component in
systemd does not work in a container as it should, even though the
container manager implements everything documented above, please contact
us."

"When systemd runs as container payload it will make use of all hierarchies
it has write access to. For legacy mode you need to make at least
/sys/fs/cgroup/systemd/ available, all other hierarchies are optional."

I note that point 6 under "Some Don'ts" does correlate with what you're
saying:
"Think twice before delegating cgroup v1 controllers to less privileged
containers. It’s not safe, you basically allow your containers to freeze
the system with that and worse."
However, in our case we're talking about a privileged container, so this
doesn't really apply.

I think there's a definite use-case here; unfortunately, when systemd
drops support for cgroups v1 this will just mean we'll be unable to
upgrade the container's systemd version until all relevant hosts use
cgroups v2 by default (probably a couple of years away).

Thanks for your time,
Lewis

On Mon, 7 Aug 2023 at 17:26, Lennart Poettering 
wrote:

> On Do, 20.07.23 01:59, Dimitri John Ledkov (dimitri.led...@canonical.com)
> wrote:
>
> > Some deployments that switch back their modern v2 host to hybrid or v1,
> are
> > the ones that need to run old workloads that contain old systemd. Said
> old
> > systemd only has experimental incomplete v2 support that doesn't work
> with
> > v2-only (the one before current stable magic mount value).
>
> What's stopping you from mounting a private "named" cgroup v1
> hierarchy to such containers (i.e. no controllers). systemd will then
> use that when taking over and not bother with mounting anything on its
> own, such as a cgroupv2 tree.
>
> that should be enough to make old systemd happy.
>
> Lennart
>
> --
> Lennart Poettering, Berlin
>


[systemd-devel] Systemd cgroup setup issue in containers

2023-09-29 Thread Lewis Gaul
Hi systemd team,

I've encountered an issue when running systemd inside a container using
cgroups v2, where if a container exec process is created at the wrong
moment during early startup then systemd will fail to move all processes
into a child cgroup, and therefore fail to enable controllers due to the
"no internal processes" rule introduced in cgroups v2. In other words, a
systemd container is started and very soon after a process is created via
e.g. 'podman exec systemd-ctr cmd', where the exec process is placed in the
container's namespaces (although not a child of the container's PID 1).
This is not a totally crazy thing to be doing - this was hit when testing a
systemd container, using a container exec "probe" to check when the
container is ready.

More precisely, the problem manifests as follows (in
https://github.com/systemd/systemd/blob/081c50ed3cc081278d15c03ea54487bd5bebc812/src/core/cgroup.c#L3676
):
- Container exec processes are placed in the container's root cgroup by
default, but if this fails (due to the "no internal processes" rule) then
container PID 1's cgroup is used (see
https://github.com/opencontainers/runc/issues/2356).
- At systemd startup, systemd tries to create the init.scope cgroup and
move all processes into it.
- If a container exec process is created after finding procs to move and
moving them but before enabling controllers then the exec process will be
placed in the root cgroup.
- When systemd then tries to enable controllers via subtree_control in the
container's root cgroup, this fails because the exec process is in that
cgroup.

The root of the problem here is that moving processes out of a cgroup and
enabling controllers (such that new processes cannot be created there) is
not an atomic operation, meaning there's a window where a new process can
get in the way. One possible solution/workaround in systemd would be to
retry under this condition. Or perhaps this should be considered a bug in
the container runtimes?
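To make the window concrete, here is a toy model of the sequence - plain Python sets standing in for cgroup.procs files, nothing touching real cgroupfs, and all names invented - showing why a late-arriving exec process defeats the subtree_control write, and how a bounded retry would paper over it:

```python
"""Toy model of the race described above: moving processes out of the
root cgroup and enabling controllers are two separate steps, so a process
landing in the root cgroup in between makes the subtree_control write
fail (the "no internal processes" rule).  This is a simulation, not
systemd's actual code."""


class ToyCgroup:
    def __init__(self):
        self.root_procs = {1}       # container PID 1
        self.child_procs = set()    # stands in for init.scope
        self.controllers_enabled = False

    def move_all_to_child(self):
        # Stands in for writing each PID into init.scope/cgroup.procs.
        self.child_procs |= self.root_procs
        self.root_procs.clear()

    def enable_controllers(self):
        # Stands in for writing "+memory +pids" to cgroup.subtree_control;
        # the kernel rejects it while the cgroup has member processes.
        if self.root_procs:
            raise OSError("EBUSY: no internal processes rule")
        self.controllers_enabled = True


def startup_with_retry(cg, interleaved_exec_pid=None, retries=3):
    """Sketch of the suggested workaround: retry move+enable a few times."""
    for _ in range(retries):
        cg.move_all_to_child()
        if interleaved_exec_pid is not None:
            # A 'podman exec' process lands in the root cgroup right
            # between the move and the subtree_control write.
            cg.root_procs.add(interleaved_exec_pid)
            interleaved_exec_pid = None  # it only races once in this model
        try:
            cg.enable_controllers()
            return True
        except OSError:
            continue
    return False
```

In this model a single-attempt startup fails exactly as observed in the test output below, while one extra retry sweeps the straggler into the child cgroup and succeeds.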

I have some tests exercising systemd containers at
https://github.com/LewisGaul/systemd-containers which are able to reproduce
this issue on a cgroups v2 host (in testcase
tests/test_exec_procs.py::test_exec_proc_spam):

(venv) root@ubuntu:~/systemd-containers# pytest --log-cli-level debug -k
exec_proc_spam --cgroupns private --setup-modes default --container-exe
podman
INFO tests.conftest:conftest.py:474 Running container image
localhost/ubuntu-systemd:20.04 with args: entrypoint=, command=['bash',
'-c', 'sleep 1 && exec /sbin/init'], cap_add=['sys_admin'], systemd=always,
tty=True, interactive=True, detach=True, remove=False, cgroupns=private,
name=systemd-tests-1695981045.12
DEBUG    tests.test_exec_procs:test_exec_procs.py:106 Got PID 1 cgroups:
0::/init.scope
DEBUG    tests.test_exec_procs:test_exec_procs.py:111 Got exec proc 3 cgroups:
0::/init.scope
DEBUG    tests.test_exec_procs:test_exec_procs.py:111 Got exec proc 21 cgroups:
0::/
DEBUG    tests.test_exec_procs:test_exec_procs.py:114 Enabled controllers:
set()
==================== short test summary info ====================
FAILED tests/test_exec_procs.py::test_exec_proc_spam[private-unified-default] - AssertionError: assert set() >= {'memory', 'pids'}

Does anyone have any thoughts on this? Should this be considered a systemd
bug, or is it at least worth adding explicit handling for this condition?
Is there something container runtimes are doing wrong here from the
perspective of systemd?

Thanks,
Lewis


Re: [systemd-devel] Systemd cgroup setup issue in containers

2023-09-29 Thread Lewis Gaul
>  Wouldn't it be better to have the container inform the host via
NOTIFY_SOCKET (the Type=notify mechanism)? I believe systemd has had
support for sending readiness notifications from init to a container
manager for quite a while.

> Use the notify socket and you'll get a notification back when the
container is ready, without having to inject anything

To be clear, I'm not looking for alternative solutions for my specific
example, I was raising the general architectural issue.
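(That said, for anyone reading along: the readiness mechanism suggested above is just a datagram on a unix socket. A minimal sketch of both ends follows - socket path and function names are illustrative, but the "READY=1" payload is what sd_notify(3) sends.)

```python
"""Minimal sketch of the sd_notify readiness protocol mentioned above:
the container manager listens on a unix datagram socket and exports its
path as NOTIFY_SOCKET into the payload; the payload's init sends
"READY=1" when it considers itself up.  Names/paths are illustrative."""
import socket


def make_manager_socket(path):
    """Manager side: bind the datagram socket the payload will notify."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.bind(path)
    return sock


def sd_notify_ready(notify_socket_path):
    """Payload side: roughly what sd_notify(0, "READY=1") does."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"READY=1", notify_socket_path)
```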

On Fri, 29 Sept 2023 at 12:06, Luca Boccassi 
wrote:

> On Fri, 29 Sept 2023 at 12:00, Lewis Gaul  wrote:
> >
> > Hi systemd team,
> >
> > I've encountered an issue when running systemd inside a container using
> cgroups v2, where if a container exec process is created at the wrong
> moment during early startup then systemd will fail to move all processes
> into a child cgroup, and therefore fail to enable controllers due to the
> "no internal processes" rule introduced in cgroups v2. In other words, a
> systemd container is started and very soon after a process is created via
> e.g. 'podman exec systemd-ctr cmd', where the exec process is placed in the
> container's namespaces (although not a child of the container's PID 1).
> This is not a totally crazy thing to be doing - this was hit when testing a
> systemd container, using a container exec "probe" to check when the
> container is ready.
>
> Use the notify socket and you'll get a notification back when the
> container is ready, without having to inject anything
>