Re: [systemd-devel] Antw: Re: [systemd-devel] [EXT] Proposal to extend os-release/machine-info with field PREFER_HARDENED_CONFIG
To what extent a machine is locked down is a policy choice. There are already plenty of tools for managing policy, so this really doesn't belong here. If you want to ensure that your fleet of machines is locked down through something like PREFER_HARDENED_CONFIG=1, you are going to need tools to manage *that* anyway, so why not use the same tool(s) to manage the machines themselves?

> And what exactly should it do?

I'm sorry, but what is "it" in this context?

> Also: Do you really believe in "one size fits all" security-wise?

Of course not. I think distributions should provide sane defaults, and everything else is a policy decision that whoever is responsible for a particular machine would then implement using one of the many tools that already exist.

> If (at all), then the parameter should be "SECURITY_POLICY=name" (where
> name is one of the predefined policies).

One of the ideas behind the systemd project was to provide plumbing for all distributions, giving some level of standardization so that each distribution doesn't have to reinvent the wheel. Introducing something like SECURITY_POLICY=woot, which would inevitably mean different things from distribution to distribution, and even from package to package within a distribution, doesn't seem like it would further that goal.

> And most of all, selecting a different policy does not make it a
> different OS.

For sure, but I don't quite see which point you're trying to make.
[systemd-devel] DeviceAllow=/dev/net/tun in systemd-nspawn@.service has no effect
Hello,

Just out of curiosity, I commented out "DeviceAllow=/dev/net/tun rwm" in systemd-nspawn@.service and tried running a container. I expected this to fail, but it did not. copy_devnodes() in src/nspawn/nspawn.c calls mknod() on /dev/net/tun, so I expected EPERM because the "DeviceAllow=/dev/net/tun rwm" line no longer exists. But /dev/net/tun was created and systemd-nspawn did not fail. Does DeviceAllow= not apply to child processes spawned via raw_clone(SIGCHLD|CLONE_NEWNS), or is there some other reason?

I'm using Arch Linux; the kernel is 5.16.10 and systemd is 250.3. Here is the output. I also commented out "DeviceAllow=char-pts rw" and it still didn't fail:

sh-5.1# tail -n 20 /usr/lib/systemd/system/systemd-nspawn\@.service
TasksMax=16384
WatchdogSec=3min
DevicePolicy=closed
#DeviceAllow=/dev/net/tun rwm
#DeviceAllow=char-pts rw
# nspawn itself needs access to /dev/loop-control and /dev/loop, to implement
# the --image= option. Add these here, too.
DeviceAllow=/dev/loop-control rw
DeviceAllow=block-loop rw
DeviceAllow=block-blkext rw
# nspawn can set up LUKS encrypted loopback files, in which case it needs
# access to /dev/mapper/control and the block devices /dev/mapper/*.
DeviceAllow=/dev/mapper/control rw
DeviceAllow=block-device-mapper rw

[Install]
WantedBy=machines.target

sh-5.1# systemctl start systemd-nspawn@test
sh-5.1# machinectl
MACHINE CLASS     SERVICE        OS   VERSION ADDRESSES
test    container systemd-nspawn arch -       -

1 machines listed.

sh-5.1# machinectl shell test
Connected to machine test. Press ^] three times within 1s to exit session.
[root@test ~]# ls -l /dev/net/tun
crw-rw-rw- 1 root root 10, 200 Feb 20 05:13 /dev/net/tun

Regards,
Gibeom Gwon
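As an aside, for experiments like this it is usually cleaner to override the shipped unit with a drop-in rather than editing the file under /usr/lib/systemd/system (the drop-in path and file name below are just an example):

```ini
# /etc/systemd/system/systemd-nspawn@.service.d/10-devices.conf (example name)
[Service]
# An empty assignment clears the inherited DeviceAllow= list; then re-add
# only the entries you want to test with.
DeviceAllow=
DeviceAllow=/dev/loop-control rw
DeviceAllow=block-loop rw
```

After creating the drop-in, `systemctl daemon-reload` makes it take effect, and `systemctl show -p DeviceAllow systemd-nspawn@test.service` shows the effective list.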
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
Am Montag, dem 21.02.2022 um 22:16 schrieb Felip Moll:
> Silvio,
>
> As I commented in my previous post, creating every single job in a
> separate slice is an overhead I cannot assume.
> An HTC system could run thousands of jobs per second, and doing extra
> fork+execs plus waiting for systemd to fill up its internal
> structures and manage it all is a no-no.

And how about an xinetd-style daemon, accepting connections and spawning processes that way? Instead of sgamba1.service you would have an sgamba1@.service template and an sgamba1.socket, spawning sgamba1@user1.service, sgamba1@user2.service, etc. units. That way, even if one user's process dies, nothing else dies, and the setup overhead is paid only once each time a user creates a new connection. So they can still drop their one million jobs and you still have per-user isolation.
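The socket-activated setup suggested above could be sketched roughly like this; the unit names follow the example in the mail, while the socket path and worker binary are made-up placeholders. Note that with Accept=yes the instance name is derived from the connection (sequence number/peer address), not directly from a user name:

```ini
# sgamba1.socket -- spawns one sgamba1@<instance>.service per accepted connection
[Socket]
ListenStream=/run/sgamba1.sock
Accept=yes

[Install]
WantedBy=sockets.target
```

```ini
# sgamba1@.service -- per-connection worker template
[Unit]
Description=sgamba1 worker for connection %i

[Service]
ExecStart=/usr/bin/sgamba1-worker
StandardInput=socket
```

Each accepted connection then gets its own service unit, and therefore its own cgroup, so resource-control settings in the template apply per connection.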
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
> On 21 Feb 2022, at 21:16, Felip Moll wrote:
>
>> You could invoke a man:systemd-run for each new process. Then you can
>> put every single job in a separate .slice with its own
>> man:systemd.resource-control settings applied.
>> This would also mean that you don't need to compile against libsystemd.
>> Just exec() accordingly if a systemd system is detected.
>>
>> BR
>> Silvio
>
> Silvio,
>
> As I commented in my previous post, creating every single job in a separate
> slice is an overhead I cannot assume.
> An HTC system could run thousands of jobs per second, and doing extra
> fork+execs plus waiting for systemd to fill up its internal structures and
> manage it all is a no-no.

Are you assuming this or did you measure the cost?

Barry

> One other option that I am thinking about is extending the parameters of a
> unit file, for example adding a DelegateCgroupLeaf=yes option.
>
> DelegateCgroupLeaf=. If set to yes, an extra directory will be created
> inside the unit's cgroup to place the newly spawned service process. This is
> useful for services that need to be restarted while their forked pids remain
> in the cgroup and the service cgroup is no longer a leaf. This option is
> only valid when using Delegate=yes and on a system in unified mode.
>
> E.g. in my example, that would end up like this:
>
> /sys/fs/cgroup/system.slice/sgamba1.service  <-- Delegate=yes DelegateCgroupLeaf=yes
> ├── sgamba1  <-- The spawned process would always be put here by systemd.
> ├── user1_stuff
> ├── user2_stuff
> └── user3_stuff
>
> I think this idea could work for cases like the one exposed here, and I see
> this would be quite useful.
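To put rough numbers on the cost question, a trivial micro-benchmark of a bare fork+exec+wait cycle (the baseline that any per-job systemd-run round trip would add to) can be sketched like this; absolute numbers are of course machine-dependent:

```python
import os
import time

def avg_spawn_seconds(n=200, argv=("/bin/true",)):
    """Average wall-clock cost of one fork+exec+wait cycle over n runs."""
    start = time.monotonic()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os.execv(argv[0], argv)  # child replaces itself with the payload
        os.waitpid(pid, 0)           # parent waits for the child to exit
    return (time.monotonic() - start) / n

if __name__ == "__main__":
    print(f"average fork+exec: {avg_spawn_seconds() * 1e3:.3f} ms")
```

Comparing this against the same loop invoking systemd-run would show how much of the per-job overhead actually comes from systemd's bookkeeping rather than from process creation itself.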
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
> You could invoke a man:systemd-run for each new process. Then you can
> put every single job in a separate .slice with its own
> man:systemd.resource-control settings applied.
> This would also mean that you don't need to compile against libsystemd.
> Just exec() accordingly if a systemd system is detected.
>
> BR
> Silvio

Silvio,

As I commented in my previous post, creating every single job in a separate slice is an overhead I cannot assume. An HTC system could run thousands of jobs per second, and doing extra fork+execs plus waiting for systemd to fill up its internal structures and manage it all is a no-no.

One other option that I am thinking about is extending the parameters of a unit file, for example adding a DelegateCgroupLeaf=yes option.

DelegateCgroupLeaf=. If set to yes, an extra directory will be created inside the unit's cgroup to place the newly spawned service process. This is useful for services that need to be restarted while their forked pids remain in the cgroup and the service cgroup is no longer a leaf. This option is only valid when using Delegate=yes and on a system in unified mode.

E.g. in my example, that would end up like this:

/sys/fs/cgroup/system.slice/sgamba1.service  <-- Delegate=yes DelegateCgroupLeaf=yes
├── sgamba1  <-- The spawned process would always be put here by systemd.
├── user1_stuff
├── user2_stuff
└── user3_stuff

I think this idea could work for cases like the one exposed here, and I see this would be quite useful.
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
Am Montag, dem 21.02.2022 um 18:07 schrieb Felip Moll:
> The hard requirement that my project has is that processes need to
> live even if the daemon who forked them dies.
> Roughly it is how a batch scheduler works: one controller sends a
> request to my daemon for launching a process in the name of a user,
> my daemon fork-execs it. At some point my daemon can be stopped,
> restarted, upgraded, whatever, but the forked processes need to always
> be alive because they are continuing their work. We are talking here
> about the HPC world.

You could invoke a man:systemd-run for each new process. Then you can put every single job in a separate .slice with its own man:systemd.resource-control settings applied. This would also mean that you don't need to compile against libsystemd: just exec() accordingly if a systemd system is detected.

BR
Silvio
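As a concrete sketch of this suggestion, each job could be started as a transient unit; the slice name, resource limits, and payload binary below are made up for illustration:

```shell
# One transient unit per job, all grouped under a common slice.
# --scope keeps the payload a child of the calling daemon (so the daemon
# still controls its lifetime); dropping --scope would instead let systemd
# supervise the job as its own transient service.
systemd-run --scope --slice=jobs.slice \
    -p MemoryMax=512M -p CPUQuota=50% \
    -- /usr/bin/my-job --user user1
```

Because each job lands in its own unit (and therefore its own cgroup), a restart of the spawning daemon does not disturb the jobs' cgroups.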
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
> Hmm? Hard requirement of what? Not following?

The hard requirement that my project has is that processes need to live even if the daemon that forked them dies. Roughly, it is how a batch scheduler works: one controller sends a request to my daemon for launching a process on behalf of a user, and my daemon fork-execs it. At some point my daemon can be stopped, restarted, upgraded, whatever, but the forked processes need to stay alive because they are continuing their work. We are talking here about the HPC world.

> You are leaving processes around when your service dies/restarts?

Yes.

> That's a bad idea typically, and generally a hack: the unit should
> probably be split up differently, i.e. the processes that shall stick
> around on restart should probably be in their own unit, i.e. another
> service or scope unit.

So, if I understand correctly, you are suggesting that every forked process should be started through a new systemd unit? If that's the case it seems inconvenient, because we're talking about a job scheduler that may sometimes have thousands of forked processes executed quickly, and where performance is key. Having to manage a unit per process will probably not work in this situation in terms of performance.

The other option I can imagine is to start from my daemon a new unit of Type=forking, which remains forever until I decide to clean it up, even if it doesn't have any process inside. Then I could put my processes in the associated cgroup instead of inside the main daemon's cgroup. Would that make sense? The issue here is that to create the new unit I'd need my daemon to depend on systemd libraries, or to fork-exec systemd commands and parse their output. I am trying to keep the dependencies at a minimum and I'd love to have an alternative.

> That's not supported. You may only create your own cgroups where you
> turned on delegation, otherwise all bets are off. If you put stuff in
> /sys/fs/cgroup/user-stuff it's as if you placed stuff in systemd's
> "-.slice" without telling it, and things will break sooner or later,
> and often in non-obvious ways.

Yeah, I know and understand that it is not supported, but I am more interested in the technical details of how things would break. I see in systemd/src/core/cgroup.c that it often distinguishes a cgroup with delegation from one without (!unit_cgroup_delegate(u)), but it's hard for me to find out how or where exactly this will interfere with a cgroup created outside of systemd. I'd appreciate it if you could shed some light on why/when/where things will break in practice, or just give an example. I am also aware of the single-writer policy that systemd states in its documentation, and I know this is not supported, but I'd like to understand exactly what can happen.

Thanks for your help & time :)
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mo, 21.02.22 14:14, Felip Moll (lip...@gmail.com) wrote:

> Hello,
>
> I am creating a piece of software which consists of one daemon that forks
> several processes from user requests.
> This is basically acting like a job scheduler.
>
> The daemon is started using a unit file with the Delegate=yes option,
> because every process must be constrained differently. I manage my cgroup
> hierarchy, create some leaves in the tree and put each pid there.
> For example, after starting up the service and receiving 3 user requests, a
> tree under /sys/fs/cgroup/system.slice/ could look like:
>
> sgamba1.service/
> ├── daemon_pid
> ├── user1_stuff
> ├── user2_stuff
> └── user3_stuff
>
> I create the hierarchy and set cgroup.subtree_control in the root directory
> (sgamba1.service in the example) and everything runs smoothly, until
> I decide to restart my service.
>
> The service then cannot restart:
>
> feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed to attach to
> cgroup /system.slice/sgamba1.service: Device or resource busy
> feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed at step
> CGROUP spawning /path_to_bin/mydaemond: Device or resource busy
>
> This is because systemd tries to put the pid of the new daemon in
> sgamba1.service/cgroup.procs, and this would break the "no internal
> processes" rule of cgroup v2, since sgamba1.service is not a leaf anymore
> because it has subtree_control enabled for the user stuff.
>
> One hard requirement is that user stuff must live even if the service is
> restarted.

Hmm? Hard requirement of what? Not following?

You are leaving processes around when your service dies/restarts? That's typically a bad idea, and generally a hack: the unit should probably be split up differently, i.e. the processes that shall stick around on restart should probably be in their own unit, i.e. another service or scope unit.

> What's the way to achieve that? I see one easy way, which is to move user
> stuff into its own cgroup and out of sgamba1.service/, but then it will run
> outside a Delegate=yes unit. What can happen then?
> Will systemd eventually migrate my processes?
> How do services work around that issue?
> If I am moving user stuff into the root /sys/fs/cgroup/user_stuff/, will
> systemd touch my directories?

That's not supported. You may only create your own cgroups where you turned on delegation; otherwise all bets are off. If you put stuff in /sys/fs/cgroup/user-stuff it's as if you placed stuff in systemd's "-.slice" without telling it, and things will break sooner or later, often in non-obvious ways.

Lennart

--
Lennart Poettering, Berlin
[systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
Hello,

I am creating a piece of software which consists of one daemon that forks several processes from user requests. It basically acts like a job scheduler.

The daemon is started using a unit file with the Delegate=yes option, because every process must be constrained differently. I manage my cgroup hierarchy, create some leaves in the tree and put each pid there. For example, after starting up the service and receiving 3 user requests, a tree under /sys/fs/cgroup/system.slice/ could look like:

sgamba1.service/
├── daemon_pid
├── user1_stuff
├── user2_stuff
└── user3_stuff

I create the hierarchy and set cgroup.subtree_control in the root directory (sgamba1.service in the example) and everything runs smoothly, until I decide to restart my service.

The service then cannot restart:

feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed to attach to cgroup /system.slice/sgamba1.service: Device or resource busy
feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed at step CGROUP spawning /path_to_bin/mydaemond: Device or resource busy

This is because systemd tries to put the pid of the new daemon in sgamba1.service/cgroup.procs, and this would break the "no internal processes" rule of cgroup v2, since sgamba1.service is not a leaf anymore because it has subtree_control enabled for the user stuff.

One hard requirement is that user stuff must live even if the service is restarted. What's the way to achieve that? I see one easy way, which is to move user stuff into its own cgroup and out of sgamba1.service/, but then it will run outside a Delegate=yes unit. What can happen then? Will systemd eventually migrate my processes? How do services work around that issue? If I am moving user stuff into the root /sys/fs/cgroup/user_stuff/, will systemd touch my directories?

Thank you.

--
Felip Moll
E-Mail - lip...@gmail.com
Tlf. - +34 659 69 40 47
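The fork-into-a-leaf step described above can be sketched as follows. This is a minimal illustration under the assumption that leaf_dir lives below an already-delegated cgroup (e.g. .../sgamba1.service/user1_stuff on a real system); the error handling a real daemon would need is omitted:

```python
import os

def spawn_in_leaf(argv, leaf_dir):
    """Fork a child, move it into leaf_dir's cgroup.procs, then exec argv.

    leaf_dir is assumed to be a directory under the daemon's delegated
    cgroup; on cgroup v2, writing a pid to its cgroup.procs file moves
    that process into the cgroup.
    """
    os.makedirs(leaf_dir, exist_ok=True)
    pid = os.fork()
    if pid == 0:
        # Child: register itself in the leaf before exec'ing the payload,
        # so the payload starts life already constrained by that cgroup.
        with open(os.path.join(leaf_dir, "cgroup.procs"), "w") as f:
            f.write(str(os.getpid()))
        os.execv(argv[0], argv)
    return pid
```

Because the child moves itself before exec(), the daemon's own cgroup (the sgamba1.service root in the example) never holds the job's pid; it is exactly this layout that collides with systemd re-attaching the daemon on restart.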