Re: [systemd-devel] Antw: Re: [systemd-devel] [EXT] Proposal to extend os-release/machine-info with field PREFER_HARDENED_CONFIG

2022-02-21 Thread Peter Hoeg




To what extent a machine is locked down is a policy choice. There are
already loads of tools available to manage policy, so this really doesn't
belong here, and if you want to ensure that your fleet of machines is locked
down through something like PREFER_HARDENED_CONFIG=1, you're going to need
tools to manage *that* anyway. Why not then use the same tool(s) to simply
manage the machines?


And what exactly should it do?


I'm sorry, but what is "it" in this context?


Also: Do you really believe in "one size fits all" security-wise?


Of course not. I think distributions should provide sane defaults, and
everything else is a policy decision that whoever is responsible for a
particular machine would then implement using one of the many tools that
already exist.


If (at all), then the parameter should be "SECURITY_POLICY=name" (where name
is one of the predefined policies).


One of the ideas behind the systemd project was to provide common plumbing
for all distributions, offering some level of standardization so that each
distribution does not have to reinvent the wheel.

Introducing something like SECURITY_POLICY=woot, which would inevitably mean
different things from distribution to distribution and even from package to
package within a distribution, doesn't seem like it would further that goal.


And most of all, selecting a different policy does not make it a different OS.


For sure, but I don't quite see which point you're trying to make.


[systemd-devel] DeviceAllow=/dev/net/tun in systemd-nspawn@.service has no effect

2022-02-21 Thread Gibeom Gwon

Hello,

Just out of curiosity, I commented out DeviceAllow=/dev/net/tun rwm in
systemd-nspawn@.service and tried running it. I expected a failure, but it
did not fail.

copy_devnodes() in src/nspawn/nspawn.c calls mknod() on /dev/net/tun, so I
expected EPERM because DeviceAllow=/dev/net/tun rwm is no longer present.
But /dev/net/tun was created and systemd-nspawn did not fail.

Doesn't DeviceAllow= apply to child processes spawned via
raw_clone(SIGCHLD|CLONE_NEWNS), or is there some other reason?

I'm using Arch Linux with kernel 5.16.10 and systemd 250.3.

Here is the output. I also commented out
DeviceAllow=char-pts rw and it didn't fail:

sh-5.1# tail -n 20 /usr/lib/systemd/system/systemd-nspawn\@.service
TasksMax=16384
WatchdogSec=3min

DevicePolicy=closed
#DeviceAllow=/dev/net/tun rwm
#DeviceAllow=char-pts rw

# nspawn itself needs access to /dev/loop-control and /dev/loop, to implement
# the --image= option. Add these here, too.
DeviceAllow=/dev/loop-control rw
DeviceAllow=block-loop rw
DeviceAllow=block-blkext rw

# nspawn can set up LUKS encrypted loopback files, in which case it needs
# access to /dev/mapper/control and the block devices /dev/mapper/*.
DeviceAllow=/dev/mapper/control rw
DeviceAllow=block-device-mapper rw

[Install]
WantedBy=machines.target
sh-5.1# systemctl start systemd-nspawn@test
sh-5.1# machinectl
MACHINE CLASS     SERVICE        OS   VERSION ADDRESSES
test    container systemd-nspawn arch -       -

1 machines listed.
sh-5.1# machinectl shell test
Connected to machine test. Press ^] three times within 1s to exit session.
[root@test ~]# ls -l /dev/net/tun
crw-rw-rw- 1 root root 10, 200 Feb 20 05:13 /dev/net/tun
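
One way to double-check what the manager actually attached (a sketch; the
cgroup path assumes the unit's default machine.slice placement, and bpftool
must be installed):

# properties systemd believes are in effect for the unit
systemctl show systemd-nspawn@test.service -p DevicePolicy -p DeviceAllow

# BPF device-filter programs attached to the unit's cgroup (cgroup v2)
bpftool cgroup show /sys/fs/cgroup/machine.slice/systemd-nspawn@test.service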

Regards,
Gibeom Gwon


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Silvio Knizek
On Monday, 2022-02-21 at 22:16 +0100, Felip Moll wrote:
> Silvio,
>
> As I commented in my previous post, creating every single job in a
> separate slice is an overhead I cannot afford.
> An HTC system could run thousands of jobs per second, and doing extra
> fork+execs plus waiting for systemd to fill up its internal
> structures and manage it all is a no-no.
And how about an xinetd-style daemon, accepting connections and
spawning processes that way?
So instead of sgamba1.service you would have a sgamba1@.service and a
sgamba1.socket, spawning sgamba1@user1.service, sgamba1@user2.service,
etc. units.
So even if one user process dies, nothing else dies. And the setup
overhead would be incurred only once per new user connection.
So they can still drop their one million jobs and you still have user
isolation. A minimal sketch of that pattern follows below.
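
(Unit names, worker binary and socket path here are made up:)

# sgamba1.socket
[Socket]
ListenStream=/run/sgamba1.sock
Accept=yes

[Install]
WantedBy=sockets.target

# sgamba1@.service -- with Accept=yes, one instance is spawned per
# accepted connection, each in its own cgroup
[Service]
ExecStart=/usr/bin/sgamba1-worker
StandardInput=socket

Note that with Accept=yes the instance name is derived from the connection,
so per-user instances like sgamba1@user1.service would require the daemon
itself to start the template instances.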


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Barry


> On 21 Feb 2022, at 21:16, Felip Moll  wrote:
> 
> 
> 
>> You could invoke a man:systemd-run for each new process. Then you can
>> put every single job in a separate .slice with its own
>> man:systemd.resource-control applied.
>> This would also mean that you don't need to compile against libsystemd.
>> Just exec() accordingly if a systemd system is detected.
>> 
>> BR
>> Silvio
> 
> Silvio,
> 
> As I commented in my previous post, creating every single job in a separate 
> slice is an overhead I cannot afford.
> An HTC system could run thousands of jobs per second, and doing extra 
> fork+execs plus waiting for systemd to fill up its internal structures and 
> manage it all is a no-no.

Are you assuming this or did you measure the cost?

Barry

> 
> One other option that I am thinking about is extending the parameters of a 
> unit file, for example adding a DelegateCgroupLeaf=yes option.
> 
> DelegateCgroupLeaf=. If set to yes, an extra directory will be created
> inside the unit's cgroup to hold the newly spawned service process. This is
> useful for services which need to be restarted while their forked pids remain
> in the cgroup and the service cgroup is no longer a leaf. This option is
> only valid when using Delegate=yes and on a system in unified mode.
> 
> E.g. in my example, that would end up like this:
> /sys/fs/cgroup/system.slice/sgamba1.service   <-- This is Delegate=yes DelegateCgroupLeaf=yes
> ├── sgamba1   <-- The spawned process would always be put in here by systemd.
> ├── user1_stuff
> ├── user2_stuff
> └── user3_stuff
> 
> I think this idea could work for cases like the one described here, and I
> think it would be quite useful.
>  
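
For what it's worth, a rough way to measure that cost (a sketch, run as root
on an otherwise idle machine):

# time the spawning of 100 transient scope units, one per job
time for i in $(seq 1 100); do systemd-run --scope --quiet true; done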


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Felip Moll
You could invoke a man:systemd-run for each new process. Then you can
> put every single job in a separate .slice with its own
> man:systemd.resource-control applied.
> This would also mean that you don't need to compile against libsystemd.
> Just exec() accordingly if a systemd system is detected.
>
> BR
> Silvio
>

Silvio,

As I commented in my previous post, creating every single job in a separate
slice is an overhead I cannot afford.
An HTC system could run thousands of jobs per second, and doing extra
fork+execs plus waiting for systemd to fill up its internal structures and
manage it all is a no-no.

One other option that I am thinking about is extending the parameters of a
unit file, for example adding a DelegateCgroupLeaf=yes option.

DelegateCgroupLeaf=. If set to yes, an extra directory will be
created inside the unit's cgroup to hold the newly spawned service process.
This is useful for services which need to be restarted while their forked
pids remain in the cgroup and the service cgroup is no longer a leaf.
This option is only valid when using Delegate=yes and on a system in
unified mode.

E.g. in my example, that would end up like this:
/sys/fs/cgroup/system.slice/sgamba1.service   <-- This is Delegate=yes DelegateCgroupLeaf=yes
├── sgamba1   <-- The spawned process would always be put in here by systemd.
├── user1_stuff
├── user2_stuff
└── user3_stuff

I think this idea could work for cases like the one described here, and I
think it would be quite useful.
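
In a unit file, the proposal would look something like this (DelegateCgroupLeaf=
is the hypothetical option described above; it does not exist in systemd today):

[Service]
ExecStart=/path_to_bin/mydaemond
Delegate=yes
# hypothetical: place the main process in an extra leaf sub-cgroup
# (e.g. .../sgamba1.service/sgamba1/) rather than in the unit's top-level
# cgroup.procs, so subtree_control can stay enabled across restarts
DelegateCgroupLeaf=yes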


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Silvio Knizek
On Monday, 2022-02-21 at 18:07 +0100, Felip Moll wrote:
> The hard requirement that my project has is that processes need to
> live even if the daemon who forked them dies.
> Roughly, it is how a batch scheduler works: a controller sends a
> request to my daemon to launch a process on behalf of a user, and
> my daemon fork-execs it. At some point my daemon can be stopped,
> restarted, upgraded, whatever, but the forked processes need to always
> be alive because they are continuing their work. We are talking here
> about the HPC world.
You could invoke a man:systemd-run for each new process. Then you can
put every single job in a separate .slice with its own
man:systemd.resource-control applied.
This would also mean that you don't need to compile against libsystemd.
Just exec() accordingly if a systemd system is detected.
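
(A minimal sketch of that approach; the slice/unit names, payload and limits
are made up:)

# one transient unit per job, with per-job resource control applied
systemd-run --slice=sgamba-jobs.slice --unit=job-4711 \
  -p MemoryMax=4G -p CPUQuota=200% \
  /usr/bin/user_payload --some-arg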

BR
Silvio


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Felip Moll
>
> Hmm? Hard requirement of what? Not following?
>
>
The hard requirement that my project has is that processes need to live
even if the daemon that forked them dies.
Roughly, it is how a batch scheduler works: a controller sends a request
to my daemon to launch a process on behalf of a user, and my daemon
fork-execs it. At some point my daemon can be stopped, restarted, upgraded,
whatever, but the forked processes need to always be alive because they are
continuing their work. We are talking here about the HPC world.


> You are leaving processes around when your service dies/restarts?
>

Yes.


> That's typically a bad idea, and generally a hack: the unit should
> probably be split up differently, i.e. the processes that shall stick
> around on restart should probably be in their own unit, i.e. another
> service or scope unit.
>

So, if I understand it correctly, you are suggesting that every forked
process must be started through a new systemd unit?
If that's the case it seems inconvenient, because we're talking about a job
scheduler which may sometimes have thousands of forked processes executed
in quick succession, and where performance is key.
Having to manage a unit for each process will probably not work in this
situation in terms of performance.

The other option I can imagine is to start a new Type=forking unit from my
daemon, which remains around until I decide to clean it up, even if it
doesn't have any processes inside.
Then I could put my processes in the associated cgroup instead of inside
the main daemon cgroup. Would that make sense?

The issue here is that for creating the new unit I'd need my daemon to
depend on systemd libraries, or to fork-exec systemd commands and parse
their output.
I am trying to keep dependencies to a minimum and I'd love to have an
alternative.
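
(One low-dependency possibility, sketched with busctl rather than libsystemd;
the same StartTransientUnit call can also be made over D-Bus directly, and
the unit name and PID here are made up:)

# wrap an already-forked PID 4321 in a transient, delegated scope unit
busctl call org.freedesktop.systemd1 /org/freedesktop/systemd1 \
  org.freedesktop.systemd1.Manager StartTransientUnit \
  'ssa(sv)a(sa(sv))' job-1234.scope fail \
  2 PIDs au 1 4321 Delegate b 1 \
  0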


> That's not supported. You may only create your own cgroups where you
> turned on delegation, otherwise all bets are off. If you put stuff in
> /sys/fs/cgroup/user-stuff it's as if you placed stuff in systemd's
> "-.slice" without telling it so, and things will break sooner or
> later, and often in non-obvious ways.
>

Yeah, I know and understand it is not supported, but I am more interested
in the technical details of how things would break.
I see in systemd/src/core/cgroup.c that it often differentiates a cgroup
with delegation from one without it (!unit_cgroup_delegate(u)), but it's
hard for me to find out how or where exactly this will clash with a
cgroup created outside of systemd. I'd appreciate it if you could shed
some light on why/when/where things will break in practice, or give an
example.

I am also aware of the single-writer rule in systemd's documentation, and
that this is not supported, but I'd like to understand exactly what can
happen.


Thanks for your help & time :)


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Lennart Poettering
On Mo, 21.02.22 14:14, Felip Moll (lip...@gmail.com) wrote:

> Hello,
>
> I am creating software that consists of one daemon which forks several
> processes in response to user requests.
> This is basically acting like a job scheduler.
>
> The daemon is started using a unit file and with Delegate=yes option,
> because every process must be constrained differently. I manage my cgroup
> hierarchy, create some leaves into the tree and put each pid there.
> For example, after starting up the service and receiving 3 user requests, a
> tree under /sys/fs/cgroup/system.slice/ could look like:
>
> sgamba1.service/
> ├── daemon_pid
> ├── user1_stuff
> ├── user2_stuff
> └── user3_stuff
>
> I create the hierarchy and set cgroup.subtree_control in the root directory
> (sgamba1.service in the example) and everything runs smoothly, until I
> decide to restart my service.
>
> The service then cannot restart:
>
> feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed to attach to
> cgroup /system.slice/sgamba1.service: Device or resource busy
> feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed at step
> CGROUP spawning /path_to_bin/mydaemond: Device or resource busy
>
> This is because systemd tries to put the pid of the new daemon in
> sgamba1.service/cgroup.procs and this would break the "no internal process
> constrain" rule for cgroup v2, since sgamba1.service is not a leaf anymore
> because it has subtree_control enabled for the user stuff.
>
> One hard requirement is that user stuff must live even if the service is
> restarted.

Hmm? Hard requirement of what? Not following?

You are leaving processes around when your service dies/restarts?
That's typically a bad idea, and generally a hack: the unit should
probably be split up differently, i.e. the processes that shall stick
around on restart should probably be in their own unit, i.e. another
service or scope unit.

> What's the way to achieve that? I see one easy way, which is to move user
> stuff into its own cgroup and out of sgamba1.service/, but then it will run
> outside a Delegate=yes unit. What can happen then?
> Will systemd eventually migrate my processes?
> How do services work around that issue?
> If I am moving user stuff into the root /sys/fs/cgroup/user_stuff/, will
> systemd touch my directories?

That's not supported. You may only create your own cgroups where you
turned on delegation, otherwise all bets are off. If you put stuff in
/sys/fs/cgroup/user-stuff it's as if you placed stuff in systemd's
"-.slice" without telling it so, and things will break sooner or
later, and often in non-obvious ways.
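
(Concretely, with the paths from this thread; a sketch:)

# inside the delegated subtree of the Delegate=yes unit: fine
mkdir /sys/fs/cgroup/system.slice/sgamba1.service/user1_stuff

# outside any delegated subtree: unsupported; systemd treats this as part
# of -.slice, and things will break sooner or later
mkdir /sys/fs/cgroup/user_stuff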

Lennart

--
Lennart Poettering, Berlin


[systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Felip Moll
Hello,

I am creating software that consists of one daemon which forks several
processes in response to user requests.
This is basically acting like a job scheduler.

The daemon is started using a unit file and with Delegate=yes option,
because every process must be constrained differently. I manage my cgroup
hierarchy, create some leaves into the tree and put each pid there.
For example, after starting up the service and receiving 3 user requests, a
tree under /sys/fs/cgroup/system.slice/ could look like:

sgamba1.service/
├── daemon_pid
├── user1_stuff
├── user2_stuff
└── user3_stuff

I create the hierarchy and set cgroup.subtree_control in the root directory
(sgamba1.service in the example) and everything runs smoothly, until I
decide to restart my service.

The service then cannot restart:

feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed to attach to
cgroup /system.slice/sgamba1.service: Device or resource busy
feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed at step
CGROUP spawning /path_to_bin/mydaemond: Device or resource busy

This is because systemd tries to put the pid of the new daemon in
sgamba1.service/cgroup.procs, and this would break the "no internal process
constraint" rule of cgroup v2, since sgamba1.service is not a leaf anymore
because it has subtree_control enabled for the user stuff.
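
(The rule can be reproduced in isolation; a sketch, as root on a unified
hierarchy, where the memory controller is just an example:)

# a domain cgroup with a non-empty subtree_control may not hold processes
mkdir /sys/fs/cgroup/demo /sys/fs/cgroup/demo/leaf
echo +memory > /sys/fs/cgroup/demo/cgroup.subtree_control
echo $$ > /sys/fs/cgroup/demo/cgroup.procs   # write error: Device or resource busy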

One hard requirement is that user stuff must live even if the service is
restarted.

What's the way to achieve that? I see one easy way, which is to move user
stuff into its own cgroup and out of sgamba1.service/, but then it will run
outside a Delegate=yes unit. What can happen then?
Will systemd eventually migrate my processes?
How do services work around that issue?
If I am moving user stuff into the root /sys/fs/cgroup/user_stuff/, will
systemd touch my directories?

Thank you.


*--Felip Moll*
E-Mail - lip...@gmail.com
Tlf. - +34 659 69 40 47