Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-28 Thread Lennart Poettering
On Do, 24.03.22 14:32, Benjamin Berg (benja...@sipsolutions.net) wrote:

> Hi,
>
> On Thu, 2022-03-24 at 12:40 +0100, Felip Moll wrote:
> > False, the JobRemoved signal returns the id, job, unit and result. To
> > wait for JobRemoved only needs a matching rule for this signal. The
> > matching rule can just contain the path. In fact, nothing else than
> > strings can be matched in a rule, so I may be only able to use the
> > path.
>
> I think you need to add a wildcard match before the job is created
> (i.e. before StartTransientUnit). Otherwise registering the match rule
> (using the job's object path) will race with systemd signalling that
> the job has completed.

Correct.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-28 Thread Lennart Poettering
On Do, 24.03.22 00:45, Felip Moll (fe...@schedmd.com) wrote:

> Hi, some days ago we were talking about this:
>
>
> > > Problem number two, there's a significant delay since when creating the
> > > scope, until it is ready and the pid attached into it. The only way it
> > > worked was to put a 'sleep' after the dbus call and make my process wait
> > > for the async call to dbus to be materialized. This is really
> > > un-elegant.
> >
> > If you want to synchronize in the cgroup creation to complete just
> > wait for the JobRemoved bus signal for the job returned by
> > StartTransientUnit().
> >
> >
> StartTransientUnit returns a string to a job object path. To call
> JobRemoved I need the job id, so the easier way to get it is to strip the
> last part of the returned string from StartTransientUnit job object path.
> Am I right?

JobRemoved is a signal, not a method call, i.e. it is not something you
call, but something you are notified about. It originates from an object,
and objects have object paths in D-Bus.

> Once I have the job id, I can then subscribe to JobRemoved bus signal for
> the recently created job, but what happens if during the time I am
> obtaining the ID or parsing the output, the job is finished? Will I lose
> the signal?

Yes. D-Bus sucks that way. You have to subscribe to all jobs first, and
then filter out the ones you don't want.

> What is the correct order of doing a StartTransientUnit and wait for the
> job to be finished (done, failed, whatever) ?

first subscribe to JobRemoved, then issue StartTransientUnit, and then
wait until you see JobRemoved for the unit you just started.
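
The ordering above can be illustrated without a real bus. What follows is a minimal sketch in Python, assuming a hypothetical in-memory FakeBus class that stands in for a D-Bus connection (it is not a real binding, and its method names are invented for illustration). Only the JobRemoved payload (id, job, unit, result) and the job object path layout follow systemd's D-Bus interface.

```python
import queue


class FakeBus:
    """Hypothetical stand-in for a D-Bus connection (illustration only)."""

    def __init__(self):
        self.subscribed = False
        self.inbox = queue.Queue()
        self._next_id = 1

    def subscribe_job_removed(self):
        # Stands in for installing a match rule for all JobRemoved signals.
        self.subscribed = True

    def start_transient_unit(self, unit):
        job_id = self._next_id
        self._next_id += 1
        job_path = f"/org/freedesktop/systemd1/job/{job_id}"
        # Worst case: the job finishes before the caller even sees job_path.
        self._emit((job_id, job_path, unit, "done"))
        return job_path

    def _emit(self, signal):
        # Signals nobody subscribed to are simply dropped, as on a real bus.
        if self.subscribed:
            self.inbox.put(signal)


def wait_for_job(bus, job_path, timeout=1.0):
    """Drain JobRemoved signals until the one for job_path shows up."""
    while True:
        _job_id, job, _unit, result = bus.inbox.get(timeout=timeout)
        if job == job_path:
            return result


bus = FakeBus()
bus.subscribe_job_removed()                    # 1. subscribe first
bus.start_transient_unit("other.scope")        # unrelated job, filtered out
job = bus.start_transient_unit("demo.scope")   # 2. start the unit
result = wait_for_job(bus, job)                # 3. wait for our own job
```

If the subscription happened after start_transient_unit() instead, the signal would already be gone and wait_for_job() would time out with queue.Empty.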

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-24 Thread Benjamin Berg
Hi,

On Thu, 2022-03-24 at 12:40 +0100, Felip Moll wrote:
> False, the JobRemoved signal returns the id, job, unit and result. To
> wait for JobRemoved only needs a matching rule for this signal. The
> matching rule can just contain the path. In fact, nothing else than
> strings can be matched in a rule, so I may be only able to use the
> path.

I think you need to add a wildcard match before the job is created
(i.e. before StartTransientUnit). Otherwise registering the match rule
(using the job's object path) will race with systemd signalling that
the job has completed.
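
A wildcard match of this kind could look roughly like the following. This is only a sketch: the helper names job_removed_match_rule() and is_our_job() are made up for illustration, and the rule deliberately names no specific job, so it can be installed before StartTransientUnit is called. The path in the rule is the manager object that emits JobRemoved, not the job's own object path.

```python
def job_removed_match_rule():
    """Build a D-Bus match rule string for every JobRemoved signal."""
    parts = {
        "type": "signal",
        "sender": "org.freedesktop.systemd1",
        "path": "/org/freedesktop/systemd1",
        "interface": "org.freedesktop.systemd1.Manager",
        "member": "JobRemoved",
    }
    return ",".join(f"{key}='{value}'" for key, value in parts.items())


def is_our_job(signal_args, job_path):
    """Filter step: JobRemoved carries (id, job, unit, result)."""
    _job_id, job, _unit, _result = signal_args
    return job == job_path


rule = job_removed_match_rule()
```

The filtering then happens on the client side, by comparing the job object path in each received signal with the one returned by StartTransientUnit.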

Benjamin



Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-24 Thread Felip Moll
Replying to myself on the first part of my question.

>> If you want to synchronize on the cgroup creation completing, just
>> wait for the JobRemoved bus signal for the job returned by
>> StartTransientUnit().
>>
>>
> StartTransientUnit returns a string to a job object path. To call
> JobRemoved I need the job id, so the easier way to get it is to strip the
> last part of the returned string from StartTransientUnit job object path.
> Am I right?
>
>
No: the JobRemoved signal carries the id, job, unit and result. Waiting
for JobRemoved only needs a match rule for this signal. The match
rule can just contain the path. In fact, nothing other than strings can be
matched in a rule, so I may only be able to use the path.





Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-23 Thread Felip Moll
Hi, some days ago we were talking about this:


> > Problem number two, there's a significant delay since when creating the
> > scope, until it is ready and the pid attached into it. The only way it
> > worked was to put a 'sleep' after the dbus call and make my process wait
> > for the async call to dbus to be materialized. This is really
> > un-elegant.
>
> If you want to synchronize in the cgroup creation to complete just
> wait for the JobRemoved bus signal for the job returned by
> StartTransientUnit().
>
>
StartTransientUnit returns a string to a job object path. To call
JobRemoved I need the job id, so the easier way to get it is to strip the
last part of the returned string from StartTransientUnit job object path.
Am I right?

Once I have the job id, I can then subscribe to JobRemoved bus signal for
the recently created job, but what happens if during the time I am
obtaining the ID or parsing the output, the job is finished? Will I lose
the signal?

What is the correct order of doing a StartTransientUnit and wait for the
job to be finished (done, failed, whatever) ?


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-18 Thread Felip Moll
On Wed, Mar 16, 2022 at 6:29 PM Felip Moll wrote:

>
> On Wed, Mar 16, 2022 at 5:53 PM Lennart Poettering wrote:
>
>> On Mi, 16.03.22 17:30, Felip Moll (fe...@schedmd.com) wrote:
>>
>> > AFAIK RemainAfterExit for services actually does cleanup the cgroup
>> > tree if there are no more processes in it.
>>
>> It doesn't do that if delegation is on (iirc, if not I'd consider that
>> a bug). Same logic should apply here.
>>
>>
> I will recheck that, but I am quite sure that on some tests I did the
> cgroup was cleaned up on a delegated service after the main pid terminated.
>
>
Is that a bug then?

1. Start a service with Delegate=yes and RemainAfterExit=yes
2. Wait for the main process to exit
3. Check that the unit is still active
4. Check that the cgroup is still there <--- It is gone once there are no pids in it

]# systemd-run -u test -p "Delegate=yes" -p "RemainAfterExit=yes" sleep 60
Running as unit: test.service
]# systemctl status test.service
● test.service - /usr/bin/sleep 60
 Loaded: loaded (/run/systemd/transient/test.service; transient)
  Transient: yes
 Active: active (running) since Fri 2022-03-18 09:47:32 CET; 5s ago
   Main PID: 6083 (sleep)
  Tasks: 1 (limit: 14068)
 Memory: 316.0K
 CGroup: /system.slice/test.service
 └─6083 /usr/bin/sleep 60

de març 18 09:47:32 llagosti systemd[1]: Started /usr/bin/sleep 60.
]# cat /proc/6083/cgroup
12:perf_event:/
11:pids:/system.slice/test.service
10:devices:/system.slice/test.service
9:cpuset:/
8:blkio:/system.slice/test.service
7:net_cls,net_prio:/
6:memory:/system.slice/test.service
5:misc:/
4:cpu,cpuacct:/system.slice/test.service
3:hugetlb:/
2:freezer:/
1:name=systemd:/system.slice/test.service
0::/system.slice/test.service
]# ls /sys/fs/cgroup/memory/system.slice/test.service/
cgroup.clone_children  memory.kmem.failcnt memory.kme...
..
[root@llagosti slurm.gitlab.lipixx]# systemctl status test.service
● test.service - /usr/bin/sleep 60
 Loaded: loaded (/run/systemd/transient/test.service; transient)
  Transient: yes
 Active: active (exited) since Fri 2022-03-18 09:47:32 CET; 1min 21s ago
Process: 6083 ExecStart=/usr/bin/sleep 60 (code=exited, status=0/SUCCESS)
   Main PID: 6083 (code=exited, status=0/SUCCESS)

de març 18 09:47:32 llagosti systemd[1]: Started /usr/bin/sleep 60.
]# ls /sys/fs/cgroup/memory/system.slice/test.service/
ls: cannot access '/sys/fs/cgroup/memory/system.slice/test.service/': No such file or directory
]# systemctl cat test.service
# /run/systemd/transient/test.service
# This is a transient unit file, created programmatically via the systemd API. Do not edit.
[Unit]
Description=/usr/bin/sleep 60

[Service]
Delegate=yes
RemainAfterExit=yes
ExecStart=
ExecStart="/usr/bin/sleep" "60"


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Felip Moll
On Wed, Mar 16, 2022 at 5:53 PM Lennart Poettering wrote:

> On Mi, 16.03.22 17:30, Felip Moll (fe...@schedmd.com) wrote:
>
> > AFAIK RemainAfterExit for services actually does cleanup the cgroup
> > tree if there are no more processes in it.
>
> It doesn't do that if delegation is on (iirc, if not I'd consider that
> a bug). Same logic should apply here.
>
>
I will recheck that, but I am quite sure that on some tests I did the
cgroup was cleaned up on a delegated service after the main pid terminated.



> > If that behavior of keeping the cgroup tree even if there are no pids is
> > what you agree with, then I agree it is a good idea to add this option
> > to scopes.
>
> Yes, that is what I was suggesting this would do.
>
>
Excellent.



> > Or are you saying that I can just migrate processes wildly without
> > informing systemd and just doing an 'echo > cgroup.procs' from one
> > non-delegated tree to my delegated subtree?
>
> yeah, you can do that.
>
>
Ok, so I understood that incorrectly from a former paragraph you wrote in
our first e-mails. You said:

> Migrating processes wildly between cgroups is messy, because it fucks
> up accounting and is restricted permission-wise. Typically you want to
> create a cgroup and populate it, and then stick to that.

There, I understood you were referring to "systemd" accounting, not
"kernel" accounting.
This has been a big misunderstanding for this issue.


> Note that (independently of systemd) you shouldn't migrate stuff too
> aggressively, since it fucks up kernel resource accounting. i.e. it is
> wise to minimize process migration in cgroups and always migrate plus
> shortly after exec()
>

Yeah, that makes sense and I am aware of it.
I am migrating before any real work is done, exactly as you describe.


I will continue a bit more with this and inform you on what I see, but we
seem to be close to a solution.

Thank you!


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Lennart Poettering
On Mi, 16.03.22 17:35, Michal Koutný (mkou...@suse.com) wrote:

> True, in the unified mode it should be safe doing manually.
> I was worried about migrating e.g. MainPID of a service into this scope
> but PID1 should handle that AFAICS. Also since this has to be performed
> by the privileged user (scopes are root's), the manual migration works.

This is actually a common case: for getty-style login processes, the main
process of the getty service migrates to the new scope. A service
is thus always a cgroup *and* a main pid for us, to cover the case where
the main pid is outside of the cgroup. And conversely, a process can be
associated with multiple units this way. It can be the main pid of one
service and be in the cgroup of a scope.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Lennart Poettering
On Mi, 16.03.22 17:30, Felip Moll (fe...@schedmd.com) wrote:

> > > (The above is slightly misleading) there could be an alternative of
> > > something like RemainAfterExit=yes for scopes, i.e. such scopes would
> > > not be stopped after last process exiting (but systemd would still be in
> > > charge of cleaning the cgroup after explicit stop request and that'd
> > > also mark the scope as truly stopped).
> >
> > Yeah, I'd be fine with adding RemainAfterExit= to scope units
> >
> >
> Note that what Michal is saying is "something like RemainAfterExit=yes for
> scopes", which means systemd would NOT clean up the cgroup tree when there
> are no processes inside.
> AFAIK RemainAfterExit for services actually does cleanup the cgroup tree if
> there are no more processes in it.

It doesn't do that if delegation is on (iirc, if not I'd consider that
a bug). Same logic should apply here.

> If that behavior of keeping the cgroup tree even if there are no pids is
> what you agree with, then I agree it is a good idea to add this option
> to scopes.

Yes, that is what I was suggesting this would do.

> > > Such a recycled scope would only be useful via
> > > org.freedesktop.systemd1.Manager.AttachProcessesToUnit().
> >
> > Well, if delegation is on, then people don't really have to use our
> > API, they can just do that themselves.
>
> That's not exact. If slurmd (my main process) forks a slurmstepd (child
> process) and I want to move slurmstepd into a delegated subtree from the
> scope I already created, I must use AttachProcessesToUnit(), isn't that
> true?

depends on your privs. You can just move it yourself if you have
enough privs.

See commit msg in 6592b9759cae509b407a3b49603498468bf5d276

> Or are you saying that I can just migrate processes wildly without
> informing systemd and just doing an 'echo > cgroup.procs' from one
> non-delegated tree to my delegated subtree?

yeah, you can do that.

Note that (independently of systemd) you shouldn't migrate stuff too
aggressively, since it fucks up kernel resource accounting. i.e. it is
wise to minimize process migration in cgroups and always migrate plus
shortly after exec(), or even better do a clone(CLONE_INTO_CGROUP) –
though unfortunately the latter cannot work with glibc right now :-(.

i.e. keeping processes that already "have history" around for a long
time after migration kinda sucks.
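
Doing the move yourself amounts to writing the pid into the cgroup.procs file of the target cgroup. A minimal sketch, assuming sufficient privileges and a delegated target directory (the function name move_pid_to_cgroup() and the path in the comment are hypothetical):

```python
import os


def move_pid_to_cgroup(cgroup_dir, pid):
    """Write pid into cgroup_dir/cgroup.procs, like 'echo PID > cgroup.procs'."""
    procs = os.path.join(cgroup_dir, "cgroup.procs")
    with open(procs, "w") as f:
        f.write(f"{pid}\n")


# Hypothetical usage, ideally right after fork() and before any real work:
# move_pid_to_cgroup("/sys/fs/cgroup/system.slice/example.scope/job_1", os.getpid())
```

The kernel rejects the write if the caller lacks permission on the source or destination cgroup, so error handling around the call is advisable in real code.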

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Michal Koutný
On Wed, Mar 16, 2022 at 05:06:28PM +0100, Lennart Poettering wrote:
> > That owner would be a process -- bang, you created a service with
> > delegation or a scope with "keepalive" process.
> 
> can't parse this.

That was meant as a humorous proof by contradiction that delegation on
slices is unnecessary. Nvm.

> > (The above is slightly misleading) there could be an alternative of
> > something like RemainAfterExit=yes for scopes, i.e. such scopes would
> > not be stopped after last process exiting (but systemd would still be in
> > charge of cleaning the cgroup after explicit stop request and that'd
> > also mark the scope as truly stopped).
> 
> Yeah, I'd be fine with adding RemainAfterExit= to scope units

Felip, I'd happily review such a PR ;-)


> > Such a recycled scope would only be useful via
> > org.freedesktop.systemd1.Manager.AttachProcessesToUnit().
> 
> Well, if delegation is on, then people don't really have to use our
> API, they can just do that themselves.

True, in the unified mode it should be safe to do manually.
I was worried about migrating e.g. MainPID of a service into this scope
but PID1 should handle that AFAICS. Also since this has to be performed
by the privileged user (scopes are root's), the manual migration works.

Michal


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Felip Moll
> > (The above is slightly misleading) there could be an alternative of
> > something like RemainAfterExit=yes for scopes, i.e. such scopes would
> > not be stopped after last process exiting (but systemd would still be in
> > charge of cleaning the cgroup after explicit stop request and that'd
> > also mark the scope as truly stopped).
>
> Yeah, I'd be fine with adding RemainAfterExit= to scope units
>
>
Note that what Michal is saying is "something like RemainAfterExit=yes for
scopes", which means systemd would NOT clean up the cgroup tree when there
are no processes inside.
AFAIK RemainAfterExit for services actually does cleanup the cgroup tree if
there are no more processes in it.

If that behavior of keeping the cgroup tree even if there are no pids is
what you agree with, then I agree it is a good idea to add this option
to scopes.



> > Such a recycled scope would only be useful via
> > org.freedesktop.systemd1.Manager.AttachProcessesToUnit().
>
> Well, if delegation is on, then people don't really have to use our
> API, they can just do that themselves.
>
>
That's not exact. If slurmd (my main process) forks a slurmstepd (child
process) and I want to move slurmstepd into a delegated subtree from the
scope I already created, I must use AttachProcessesToUnit(), isn't that
true?
Or are you saying that I can just migrate processes wildly without
informing systemd and just doing an 'echo > cgroup.procs' from one
non-delegated tree to my delegated subtree?


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Lennart Poettering
On Mi, 16.03.22 16:15, Felip Moll (fe...@schedmd.com) wrote:

> On Tue, Mar 15, 2022 at 5:24 PM Michal Koutný wrote:
>
> > On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll wrote:
> > > Meaning that it would be great to have a delegated cgroup subtree without
> > > the need of a service or scope.
> > > Just an empty subtree.
> >
> > It looks appealing to add Delegate= directive to slice units.
> > Firstly, that'd prevent the use of the slice by anything systemd.
> > Then some notion of owner of that subtree would have to be defined (if
> > only for cleanup).
> > That owner would be a process -- bang, you created a service with
> > delegation or a scope with "keepalive" process.
> >
> >
> Correct, this is how the current systemd design works.
> But... what if the concept of owner was irrelevant? What if we could just
> tell systemd, hey, give me /sys/fs/cgroup/mysubdir and never ever touch it
> or do anything to it or pids residing into it.

No, that's not something we will offer. We bind a lot of meaning to
the cgroup concept. i.e. we derive unit info from it, and many things
are based on that. For example any client logging to journald will do
so from a cgroup and we pick that up to know which service logging is
from, and store that away and use it for filtering, for picking
per-unit log settings and so on.

Moreover, we need to be able to shut down all processes on the system in
a systematic way at shutdown, and we do that based on units and the
ordering between them. Having processes and cgroups that live entirely
independently makes a total mess of this.

And there's a lot more, like resource mgmt: we want that all processes
on the system are placed in a unit of some form so that we can apply
useful resource mgmt to it.

So yes you can have a delegated subtree, if you like and we'll not
interfere with what you do there mostly, but it must be a leaf of our
tree, and we'll "macro manage" it for you, i.e. define a lifetime for
it, and track processes back to it.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Lennart Poettering
On Di, 15.03.22 17:24, Michal Koutný (mkou...@suse.com) wrote:

> On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll wrote:
> > Meaning that it would be great to have a delegated cgroup subtree without
> > the need of a service or scope.
> > Just an empty subtree.
>
> It looks appealing to add Delegate= directive to slice units.

Hm? Slice units are *inner* nodes of *our* cgroup trees. If we allowed
delegation of that, then we could not put stuff inside it, hence it
wouldn't be a slice because it couldn't contain anything anymore.

> Firstly, that'd prevent the use of the slice by anything systemd.

yeah, precisely? i don't follow. What would a slice with delegation be
that a scope with delegation isn't already?

> Then some notion of owner of that subtree would have to be defined (if
> only for cleanup).

scopes already have that, so why not use that?

> That owner would be a process -- bang, you created a service with
> delegation or a scope with "keepalive" process.

can't parse this.

> (The above is slightly misleading) there could be an alternative of
> something like RemainAfterExit=yes for scopes, i.e. such scopes would
> not be stopped after last process exiting (but systemd would still be in
> charge of cleaning the cgroup after explicit stop request and that'd
> also mark the scope as truly stopped).

Yeah, I'd be fine with adding RemainAfterExit= to scope units

> Such a recycled scope would only be useful via
> org.freedesktop.systemd1.Manager.AttachProcessesToUnit().

Well, if delegation is on, then people don't really have to use our
API, they can just do that themselves.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Lennart Poettering
On Di, 15.03.22 16:35, Felip Moll (fe...@schedmd.com) wrote:

> > I don't follow. You can enable delegation on the scope. I mean, that's
> > the reason I suggested to use a scope.
> >
> >
> Meaning that it would be great to have a delegated cgroup subtree without
> the need of a service or scope.
> Just an empty subtree.

That's what a scope is. I don't follow?

What do you think a scope is beyond that? It just encapsulates a
cgroup subtree. It auto-cleans it though once it goes empty, and
because it does that it also requires you to provide at least one PID
to add to the scope when it is created.
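
Whether a cgroup has "gone empty" can be observed from the outside in unified mode, because the kernel exposes a populated key in the cgroup.events file of every non-root cgroup. A small sketch for reading it (the function name cgroup_is_populated() is made up, and this only illustrates the kernel interface, not how systemd itself is implemented):

```python
import os


def cgroup_is_populated(cgroup_dir):
    """Return True if cgroup.events reports live processes in the subtree."""
    with open(os.path.join(cgroup_dir, "cgroup.events")) as f:
        for line in f:
            key, _, value = line.strip().partition(" ")
            if key == "populated":
                return value == "1"
    raise ValueError("no 'populated' key in cgroup.events")
```

A value of 0 means neither the cgroup nor any of its descendants contains a live process, which is the condition under which a scope gets cleaned up.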

For services we have a RemainAfterExit= property btw. There were
requests for adding the same for scopes. I'd be fine with adding that,
happy to take a patch.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-16 Thread Felip Moll
On Tue, Mar 15, 2022 at 5:24 PM Michal Koutný wrote:

> On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll wrote:
> > Meaning that it would be great to have a delegated cgroup subtree without
> > the need of a service or scope.
> > Just an empty subtree.
>
> It looks appealing to add Delegate= directive to slice units.
> Firstly, that'd prevent the use of the slice by anything systemd.
> Then some notion of owner of that subtree would have to be defined (if
> only for cleanup).
> That owner would be a process -- bang, you created a service with
> delegation or a scope with "keepalive" process.
>
>
Correct, this is how the current systemd design works.
But... what if the concept of owner was irrelevant? What if we could just
tell systemd, hey, give me /sys/fs/cgroup/mysubdir and never ever touch it
or do anything to it or to the pids residing in it?



> (The above is slightly misleading) there could be an alternative of
> something like RemainAfterExit=yes for scopes, i.e. such scopes would
> not be stopped after last process exiting (but systemd would still be in
> charge of cleaning the cgroup after explicit stop request and that'd
> also mark the scope as truly stopped).
> Such a recycled scope would only be useful via
> org.freedesktop.systemd1.Manager.AttachProcessesToUnit().
>
>
This is also a good idea.



> BTW I'm also wondering how do you detect a job finishing in the case
> original parent is gone (due to main service restart) and job's main
> process reparented?
>
>
slurmstepd connects to slurmd through socket and sends an RPC.
If slurmd is gone, slurmstepd (child) will retry the RPC and remain until
slurmd appears again and responds.

The main process doesn't wait for its children; instead we do a double
fork so that the child is reparented to the init process (PID 1).
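
For readers unfamiliar with the double-fork technique, here is a generic sketch in Python. It is not Slurm's code, and the function name double_fork() is made up. The intermediate child exits immediately, so the grandchild is reparented to PID 1 (or to the nearest subreaper, if one is set).

```python
import os


def double_fork(target):
    """Run target() in a grandchild process and return the grandchild's pid."""
    read_fd, write_fd = os.pipe()
    child = os.fork()
    if child == 0:
        # Intermediate child: fork again, report the grandchild pid, exit.
        status = 1
        try:
            os.close(read_fd)
            grandchild = os.fork()
            if grandchild == 0:
                os.close(write_fd)
                target()
            else:
                os.write(write_fd, str(grandchild).encode())
            status = 0
        finally:
            # Never fall back into the caller's code in a forked process.
            os._exit(status)
    os.close(write_fd)
    data = os.read(read_fd, 32)
    os.close(read_fd)
    os.waitpid(child, 0)  # reap the intermediate child right away
    return int(data)
```

The original parent never has to wait for the grandchild, which is why the parent can restart without leaving zombies behind.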


> BTW 2 You didn't like having a scope for each job. Is it because of the
> setup time (IOW jobs are short-lived) or persistent scopes overhead (too
> many units, PID1 scalability)?
>

It is not that I didn't like it. It is that I observed a delay in step
creation (forking slurmstepd) because sending an async dbus message requires
the stepd to wait for the systemd job to be executed, which can take time;
computationally a lot more than a plain mkdir in the cgroup subtree. To
give an example, an 'srun hostname' command starts a job which runs
hostname. The response is instantaneous with mkdir, but it takes almost 1
second with a call to systemd through dbus. Slurm is used for HPC, but also
for HTC (High Throughput Computing), which means hundreds of jobs can be
started in a short period of time. So yes, this delay is critical, and not
only because jobs can be short-lived: there can also be a massive job finish
+ job start at the same time. I ran just one test of our regression suite and
the responsiveness of 'systemctl list-unit-files' was compromised. From the
point of view of a sysadmin this was not ideal either, so, as you say, the
scalability of PID1 is also a concern.

This is the reason I will not be using 1 scope per job, and I prefer the
other solution to have 1 single scope with Delegate=yes.

Does it make sense?
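
To make the comparison concrete: with a single delegated scope, creating a per-job cgroup is just a mkdir below the scope's directory, with no round trip through PID 1. A minimal sketch, where the function name make_job_cgroup(), the job_<id> naming and the scope path in the comment are hypothetical:

```python
import os


def make_job_cgroup(delegated_root, job_id):
    """Create (or reuse) a per-job sub-cgroup below a delegated scope."""
    path = os.path.join(delegated_root, f"job_{job_id}")
    os.makedirs(path, exist_ok=True)
    return path


# Hypothetical usage on a host with one delegated scope for all steps:
# make_job_cgroup("/sys/fs/cgroup/system.slice/slurmstepd.scope", 42)
```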


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-15 Thread Michal Koutný
On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll wrote:
> Meaning that it would be great to have a delegated cgroup subtree without
> the need of a service or scope.
> Just an empty subtree.

It looks appealing to add Delegate= directive to slice units.
Firstly, that'd prevent the use of the slice by anything systemd.
Then some notion of owner of that subtree would have to be defined (if
only for cleanup).
That owner would be a process -- bang, you created a service with
delegation or a scope with "keepalive" process.

(The above is slightly misleading) there could be an alternative of
something like RemainAfterExit=yes for scopes, i.e. such scopes would
not be stopped after last process exiting (but systemd would still be in
charge of cleaning the cgroup after explicit stop request and that'd
also mark the scope as truly stopped).
Such a recycled scope would only be useful via
org.freedesktop.systemd1.Manager.AttachProcessesToUnit().

BTW I'm also wondering how do you detect a job finishing in the case
original parent is gone (due to main service restart) and job's main
process reparented?

BTW 2 You didn't like having a scope for each job. Is it because of the
setup time (IOW jobs are short-lived) or persistent scopes overhead (too
many units, PID1 scalability)?

Michal


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-15 Thread Felip Moll
> It's shown as active, so where is the problem?
>
>
I have found the problem.
I start my main process (slurmd) on a terminal, which then fork-execs a
/bin/sleep infinity and creates a new scope, adding the pid of the sleep.

If the slurmd is terminated with ctrl+c then the child processes die, so
the scope is destroyed. So I need to daemonize the sleep.
Or... use a service directly.
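
One possible way to keep such a stub alive across Ctrl+C is to start it in its own session, so it is no longer part of the terminal's foreground process group. This is only a sketch of that idea (the function name spawn_stub() is made up), not a full daemonization:

```python
import subprocess


def spawn_stub(argv=("/bin/sleep", "infinity")):
    """Start a placeholder process detached from the controlling terminal."""
    return subprocess.Popen(
        list(argv),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # child calls setsid() before exec
    )
```

The pid of the returned process could then be passed to the scope at creation time.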


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-15 Thread Felip Moll
On Tue, Mar 15, 2022 at 1:29 PM Lennart Poettering wrote:

> On Mo, 14.03.22 23:12, Felip Moll (fe...@schedmd.com) wrote:
>
> > > But note that you can also run your main service as a service, and
> > > then allocate a *single* scope unit for *all* your payloads.
> >
> > The main issue is the scope needs a pid attached to it. I thought that
> > the scope could live without any process inside, but that's not happening.
> > So every time a user step/job finishes, my main process must take care of
> > it, and launch the scope again on the next coming job.
>
> Leave a stub process around in it. i.e something similar to
> "/bin/sleep infinity".
>
>
Ok.. this was my initial idea.


> > The forked process just does the dbus call, and when the scope is ready
> > it is moved to the corresponding cgroup (PIDFile=).
>
> Hmm? PIDFile= is a property of *services*, not *scopes*.
>
>
Sorry I meant PIDs, not PIDFile of course.


> And "scopes" cannot be moved to "cgroups". I cannot parse the above.
>
>
The forked process X does the dbus call to start the scope with
PIDs=$(pidof X), and when the scope is ready,
X is moved into the scope cgroup.
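
For reference, StartTransientUnit takes four arguments with the D-Bus signature ssa(sv)a(sa(sv)): the unit name, a mode, a list of properties and an auxiliary list. The sketch below only builds those arguments as plain Python values; it does not talk to the bus. The helper name scope_request() is made up, and the mode "fail" and Delegate=yes are choices made for this example.

```python
import os


def scope_request(name, pids):
    """Build the argument tuple for a StartTransientUnit call on a scope."""
    properties = [
        ("PIDs", [int(p) for p in pids]),  # D-Bus type "au"
        ("Delegate", True),                # D-Bus type "b"
    ]
    aux = []  # a(sa(sv)): passed as an empty array
    return (name, "fail", properties, aux)


# Process X asks for a scope that contains itself.
request = scope_request("slurmstepd.scope", [os.getpid()])
```

A real client would hand these values to its D-Bus library, marshalled with the types noted in the comments.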


> Did you read up on scopes and services?
>
> See https://systemd.io/CGROUP_DELEGATION/, it explains the concept of
> "scopes". Scopes *have* cgroups, but cannot be moved to "cgroups".
>
>
Yes, it was a misunderstanding of my previous sentence.


> > Problem number one: if other processes are in the scope, the dbus call
> > won't work since I am using the same name all the time, e.g.
> > slurmstepd.scope.
> > So I first need to check if the scope exists and if so put the new
> > slurmstepd process inside. But we still have the race condition, if
> > during this phase all steps ends, systemd will do the cleanup.
>
> Leave a stub process around in it.


Ok, then I don't see a real difference from starting up a new service.


> > If instead I could just ask systemd to delegate a part of the tree for my
> > processes, then everything would be solved.
>
> I don't follow. You can enable delegation on the scope. I mean, that's
> the reason I suggested to use a scope.
>
>
Meaning that it would be great to have a delegated cgroup subtree without
the need of a service or scope.
Just an empty subtree.


> > Do you have any other suggestions?
>
> Not really, except maybe: please read up on the documentation, it
> explains a lot of the concepts.
>
>
I have; I may not be expressing myself perfectly, though. I apologize for
that.


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-15 Thread Lennart Poettering
On Di, 15.03.22 10:50, Felip Moll (fe...@schedmd.com) wrote:

> Another thing I have found is that if the process which created the scope
> (e.g. my main process, slurmd) terminates, then the scope is stopped even
> if I abandoned it and there's a pid inside.
> So this makes the proposed solution not working. What am I missing?
>
> ● gamba11_slurmstepd.scope
>  Loaded: loaded (/run/systemd/transient/gamba11_slurmstepd.scope; transient)
>  Transient: yes
>  Active: active (abandoned) since Tue 2022-03-15 10:40:34 CET; 4s ago

It's shown as active, so where is the problem?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-15 Thread Lennart Poettering
On Mo, 14.03.22 23:12, Felip Moll (fe...@schedmd.com) wrote:

> > But note that you can also run your main service as a service, and
> > then allocate a *single* scope unit for *all* your payloads.
>
> The main issue is the scope needs a pid attached to it. I thought that the
> scope could live without any process inside, but that's not happening.
> So every time a user step/job finishes, my main process must take care of
> it, and launch the scope again on the next coming job.

Leave a stub process around in it. i.e something similar to
"/bin/sleep infinity".

> The forked process just does the dbus call, and when the scope is ready it
> is moved to the corresponding cgroup (PIDFile=).

Hmm? PIDFile= is a property of *services*, not *scopes*.

And "scopes" cannot be moved to "cgroups". I cannot parse the above.

Did you read up on scopes and services?

See https://systemd.io/CGROUP_DELEGATION/, it explains the concept of
"scopes". Scopes *have* cgroups, but cannot be moved to "cgroups".

> Problem number one: if other processes are in the scope, the dbus call
> won't work since I am using the same name all the time, e.g.
> slurmstepd.scope.
> So I first need to check if the scope exists and if so put the new
> slurmstepd process inside. But we still have the race condition, if during
> this phase all steps ends, systemd will do the cleanup.

Leave a stub process around in it.

> Problem number two, there's a significant delay since when creating the
> scope, until it is ready and the pid attached into it. The only way it
> worked was to put a 'sleep' after the dbus call and make my process wait
> for the async call to dbus to be materialized. This is really
> un-elegant.

If you want to wait for the cgroup creation to complete, just
wait for the JobRemoved bus signal for the job returned by
StartTransientUnit().
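
The waiting logic can be sketched independently of any D-Bus binding. The class below is a hypothetical helper, not part of systemd or libdbus: on_job_removed() is meant to be registered as the handler for the JobRemoved signal (arguments: id, job object path, unit name, result string), and wait() is called with the job path that StartTransientUnit() returned. It remembers every signal it sees, so a signal that arrives before the job path is known is not lost.

```python
import threading


class JobWaiter:
    """Collect JobRemoved signals and wait for one specific job."""

    def __init__(self):
        self._cond = threading.Condition()
        self._results = {}  # job object path -> result string

    def on_job_removed(self, job_id, job_path, unit, result):
        # Mirrors the JobRemoved signal arguments: (u id, o job, s unit, s result).
        with self._cond:
            self._results[job_path] = result
            self._cond.notify_all()

    def wait(self, job_path, timeout=None):
        # Returns the result string (e.g. "done" or "failed"), or None on timeout.
        with self._cond:
            self._cond.wait_for(lambda: job_path in self._results, timeout)
            return self._results.pop(job_path, None)
```

For this to be race-free the handler has to be subscribed before StartTransientUnit() is issued. Results for jobs nobody waits on stay in the dictionary, so a long-running daemon would want to prune them.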

> If instead I could just ask systemd to delegate a part of the tree for my
> processes, then everything would be solved.

I don't follow. You can enable delegation on the scope. I mean, that's
the reason I suggested to use a scope.

> Do you have any other suggestions?

Not really, except maybe: please read up on the documentation, it
explains a lot of the concepts.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-15 Thread Felip Moll
Another thing I have found is that if the process which created the scope
(e.g. my main process, slurmd) terminates, then the scope is stopped even
if I abandoned it and there's a pid inside.
So this makes the proposed solution not working. What am I missing?

● gamba11_slurmstepd.scope
     Loaded: loaded (/run/systemd/transient/gamba11_slurmstepd.scope; transient)
  Transient: yes
     Active: active (abandoned) since Tue 2022-03-15 10:40:34 CET; 4s ago
      Tasks: 1 (limit: 38333)
     Memory: 0B
        CPU: 0
     CGroup: /system.slice/gamba11_slurmstepd.scope
             └─system
               └─18000 /home/lipi/slurm/master/inst/sbin/slurmstepd infinity


mar 15 10:40:53 llit systemd[1]: gamba11_slurmstepd.scope: Succeeded.


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-14 Thread Felip Moll
Hi folks. I continued with my investigation on the best way to solve my
problem.
As suggested I am calling StartTransientUnit method with dbus (using
libdbus), to start a new scope.
Below are my impressions.

> Firing an async D-Bus packet to systemd should be hardly measurable.
>
> But note that you can also run your main service as a service, and
> then allocate a *single* scope unit for *all* your payloads.


The main issue is the scope needs a pid attached to it. I thought that the
scope could live without any process inside, but that's not happening.
So every time a user step/job finishes, my main process must take care of
it, and launch the scope again on the next coming job.
There's also a race condition when a job is finishing and another one is
starting up, at this point the scope can be destroyed but the main process
may not realize it.

I also tried to leave the responsibility of setting up the scope to the
forked process itself, which is much easier to code and cleaner because of
how the software is designed.
The forked process just does the dbus call, and when the scope is ready it
is moved to the corresponding cgroup (PIDFile=).

Problem number one: if other processes are in the scope, the dbus call
won't work since I am using the same name all the time, e.g.
slurmstepd.scope.
So I first need to check if the scope exists and if so put the new
slurmstepd process inside. But we still have the race condition, if during
this phase all steps ends, systemd will do the cleanup.

Problem number two, there's a significant delay since when creating the
scope, until it is ready and the pid attached into it. The only way it
worked was to put a 'sleep' after the dbus call and make my process wait
for the async call to dbus to be materialized. This is really un-elegant.


> That way
> you can restart your main service unit independently of the scope
> unit, but you only have to issue a single request once for allocating
> the scope, and not for each of your payloads.
>
>
Yes. That is solved, I can restart slurmd now, but the other part is not
true as I just explained.
I need to issue new requests every time the scope is cleaned up by systemd.


> But that too means you have to issue a bus call. If you really don't
> like talking to systemd this is not going to work of course, but quite
> frankly, that's a problem you are making yourself, and I am not
> particularly sympathetic to it.
>
>
This is not a problem, but the delay of creating a scope plus it being
removed all the time is unacceptable.

My only idea now is to start a scope from the main process, adding a "sleep
infinity" pid inside, and relieve everyone else from ever creating scopes or
calling dbus.
If instead I could just ask systemd to delegate a part of the tree for my
processes, then everything would be solved.

Do you have any other suggestions?


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-03 Thread Lennart Poettering
On Do, 03.03.22 18:35, Felip Moll (fe...@schedmd.com) wrote:

> I have read and studied all your suggestions and I understand them.
> I also did some performance tests in which I fork+executed a systemd-run to
> launch a service for every step and I got bad performance overall.
> One of our QA tests (test 9.8 of our testsuite) shows a decrease of
> performance of 3x.

systemd-run is synchronous, and unless you specify "--scope" it will
tell systemd to fork things off instead of doing that client-side,
which I understand is what you want to do. The fact it's synchronous,
i.e. waits for completion of the whole operation (including start-up
of dependencies and whatnot) necessarily means it's slow.

> > But note that you can also run your main service as a service, and
> > then allocate a *single* scope unit for *all* your payloads. That way
> > you can restart your main service unit independently of the scope
> > unit, but you only have to issue a single request once for allocating
> > the scope, and not for each of your payloads.
> >
> >
> My questions are, where would the scope reside? Does it have an associated
> cgroup?

Yes, I explicitly pointed you to them, it's why I suggested you use
them.

My recommendation if you hack on stuff like this is reading the docs
btw, specifically:

 https://systemd.io/CGROUP_DELEGATION

It pretty explicitly lists your options in the "Three Scenarios"
section.

It also explains what scope units are and when to use them.

> I am also curious of what this sentence does exactly mean:
>
> "You might break systemd as a whole though (for example, add a process
> directly to a slice's cgroup and systemd will be very sad).".

if you add a process to a cgroup systemd manages that is supposed to
be an inner one in the tree, you will make creation of children fail
that way, and thus starting services and other operations will likely
start failing all over the place.
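
The underlying cgroup v2 rule can be illustrated with a small sketch. Both predicates are hypothetical helpers and deliberately conservative approximations: the kernel exempts the root cgroup, and the constraint concerns domain controllers such as memory. The demo runs on plain files in a temporary directory, not on a real cgroupfs, and the names example.slice and payload.scope are made up.

```python
import tempfile
from pathlib import Path


def read_words(path):
    """Return the whitespace-separated entries of a control file ([] if absent)."""
    return path.read_text().split() if path.exists() else []


def attach_would_fail(cgroup_dir):
    """Expect EBUSY when writing a PID into a cgroup that already has
    controllers enabled for its children (an inner node)."""
    return bool(read_words(Path(cgroup_dir) / "cgroup.subtree_control"))


def enabling_controllers_would_fail(cgroup_dir):
    """Expect enabling controllers for children to fail while the cgroup
    itself still holds processes."""
    return bool(read_words(Path(cgroup_dir) / "cgroup.procs"))


# Demo on a fake layout: an inner node with one child.
with tempfile.TemporaryDirectory() as tmp:
    inner = Path(tmp) / "example.slice"
    (inner / "payload.scope").mkdir(parents=True)
    (inner / "cgroup.subtree_control").write_text("memory\n")
    (inner / "cgroup.procs").write_text("")
    inner_rejects_pids = attach_would_fail(inner)
    clean_inner_can_delegate = not enabling_controllers_would_fail(inner)
    # Simulate a foreign process having been placed directly in the inner node.
    (inner / "cgroup.procs").write_text("4321\n")
    polluted_inner_can_delegate = not enabling_controllers_would_fail(inner)
```

The second predicate is the situation described above: once a process sits directly in an inner cgroup, setting up children with controllers under it is expected to fail.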

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-03 Thread Felip Moll
Hi folks, I wanted to keep the case as generic as possible but I think it
is important at this point to comment on what we're talking about, so let
me clarify a little bit the case I am dealing with at the moment.

In SchedMD, we want Slurm to support 'Cgroup v2'. As you may know Slurm is
a HPC resource manager, and for the moment we're limited to Cgroup v1. We
actually use the freezer, memory, cpuset, cpuacct and devices controllers
in v1. We think it is already a good time to add a plugin to our software
to make it capable to run on unified systems, and since systemd is widely
used we want to do this integration as best as we can to coexist with
systemd and not get our pids moved or make systemd mad.

We have a 'slurmd' daemon running on every compute node, waiting for
communications from the controller. The controller submits different kinds
of RPCs to slurmd and at one point one RPC can instruct slurmd to start a
new job step for a specific uid. Slurmd then forks twice; the original
slurmd just ends and goes back to other work. The first fork (child) sets a
bunch of pipes and prepares initialization data, then forks again
generating a grandchild. The grandchild finally exec's the slurmstepd
daemon which will be receiving the initialization data, prepare the
cgroups, and finally fork+exec the user software. This can happen many
times in a second because a user can submit a "job array", which with a
single RPC call can submit thousands of steps, while thousands of other
steps can be finishing at the same time, so the work that systemd would
need to do starting up new scopes/services and/or stopping them, plus
monitoring all this stuff, could be considerable.

After this introduction I have to say that we successfully managed to work
following systemd rules by just starting a unit file for slurmd with
Delegate=yes and creating our own hierarchy inside. Every slurmstepd would
be forked and started in the delegated cgroup and would create its
directory and move itself where it belongs to (always in the delegated
cgroup), according to our needs. Everything ran smoothly until I restarted
slurmd while slurmstepds were still running in the cgroup: systemd was
unable to start slurmd again because the cgroup was not deleted, since it
was busy with directories and slurmstepds. That is the main reason for
this bug.

Note that one feature of Slurm is that one can upgrade/restart slurmd
without affecting running jobs (slurmstepds) in the compute node.

I have read and studied all your suggestions and I understand them.
I also did some performance tests in which I fork+executed a systemd-run to
launch a service for every step and I got bad performance overall.
One of our QA tests (test 9.8 of our testsuite) shows a decrease of
performance of 3x.

But, the positive thing is that we did a test to manually fork+exec one new
Delegated separate service when starting up slurmd, and we moved new forked
slurmstepd pids *manually* into the new cgroup associated with the new
service. This service contains a 'sleep infinity' as the main pid to make
the cgroup not disappear even if no slurmstepds are running. As I say, this
is a dirty test, which works.

After reading your last two emails, I think the most efficient way we need
to go is this one:

> Firing an async D-Bus packet to systemd should be hardly measurable.
>
> But note that you can also run your main service as a service, and
> then allocate a *single* scope unit for *all* your payloads. That way
> you can restart your main service unit independently of the scope
> unit, but you only have to issue a single request once for allocating
> the scope, and not for each of your payloads.
>
>
My questions are, where would the scope reside? Does it have an associated
cgroup?
If I am a new slurmstepd, can I attach myself to this scope or must I be
attached by slurmd before being executed?


> But that too means you have to issue a bus call. If you really don't
> like talking to systemd this is not going to work of course, but quite
> frankly, that's a problem you are making yourself, and I am not
> particularly sympathetic to it.
>

I can study this option. It is not that I like or don't like talking to
systemd, but the idea is that Slurm must work in other OSes, possibly
without systemd but still with cgroup v2, and still be compatible with
cgroup v1 and with no cgroup at all. It's thinking about the future, the
less complexity and particularities it has, the more maintainable and
flexible the software is. I think this is understandable, but if this is
not possible at all we will have to adapt.


> > DelegateCgroupLeaf=. If set to yes an extra directory will be
> > created into the unit cgroup to place the newly spawned service process.
> > This is useful for services which need to be restarted while its forked
> > pids remain in the cgroup and the service cgroup is not a leaf
> > anymore.
>
> No. Let's not add that.
>

I could foresee the benefits of such an option, but I can 

Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-03 Thread Lennart Poettering
On Mo, 21.02.22 22:16, Felip Moll (lip...@gmail.com) wrote:

> Silvio,
>
> As I commented in my previous post, creating every single job in a separate
> slice is an overhead I cannot assume.
> An HTC system could run thousands of jobs per second, and doing extra
> fork+execs plus waiting for systemd to fill up its internal structures and
> manage it all is a no-no.

Firing an async D-Bus packet to systemd should be hardly measurable.

But note that you can also run your main service as a service, and
then allocate a *single* scope unit for *all* your payloads. That way
you can restart your main service unit independently of the scope
unit, but you only have to issue a single request once for allocating
the scope, and not for each of your payloads.

But that too means you have to issue a bus call. If you really don't
like talking to systemd this is not going to work of course, but quite
frankly, that's a problem you are making yourself, and I am not
particularly sympathetic to it.

> One other option that I am thinking about is extending the parameters of a
> unit file, for example adding a DelegateCgroupLeaf=yes option.
>
> DelegateCgroupLeaf=. If set to yes an extra directory will be
> created into the unit cgroup to place the newly spawned service process.
> This is useful for services which need to be restarted while its forked
> pids remain in the cgroup and the service cgroup is not a leaf
> anymore.

No. Let's not add that.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-03-03 Thread Lennart Poettering
On Mo, 21.02.22 18:07, Felip Moll (lip...@gmail.com) wrote:

> > That's a bad idea typically, and generally a hack: the unit should
> > probably be split up differently, i.e. the processes that shall stick
> > around on restart should probably be in their own unit, i.e. another
> > service or scope unit.
>
> So, if I understand it correctly you are suggesting that every forked
> process must be started through a new systemd unit?

systemd has two different unit types: services and scopes. Both group
processes in a cgroup. But only services are where systemd actually
forks+execs (i.e. "starts a process"). If you want to fork yourself, that's
fine, then a scope unit is your thing. If you use scope units you do
everything yourself, but as part of your setup you then tell systemd
to move your process into its own scope unit.

> If that's the case it seems inconvenient because we're talking about a job
> scheduler which may sometimes have thousands of forked processes executed
> quickly, and where performance is key.
> Having to manage a unit per each process will probably not work in this
> situation in terms of performance.

You don't really have to "manage" it. You can register a scope unit
asynchronously, it's firing off one dbus message basically at the same
time you fork things off, telling systemd to put it in a new scope unit.

> The other option I can imagine is to start a new unit from my daemon of
> Type=forking, which remains forever until I decide to clean it up even if
> it doesn't have any process inside.
> Then I could put my processes in the associated cgroup instead of inside
> the main daemon cgroup. Would that make sense?

Migrating processes wildly between cgroups is messy, because it fucks
up accounting and is restricted permission-wise. Typically you want to
create a cgroup and populate it, and then stick to that.

> The issue here is that for creating the new unit I'd need my daemon to
> depend on systemd libraries, or to do some fork-exec using systemd commands
> and parsing output.

To allocate a scope unit you'd have to fire off a D-Bus method
call. No need for any systemd libraries.
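
To make the shape of that method call concrete, here is a sketch that only spells out its arguments, written as a busctl command line. The helper name and the unit name in the usage note are hypothetical. A daemon would send the same message through whatever D-Bus library it already uses rather than exec busctl; the command form is just a readable way to show the signature "ssa(sv)a(sa(sv))": unit name, job mode, an array of properties (here PIDs and Delegate), and an empty auxiliary-units array.

```python
import shlex


def start_scope_argv(unit_name, pids, delegate=True):
    """Build a busctl command line calling Manager.StartTransientUnit()."""
    # Each property is: name, variant signature, value(s).
    props = ["PIDs", "au", str(len(pids)), *map(str, pids)]
    n_props = 1
    if delegate:
        props += ["Delegate", "b", "true"]
        n_props += 1
    return [
        "busctl", "call",
        "org.freedesktop.systemd1",
        "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager",
        "StartTransientUnit",
        "ssa(sv)a(sa(sv))",
        unit_name, "fail",      # unit name and job mode
        str(n_props), *props,   # a(sv): the unit properties
        "0",                    # a(sa(sv)): no auxiliary units
    ]


def start_scope_command(unit_name, pids, delegate=True):
    """Same call as a single shell-quoted string, for logging or manual testing."""
    return shlex.join(start_scope_argv(unit_name, pids, delegate))
```

For example, start_scope_command("example_payload.scope", [1234]) yields a command that can be pasted into a root shell to try the call by hand. The method returns the object path of the queued job.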

> I am trying to keep the dependencies at a minimum and I'd love to have an
> alternative.

Sorry, but if you want to rearrange processes in cgroups, or want
systemd to manage your processes orthogonal to the service concept you
have to talk to systemd.

> Yeah, I know and understand it is not supported, but I am more interested
> in the technical part of how things would break.
> I see in systemd/src/core/cgroup.c that it often differentiates a cgroup
> with delegation with one without it (!unit_cgroup_delegate(u)), but it's
> hard for me to find out how or where this exactly will mess up with any
> cgroup created outside of systemd. I'd appreciate it if you can give me
> some light on why/when/where things will break in practice, or just an
> example?

This depends highly on what precisely you do. At best systemd will
complain or just override the changes you did outside of the tree you
got delegated. You might break systemd as a whole though (for example,
add a process directly to a slice's cgroup and systemd will be very
sad).

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Silvio Knizek
Am Montag, dem 21.02.2022 um 22:16 +0100 schrieb Felip Moll:
> Silvio,
>
> As I commented in my previous post, creating every single job in a
> separate slice is an overhead I cannot assume.
> An HTC system could run thousands of jobs per second, and doing extra
> fork+execs plus waiting for systemd to fill up its internal
> structures and manage it all is a no-no.
And how about an xinetd-style daemon, accepting connections and
spawning processes that way?
So instead of sgamba1.service you would have a sgamba1@.service and a
sgamba1.socket, spawning sgamba1@user1.service, sgamba1@user2.service,
etc. units.
So even if one user process dies, nothing else dies. And the setup
overhead would only be incurred once every time a user creates a new
connection.
So they can still drop their one million jobs and you still have user
isolation.


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Barry


> On 21 Feb 2022, at 21:16, Felip Moll  wrote:
> 
> 
> 
>> You could invoke a man:systemd-run for each new process. Then you can
>> put every single job in a separate .slice with its own
>> man:systemd.resource-control applied.
>> This would also mean that you don't need to compile against libsystemd.
>> Just exec() accordingly if a systemd-system is detected.
>> 
>> BR
>> Silvio
> 
> Silvio,
> 
> As I commented in my previous post, creating every single job in a separate 
> slice is an overhead I cannot assume.
> An HTC system could run thousands of jobs per second, and doing extra 
> fork+execs plus waiting for systemd to fill up its internal structures and 
> manage it all is a no-no.

Are you assuming this or did you measure the cost?

Barry

> 
> One other option that I am thinking about is extending the parameters of a 
> unit file, for example adding a DelegateCgroupLeaf=yes option.
> 
> DelegateCgroupLeaf=. If set to yes an extra directory will be created 
> into the unit cgroup to place the newly spawned service process. This is 
> useful for services which need to be restarted while its forked pids remain 
> in the cgroup and the service cgroup is not a leaf anymore. This option is 
> only valid when using Delegate=yes and under a system in unified mode.
> 
> E.g. in my example, that would end up like this:
> /sys/fs/cgroup/system.slice/sgamba1.service   <-- This is Delegate=yes DelegateMultiCgroups=yes
> ├── sgamba1   <-- The spawned process would always be put in here by systemd.
> ├── user1_stuff
> ├── user2_stuff
> └── user3_stuff
> 
> I think this idea could work for cases like the one exposed here, and I see 
> this would be quite useful.
>  


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Felip Moll
> You could invoke a man:systemd-run for each new process. Then you can
> put every single job in a separate .slice with its own
> man:systemd.resource-control applied.
> This would also mean that you don't need to compile against libsystemd.
> Just exec() accordingly if a systemd-system is detected.
>
> BR
> Silvio
>

Silvio,

As I commented in my previous post, creating every single job in a separate
slice is an overhead I cannot assume.
An HTC system could run thousands of jobs per second, and doing extra
fork+execs plus waiting for systemd to fill up its internal structures and
manage it all is a no-no.

One other option that I am thinking about is extending the parameters of a
unit file, for example adding a DelegateCgroupLeaf=yes option.

DelegateCgroupLeaf=. If set to yes an extra directory will be
created into the unit cgroup to place the newly spawned service process.
This is useful for services which need to be restarted while its forked
pids remain in the cgroup and the service cgroup is not a leaf anymore.
This option is only valid when using Delegate=yes and under a system in
unified mode.

E.g. in my example, that would end up like this:
/sys/fs/cgroup/system.slice/sgamba1.service   <-- This is Delegate=yes DelegateMultiCgroups=yes
├── sgamba1   <-- The spawned process would always be put in here by systemd.
├── user1_stuff
├── user2_stuff
└── user3_stuff

I think this idea could work for cases like the one exposed here, and I see
this would be quite useful.


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Silvio Knizek
Am Montag, dem 21.02.2022 um 18:07 +0100 schrieb Felip Moll:
> The hard requirement that my project has is that processes need to
> live even if the daemon who forked them dies.
> Roughly it is how a batch scheduler works: one controller sends a
> request to my daemon for launching a process in the name of a user,
> my daemon forks-exec it. At some point my daemon can be stopped,
> restarted, upgraded, whatever but the forked processes need to always
> be alive because they are continuing their work. We are talking here
> about the HPC world.
You could invoke a man:systemd-run for each new process. Then you can
put every single job in a separate .slice with its own
man:systemd.resource-control applied.
This would also mean that you don't need to compile against libsystemd.
Just exec() accordingly if a systemd-system is detected.

BR
Silvio


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Felip Moll
>
> Hmm? Hard requirement of what? Not following?
>
>
The hard requirement that my project has is that processes need to live
even if the daemon who forked them dies.
Roughly it is how a batch scheduler works: one controller sends a request
to my daemon for launching a process in the name of a user, my daemon
forks-exec it. At some point my daemon can be stopped, restarted, upgraded,
whatever but the forked processes need to always be alive because they are
continuing their work. We are talking here about the HPC world.


> You are leaving processes around when your service dies/restarts?
>

Yes.


> That's a bad idea typically, and generally a hack: the unit should
> probably be split up differently, i.e. the processes that shall stick
> around on restart should probably be in their own unit, i.e. another
> service or scope unit.
>

So, if I understand it correctly you are suggesting that every forked
process must be started through a new systemd unit?
If that's the case it seems inconvenient because we're talking about a job
scheduler which may sometimes have thousands of forked processes executed
quickly, and where performance is key.
Having to manage a unit per each process will probably not work in this
situation in terms of performance.

The other option I can imagine is to start a new unit from my daemon of
Type=forking, which remains forever until I decide to clean it up even if
it doesn't have any process inside.
Then I could put my processes in the associated cgroup instead of inside
the main daemon cgroup. Would that make sense?

The issue here is that for creating the new unit I'd need my daemon to
depend on systemd libraries, or to do some fork-exec using systemd commands
and parsing output.
I am trying to keep the dependencies at a minimum and I'd love to have an
alternative.


> That's not supported. You may only create your own cgroups where you
> turned on delegation, otherwise all bets are off. If you put stuff in
> /sys/fs/cgroup/user-stuff it's as if you placed stuff in systemd's
> "-.slice" without telling it so, and things will break sooner or
> later, and often in non-obvious ways.
>

Yeah, I know and understand it is not supported, but I am more interested
in the technical part of how things would break.
I see in systemd/src/core/cgroup.c that it often differentiates a cgroup
with delegation with one without it (!unit_cgroup_delegate(u)), but it's
hard for me to find out how or where this exactly will mess up with any
cgroup created outside of systemd. I'd appreciate it if you can give me
some light on why/when/where things will break in practice, or just an
example?

I am also aware of the single-writer policy that systemd has in its
documentation, and I am aware that this is not supported, but I'd like to
understand exactly what can happen.


Thanks for your help & time :)


Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart

2022-02-21 Thread Lennart Poettering
On Mo, 21.02.22 14:14, Felip Moll (lip...@gmail.com) wrote:

> Hello,
>
> I am creating a software which consists of one daemon which forks several
> processes from user requests.
> This is basically acting like a job scheduler.
>
> The daemon is started using a unit file and with Delegate=yes option,
> because every process must be constrained differently. I manage my cgroup
> hierarchy, create some leaves into the tree and put each pid there.
> For example, after starting up the service and receiving 3 user requests, a
> tree under /sys/fs/cgroup/system.slice/ could look like:
>
> sgamba1.service/
> ├── daemon_pid
> ├── user1_stuff
> ├── user2_stuff
> └── user3_stuff
>
> I create the hierarchy and set cgroup.subtree_control in the root directory
> (sgamba1.service in the example) and everything runs smoothly, until when I
> decide to restart my service.
>
> The service then cannot restart:
>
> feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed to attach to
> cgroup /system.slice/sgamba1.service: Device or resource busy
> feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed at step
> CGROUP spawning /path_to_bin/mydaemond: Device or resource busy
>
> This is because systemd tries to put the pid of the new daemon in
> sgamba1.service/cgroup.procs and this would break the "no internal process
> constraint" rule for cgroup v2, since sgamba1.service is not a leaf anymore
> because it has subtree_control enabled for the user stuff.
>
> One hard requirement is that user stuff must live even if the service is
> restarted.

Hmm? Hard requirement of what? Not following?

You are leaving processes around when your service dies/restarts?
That's a bad idea typically, and generally a hack: the unit should
probably be split up differently, i.e. the processes that shall stick
around on restart should probably be in their own unit, i.e. another
service or scope unit.

> What's the way to achieve that? I see one easy way, which is to move user
> stuff into its own cgroup and out of sgamba1.service/, but then it will run
> outside a Delegate=yes unit. What can happen then?
> Will systemd eventually migrate my processes?
> How do services workaround that issue?
> If I am moving user stuff into the root /sys/fs/cgroup/user_stuff/, will
> systemd touch my directories?

That's not supported. You may only create your own cgroups where you
turned on delegation, otherwise all bets are off. If you put stuff in
/sys/fs/cgroup/user-stuff it's as if you placed stuff in systemd's
"-.slice" without telling it so, and things will break sooner or
later, and often in non-obvious ways.

Lennart

--
Lennart Poettering, Berlin