Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Do, 24.03.22 14:32, Benjamin Berg (benja...@sipsolutions.net) wrote: > HI, > > On Thu, 2022-03-24 at 12:40 +0100, Felip Moll wrote: > > False, the JobRemoved signal returns the id, job, unit and result. To > > wait for JobRemoved only needs a matching rule for this signal. The > > matching rule can just contain the path. In fact, nothing else than > > strings can be matched in a rule, so I may be only able to use the > > path. > > I think you need to add a wildcard match before the job is created > (i.e. before StartTransientUnit). Otherwise registering the match rule > (using the job's object path) will race with systemd signalling that > the job has completed. Correct. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Do, 24.03.22 00:45, Felip Moll (fe...@schedmd.com) wrote:

> Hi, some days ago we were talking about this:
>
> > > Problem number two, there's a significant delay since when creating the
> > > scope, until it is ready and the pid attached into it. The only way it
> > > worked was to put a 'sleep' after the dbus call and make my process wait
> > > for the async call to dbus to be materialized. This is really
> > > un-elegant.
> >
> > If you want to synchronize in the cgroup creation to complete just
> > wait for the JobRemoved bus signal for the job returned by
> > StartTransientUnit().
>
> StartTransientUnit returns a string to a job object path. To call
> JobRemoved I need the job id, so the easier way to get it is to strip the
> last part of the returned string from StartTransientUnit job object path.
> Am I right?

JobRemoved is a signal, not a method call, i.e. not something you call, but something you are notified about. And it originates from an object, and objects have object paths in D-Bus.

> Once I have the job id, I can then subscribe to JobRemoved bus signal for
> the recently created job, but what happens if during the time I am
> obtaining the ID or parsing the output, the job is finished? Will I lose
> the signal?

Yes. D-Bus sucks that way. You have to subscribe to all jobs first, and then filter out the ones you don't want.

> What is the correct order of doing a StartTransientUnit and wait for the
> job to be finished (done, failed, whatever) ?

First subscribe to JobRemoved, then issue StartTransientUnit, and then wait until you see JobRemoved for the unit you just started.

Lennart

--
Lennart Poettering, Berlin
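[Editor's sketch] The ordering described above (subscribe, start, then filter) can be modelled without a real bus. Everything below is hypothetical: FakeManager is an in-process stand-in and not a D-Bus binding, and the job path layout /org/freedesktop/systemd1/job/<id> is assumed only for illustration. The fake deliberately completes the job before start_transient_unit() returns, which is the worst case the thread worries about.

```python
import queue


class FakeManager:
    """Hypothetical in-process stand-in for systemd's Manager object (no real D-Bus)."""

    def __init__(self):
        self._subscribers = []
        self._next_job_id = 1

    def subscribe_job_removed(self):
        """Broad subscription: every later JobRemoved signal lands in the returned queue."""
        q = queue.Queue()
        self._subscribers.append(q)
        return q

    def _emit_job_removed(self, job_id, job_path, unit, result):
        # JobRemoved carries (id, job, unit, result), as noted in the thread.
        for q in self._subscribers:
            q.put((job_id, job_path, unit, result))

    def start_transient_unit(self, unit):
        """Return a job object path; here the job finishes before the call returns."""
        job_id = self._next_job_id
        self._next_job_id += 1
        job_path = f"/org/freedesktop/systemd1/job/{job_id}"  # assumed layout
        self._emit_job_removed(job_id, job_path, unit, "done")
        return job_path


def start_and_wait(manager, unit, timeout=1.0):
    signals = manager.subscribe_job_removed()      # 1. subscribe first
    job_path = manager.start_transient_unit(unit)  # 2. then start the unit
    while True:                                    # 3. skip signals for other jobs
        _job_id, path, _unit, result = signals.get(timeout=timeout)
        if path == job_path:
            return result
```

A subscriber that registers only after start_transient_unit() has returned sees an empty queue in this model, which is the lost-signal case from the question.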
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
Hi,

On Thu, 2022-03-24 at 12:40 +0100, Felip Moll wrote:
> False, the JobRemoved signal returns the id, job, unit and result. To
> wait for JobRemoved only needs a matching rule for this signal. The
> matching rule can just contain the path. In fact, nothing else than
> strings can be matched in a rule, so I may be only able to use the
> path.

I think you need to add a wildcard match before the job is created (i.e. before StartTransientUnit). Otherwise registering the match rule (using the job's object path) will race with systemd signalling that the job has completed.

Benjamin
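[Editor's sketch] One way to read "wildcard match" is a rule that selects every JobRemoved signal emitted by the manager object, with the per-job filtering done afterwards in the client. The helper names below are hypothetical; the rule uses the key='value' syntax of D-Bus match rules, and the bus name, interface and object path are the usual systemd ones, stated here as assumptions rather than taken from the thread.

```python
def job_removed_match_rule():
    """Build a match rule that selects every JobRemoved signal from the manager."""
    fields = {
        "type": "signal",
        "sender": "org.freedesktop.systemd1",
        "interface": "org.freedesktop.systemd1.Manager",
        "member": "JobRemoved",
        "path": "/org/freedesktop/systemd1",  # path of the emitting object, not of the job
    }
    return ",".join(f"{key}='{value}'" for key, value in fields.items())


def is_my_job(signal_args, job_path):
    """Client-side filter: compare the signal's job path with the one we were given."""
    _job_id, path, _unit, _result = signal_args
    return path == job_path
```

The rule string would be registered before StartTransientUnit is issued; is_my_job() is then applied to each received signal once the job path is known.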
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
I'll answer the first part of my own question.

>> If you want to synchronize in the cgroup creation to complete just
>> wait for the JobRemoved bus signal for the job returned by
>> StartTransientUnit().
>
> StartTransientUnit returns a string to a job object path. To call
> JobRemoved I need the job id, so the easier way to get it is to strip the
> last part of the returned string from StartTransientUnit job object path.
> Am I right?

False, the JobRemoved signal returns the id, job, unit and result. Waiting for JobRemoved only needs a matching rule for this signal. The matching rule can just contain the path. In fact, nothing other than strings can be matched in a rule, so I may only be able to use the path.
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
Hi, some days ago we were talking about this: > > Problem number two, there's a significant delay since when creating the > > scope, until it is ready and the pid attached into it. The only way it > > worked was to put a 'sleep' after the dbus call and make my process wait > > for the async call to dbus to be materialized. This is really > > un-elegant. > > If you want to synchronize in the cgroup creation to complete just > wait for the JobRemoved bus signal for the job returned by > StartTransientUnit(). > > StartTransientUnit returns a string to a job object path. To call JobRemoved I need the job id, so the easier way to get it is to strip the last part of the returned string from StartTransientUnit job object path. Am I right? Once I have the job id, I can then subscribe to JobRemoved bus signal for the recently created job, but what happens if during the time I am obtaining the ID or parsing the output, the job is finished? Will I lose the signal? What is the correct order of doing a StartTransientUnit and wait for the job to be finished (done, failed, whatever) ?
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Wed, Mar 16, 2022 at 6:29 PM Felip Moll wrote:

> On Wed, Mar 16, 2022 at 5:53 PM Lennart Poettering wrote:
>
>> On Mi, 16.03.22 17:30, Felip Moll (fe...@schedmd.com) wrote:
>>
>> > AFAIK RemainAfterExit for services actually does cleanup the cgroup tree if
>> > there are no more processes in it.
>>
>> It doesn't do that if delegation is on (iirc, if not I'd consider that
>> a bug). Same logic should apply here.
>
> I will recheck that, but I am quite sure that on some tests I did the
> cgroup was cleaned up on a delegated service after the main pid terminated.

Is that a bug then?

1. Start a service with Delegate=yes and RemainAfterExit
2. Wait for the main process to start
3. Check that the unit is still active
4. Check that the cgroup is still there <--- It is gone when no pids in it

]# systemd-run -u test -p "Delegate=yes" -p "RemainAfterExit=yes" sleep 60
Running as unit: test.service

]# systemctl status test.service
● test.service - /usr/bin/sleep 60
   Loaded: loaded (/run/systemd/transient/test.service; transient)
Transient: yes
   Active: active (running) since Fri 2022-03-18 09:47:32 CET; 5s ago
 Main PID: 6083 (sleep)
    Tasks: 1 (limit: 14068)
   Memory: 316.0K
   CGroup: /system.slice/test.service
           └─6083 /usr/bin/sleep 60

de març 18 09:47:32 llagosti systemd[1]: Started /usr/bin/sleep 60.

]# cat /proc/6083/cgroup
12:perf_event:/
11:pids:/system.slice/test.service
10:devices:/system.slice/test.service
9:cpuset:/
8:blkio:/system.slice/test.service
7:net_cls,net_prio:/
6:memory:/system.slice/test.service
5:misc:/
4:cpu,cpuacct:/system.slice/test.service
3:hugetlb:/
2:freezer:/
1:name=systemd:/system.slice/test.service
0::/system.slice/test.service

]# ls /sys/fs/cgroup/memory/system.slice/test.service/
cgroup.clone_children memory.kmem.failcnt memory.kme... ..

[root@llagosti slurm.gitlab.lipixx]# systemctl status test.service
● test.service - /usr/bin/sleep 60
   Loaded: loaded (/run/systemd/transient/test.service; transient)
Transient: yes
   Active: active (exited) since Fri 2022-03-18 09:47:32 CET; 1min 21s ago
  Process: 6083 ExecStart=/usr/bin/sleep 60 (code=exited, status=0/SUCCESS)
 Main PID: 6083 (code=exited, status=0/SUCCESS)

de març 18 09:47:32 llagosti systemd[1]: Started /usr/bin/sleep 60.

]# ls /sys/fs/cgroup/memory/system.slice/test.service/
ls: cannot access '/sys/fs/cgroup/memory/system.slice/test.service/': No such file or directory

]# systemctl cat test.service
# /run/systemd/transient/test.service
# This is a transient unit file, created programmatically via the systemd API. Do not edit.
[Unit]
Description=/usr/bin/sleep 60

[Service]
Delegate=yes
RemainAfterExit=yes
ExecStart=
ExecStart="/usr/bin/sleep" "60"
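[Editor's sketch] The manual check above (read /proc/PID/cgroup, then look for the directory under /sys/fs/cgroup) could be scripted roughly as follows. Both helper names are hypothetical. The parser assumes the "hierarchy-ID:controller-list:path" line format visible in the transcript, and the existence check assumes the per-controller mount layout shown there (/sys/fs/cgroup/memory/...).

```python
from pathlib import Path


def parse_proc_cgroup(text):
    """Map each controller list ('' for the unified hierarchy) to its cgroup path."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most twice so that colons inside the path are preserved.
        _hierarchy_id, controllers, path = line.split(":", 2)
        result[controllers] = path
    return result


def cgroup_dir_exists(controller, cgroup_path, root="/sys/fs/cgroup"):
    """Return True if the cgroup directory for this controller is still present."""
    return (Path(root) / controller / cgroup_path.lstrip("/")).is_dir()
```

For example, parse_proc_cgroup(open("/proc/6083/cgroup").read())["memory"] would give the path to pass to cgroup_dir_exists("memory", ...) while the process is still alive.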
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Wed, Mar 16, 2022 at 5:53 PM Lennart Poettering wrote: > On Mi, 16.03.22 17:30, Felip Moll (fe...@schedmd.com) wrote: > > > AFAIK RemainAfterExit for services actually does cleanup the cgroup tree > if > > there are no more processes in it. > > It doesn't do that if delegation is on (iirc, if not I'd consider that > a bug). Same logic should apply here. > > I will recheck that, but I am quite sure that on some tests I did the cgroup was cleaned up on a delegated service after the main pid terminated. > > If that behavior of keeping the cgroup tree even if there are no pids is > > what you agree with, then I coincide is a good idea to include this > option > > to scopes. > > Yes, that is what I was suggesting this would do. > > Excellent. > > Or are you saying that I can just migrate processes wildly without > > informing systemd and just doing an 'echo > cgroup.procs' from one > > non-delegated tree to my delegated subtree? > > yeah, you can do that. > > Ok, so I understood that incorrectly from a former paragraph you wrote in our first e-mails. you said: > Migrating processes wildly between cgroups is messy, because it fucks > up accounting and is restricted permission-wise. Typically you want to > create a cgroup and populate it, and then stick to that. There, I understood you were referring to "systemd" accounting, not "kernel" accounting. This has been a big misunderstanding for this issue. > Note that (independently of systemd) you shouldn't migrate stuff to > aggressively, since it fucks up kernel resource accounting. i.e. it is > wise to minimize process migration in cgroups and always migrate plus > shortly after exec() > Yeah, that makes sense and I am aware of it. I am migrating before any real work is done, exactly as you describe. I will continue a bit more with this and inform you on what I see, but we seem to be close to a solution. Thank you!.
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mi, 16.03.22 17:35, Michal Koutný (mkou...@suse.com) wrote: > True, in the unified mode it should be safe doing manually. > I was worried about migrating e.g. MainPID of a service into this scope > but PID1 should handle that AFAICS. Also since this has to be performed > by the privileged user (scopes are root's), the manual migration works. This is actually a common case: for getty style login process the main process of the getty service will migrate to the new scope. A service is thus always a cgroup *and* a main pid for us, in case the main pid is outside of the cgroup. And conversely, a process can be associated to multiple units this way. It can be main pid of one service and be in a cgroup of a scope. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mi, 16.03.22 17:30, Felip Moll (fe...@schedmd.com) wrote: > > > (The above is slightly misleading) there could be an alternative of > > > something like RemainAfterExit=yes for scopes, i.e. such scopes would > > > not be stopped after last process exiting (but systemd would still be in > > > charge of cleaning the cgroup after explicit stop request and that'd > > > also mark the scope as truly stopped). > > > > Yeah, I'd be fine with adding RemainAfterExit= to scope units > > > > > Note that what Michal is saying is "something like RemainAfterExit=yes for > scopes", which means systemd would NOT clean up the cgroup tree when there > are no processes inside. > AFAIK RemainAfterExit for services actually does cleanup the cgroup tree if > there are no more processes in it. It doesn't do that if delegation is on (iirc, if not I'd consider that a bug). Same logic should apply here. > If that behavior of keeping the cgroup tree even if there are no pids is > what you agree with, then I coincide is a good idea to include this option > to scopes. Yes, that is what I was suggesting this would do. > > > Such a recycled scope would only be useful via > > > org.freedesktop.systemd1.Manager.AttachProcessesToUnit(). > > > > Well, if delegation is on, then people don#t really have to use our > > API, they can just do that themselves. > > That's not exact. If slurmd (my main process) forks a slurmstepd (child > process) and I want to move slurmstepd into a delegated subtree from the > scope I already created, I must use AttachProcessesToUnit(), isn't that > true? depends on your privs. You can just move it yourself if you have enough privs. See commit msg in 6592b9759cae509b407a3b49603498468bf5d276 > Or are you saying that I can just migrate processes wildly without > informing systemd and just doing an 'echo > cgroup.procs' from one > non-delegated tree to my delegated subtree? yeah, you can do that. 
Note that (independently of systemd) you shouldn't migrate stuff too aggressively, since it fucks up kernel resource accounting. I.e. it is wise to minimize process migration in cgroups and always migrate plus shortly after exec(), or even better do a clone(CLONE_INTO_CGROUP), though unfortunately the latter cannot work with glibc right now :-(. I.e. keeping processes that already "have history" around for a long time after migration kinda sucks.

Lennart

--
Lennart Poettering, Berlin
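[Editor's sketch] The manual migration mentioned above ('echo > cgroup.procs') amounts to writing a single PID into the cgroup.procs file of the target directory. The helper name and the example path are hypothetical. On a real cgroup filesystem the write needs sufficient privileges and should happen early in the process's life, as the message says.

```python
from pathlib import Path


def migrate_pid(cgroup_dir, pid):
    """Write one PID to cgroup.procs, the equivalent of `echo PID > cgroup.procs`."""
    procs = Path(cgroup_dir) / "cgroup.procs"
    # One PID per write; the kernel moves that process into this cgroup.
    procs.write_text(f"{int(pid)}\n")
    return procs
```

A call could look like migrate_pid("/sys/fs/cgroup/system.slice/example.scope/job_1", child_pid), where the path is only a placeholder for a directory inside a delegated subtree.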
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Wed, Mar 16, 2022 at 05:06:28PM +0100, Lennart Poettering wrote: > > That owner would be a process -- bang, you created a service with > > delegation or a scope with "keepalive" process. > > can't parse this. That was meant as a humorous proof by contradiction that delegation on slices is unnecessary. Nvm. > > (The above is slightly misleading) there could be an alternative of > > something like RemainAfterExit=yes for scopes, i.e. such scopes would > > not be stopped after last process exiting (but systemd would still be in > > charge of cleaning the cgroup after explicit stop request and that'd > > also mark the scope as truly stopped). > > Yeah, I'd be fine with adding RemainAfterExit= to scope units Felip, I'd happily review such a PR ;-) > > Such a recycled scope would only be useful via > > org.freedesktop.systemd1.Manager.AttachProcessesToUnit(). > > Well, if delegation is on, then people don#t really have to use our > API, they can just do that themselves. True, in the unified mode it should be safe doing manually. I was worried about migrating e.g. MainPID of a service into this scope but PID1 should handle that AFAICS. Also since this has to be performed by the privileged user (scopes are root's), the manual migration works. Michal
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
> > (The above is slightly misleading) there could be an alternative of
> > something like RemainAfterExit=yes for scopes, i.e. such scopes would
> > not be stopped after last process exiting (but systemd would still be in
> > charge of cleaning the cgroup after explicit stop request and that'd
> > also mark the scope as truly stopped).
>
> Yeah, I'd be fine with adding RemainAfterExit= to scope units

Note that what Michal is saying is "something like RemainAfterExit=yes for scopes", which means systemd would NOT clean up the cgroup tree when there are no processes inside. AFAIK RemainAfterExit for services actually does clean up the cgroup tree if there are no more processes in it. If keeping the cgroup tree even when there are no pids is the behavior you agree with, then I agree it is a good idea to add this option to scopes.

> > Such a recycled scope would only be useful via
> > org.freedesktop.systemd1.Manager.AttachProcessesToUnit().
>
> Well, if delegation is on, then people don't really have to use our
> API, they can just do that themselves.

That's not exactly true. If slurmd (my main process) forks a slurmstepd (child process) and I want to move slurmstepd into a delegated subtree of the scope I already created, I must use AttachProcessesToUnit(), isn't that true? Or are you saying that I can just migrate processes wildly without informing systemd, just doing an 'echo > cgroup.procs' from one non-delegated tree to my delegated subtree?
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mi, 16.03.22 16:15, Felip Moll (fe...@schedmd.com) wrote: > On Tue, Mar 15, 2022 at 5:24 PM Michal Koutný wrote: > > > On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll > > wrote: > > > Meaning that it would be great to have a delegated cgroup subtree without > > > the need of a service or scope. > > > Just an empty subtree. > > > > It looks appealing to add Delegate= directive to slice units. > > Firstly, that'd prevent the use of the slice by anything systemd. > > Then some notion of owner of that subtree would have to be defined (if > > only for cleanup). > > That owner would be a process -- bang, you created a service with > > delegation or a scope with "keepalive" process. > > > > > Correct, this is how the current systemd design works. > But... what if the concept of owner was irrelevant? What if we could just > tell systemd, hey, give me /sys/fs/cgroup/mysubdir and never ever touch it > or do anything to it or pids residing into it. No, that's not something we will offer. We bind a lot of meaning to the cgroup concept. i.e. we derive unit info from it, and many things are based on that. For example any client logging to journald will do so from a cgroup and we pick that up to know which service logging is from, and store that away and use it for filtering, for picking per-unit log settings and so on. Moreover we need to be able to shutdown all processes on the system in a systematic way for shutdown, and we do that based on units, and the ordering between them. Having processes and cgroups that live entirely independent makes a total mess from this. And there's a lot more, like resource mgmt: we want that all processes on the system are placed in a unit of some form so that we can apply useful resource mgmt to it. So yes you can have a delegated subtree, if you like and we'll not interfere with what you do there mostly, but it must be a leaf of our tree, and we'll "macro manage" it for you, i.e. define a lifetime for it, and track processes back to it. 
Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Di, 15.03.22 17:24, Michal Koutný (mkou...@suse.com) wrote:

> On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll wrote:
> > Meaning that it would be great to have a delegated cgroup subtree without
> > the need of a service or scope.
> > Just an empty subtree.
>
> It looks appealing to add Delegate= directive to slice units.

Hm? Slice units are *inner* nodes of *our* cgroup tree. If we allowed delegation of that, then we could not put stuff inside it, hence it wouldn't be a slice because it couldn't contain anything anymore.

> Firstly, that'd prevent the use of the slice by anything systemd.

Yeah, precisely? I don't follow. What would a slice with delegation be that a scope with delegation isn't already?

> Then some notion of owner of that subtree would have to be defined (if
> only for cleanup).

Scopes already have that, so why not use that?

> That owner would be a process -- bang, you created a service with
> delegation or a scope with "keepalive" process.

Can't parse this.

> (The above is slightly misleading) there could be an alternative of
> something like RemainAfterExit=yes for scopes, i.e. such scopes would
> not be stopped after last process exiting (but systemd would still be in
> charge of cleaning the cgroup after explicit stop request and that'd
> also mark the scope as truly stopped).

Yeah, I'd be fine with adding RemainAfterExit= to scope units.

> Such a recycled scope would only be useful via
> org.freedesktop.systemd1.Manager.AttachProcessesToUnit().

Well, if delegation is on, then people don't really have to use our API, they can just do that themselves.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Di, 15.03.22 16:35, Felip Moll (fe...@schedmd.com) wrote: > > I don't follow. You can enable delegation on the scope. I mean, that's > > the reason I suggested to use a scope. > > > > > Meaning that it would be great to have a delegated cgroup subtree without > the need of a service or scope. > Just an empty subtree. That's what a scope is. I don't follow? What do you think a scope is beyond that? It just encapsulates a cgroup subtree. It auto-cleans it though once it goes empty, and because it does that it also requires you to provide at least one PID to add to the scope when it is created. For services we have a RemainAfterExit= property btw. There were requests for adding the same for scopes. I'd be fine with adding that, happy to take a patch. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Tue, Mar 15, 2022 at 5:24 PM Michal Koutný wrote: > On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll > wrote: > > Meaning that it would be great to have a delegated cgroup subtree without > > the need of a service or scope. > > Just an empty subtree. > > It looks appealing to add Delegate= directive to slice units. > Firstly, that'd prevent the use of the slice by anything systemd. > Then some notion of owner of that subtree would have to be defined (if > only for cleanup). > That owner would be a process -- bang, you created a service with > delegation or a scope with "keepalive" process. > > Correct, this is how the current systemd design works. But... what if the concept of owner was irrelevant? What if we could just tell systemd, hey, give me /sys/fs/cgroup/mysubdir and never ever touch it or do anything to it or pids residing into it. > (The above is slightly misleading) there could be an alternative of > something like RemainAfterExit=yes for scopes, i.e. such scopes would > not be stopped after last process exiting (but systemd would still be in > charge of cleaning the cgroup after explicit stop request and that'd > also mark the scope as truly stopped). > Such a recycled scope would only be useful via > org.freedesktop.systemd1.Manager.AttachProcessesToUnit(). > > This is also a good idea. > BTW I'm also wondering how do you detect a job finishing in the case > original parent is gone (due to main service restart) and job's main > process reparented? > > slurmstepd connects to slurmd through socket and sends an RPC. If slurmd is gone, slurmstepd (child) will retry the RPC and remain until slurmd appears again and responds. The main process doesn't wait for their child, but instead we do a double fork to make the child be parented by init process 1. > BTW 2 You didn't like having a scope for each job. Is it because of the > setup time (IOW jobs are short-lived) or persistent scopes overhead (too > many units, PID1 scalability)? 
> It is not that I didn't like it. It is that I observed a delay in step creation (fork slurmstepd) because sending an async dbus message required the stepd to wait for the systemd job to be executed, and it can take time; computationally a lot more than just a mkdir on the cgroup subtree. Just to put an example, a 'srun hostname' command starts a job which runs a hostname. Response is instantaneous with mkdir's but it takes almost 1 second with a call to systemd through dbus. Slurm is used for HPC, but also for HTC (High Throughput Computing), which means hundreds of jobs can be started in a short period of time, so yes, this delay is critical, and not only because jobs can be short-lived, but there can be a massive job finish + job start at the same time. I just ran one test of our regression and 'systemctl list-unit-files' responsiveness was compromised. Also from the point of view of a sysadmin this was not ideal, so as you say scalability of PID1 is also a concern. This is the reason I will not be using 1 scope per job, and I prefer the other solution to have 1 single scope with Delegate=yes. Does it make sense?
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll wrote: > Meaning that it would be great to have a delegated cgroup subtree without > the need of a service or scope. > Just an empty subtree. It looks appealing to add Delegate= directive to slice units. Firstly, that'd prevent the use of the slice by anything systemd. Then some notion of owner of that subtree would have to be defined (if only for cleanup). That owner would be a process -- bang, you created a service with delegation or a scope with "keepalive" process. (The above is slightly misleading) there could be an alternative of something like RemainAfterExit=yes for scopes, i.e. such scopes would not be stopped after last process exiting (but systemd would still be in charge of cleaning the cgroup after explicit stop request and that'd also mark the scope as truly stopped). Such a recycled scope would only be useful via org.freedesktop.systemd1.Manager.AttachProcessesToUnit(). BTW I'm also wondering how do you detect a job finishing in the case original parent is gone (due to main service restart) and job's main process reparented? BTW 2 You didn't like having a scope for each job. Is it because of the setup time (IOW jobs are short-lived) or persistent scopes overhead (too many units, PID1 scalability)? Michal
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
> It's shown as active, so where is the problem? > > I have found the problem. I start my main process (slurmd) on a terminal, which then forks-exec a /bin/sleep infinity and creates a new scope adding the pid of the sleep. If the slurmd is terminated with ctrl+c then the child processes die, so the scope is destroyed. So I need to daemonize the sleep. Or... use a service directly.
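[Editor's sketch] One possible way to keep the stub process alive when the launching program is interrupted from a terminal is to start it in its own session, so that a Ctrl+C aimed at the terminal's foreground process group does not reach it. This is an assumption about the cause described above, not something confirmed in the thread, and spawn_stub() and in_own_session() are hypothetical helper names.

```python
import os
import subprocess


def spawn_stub(argv=("sleep", "infinity")):
    """Start a placeholder process detached from the caller's session and terminal."""
    return subprocess.Popen(
        list(argv),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # child calls setsid() before exec
    )


def in_own_session(proc):
    """True if the child leads its own session, i.e. it left the caller's session."""
    return os.getsid(proc.pid) == proc.pid
```

The PID returned by spawn_stub().pid is what would then be handed to the scope at creation time.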
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Tue, Mar 15, 2022 at 1:29 PM Lennart Poettering wrote: > On Mo, 14.03.22 23:12, Felip Moll (fe...@schedmd.com) wrote: > > > > But note that you can also run your main service as a service, and > > > then allocate a *single* scope unit for *all* your payloads. > > > > The main issue is the scope needs a pid attached to it. I thought that > the > > scope could live without any process inside, but that's not happening. > > So every time a user step/job finishes, my main process must take care of > > it, and launch the scope again on the next coming job. > > Leave a stub process around in it. i.e something similar to > "/bin/sleep infinity". > > Ok.. this was my initial idea. > > The forked process just does the dbus call, and when the scope is ready > it > > is moved to the corresponding cgroup (PIDFile=). > > Hmm? PIDFile= is a property of *services*, not *scopes*. > > Sorry I meant PIDs, not PIDFile of course. > And "scopes" cannot be moved to "cgroups". I cannot parse the above. > > The forked process X does the dbus call to start the scope with PIDs=$(pidof X), and when the scope is ready, X is moved into the scope cgroup. > Did you read up on scopes and services? > > See https://systemd.io/CGROUP_DELEGATION/, it explains the concept of > "scopes". Scopes *have* cgroups, but cannot be moved to "cgroups". > > Yes, it was a misunderstanding of my previous sentence. > > Problem number one: if other processes are in the scope, the dbus call > > won't work since I am using the same name all the time, e.g. > > slurmstepd.scope. > > So I first need to check if the scope exists and if so put the new > > slurmstepd process inside. But we still have the race condition, if > during > > this phase all steps ends, systemd will do the cleanup. > > Leave a stub process around in it. Ok, then I don't see the real difference of starting up a new service. > > If instead I could just ask systemd to delegate a part of the tree for my > > processes, then everything would be solved. 
> > I don't follow. You can enable delegation on the scope. I mean, that's > the reason I suggested to use a scope. > > Meaning that it would be great to have a delegated cgroup subtree without the need of a service or scope. Just an empty subtree. > > Do you have any other suggestions? > > Not really, except maybe: please read up on the documentation, it > explains a lot of the concepts. > > I've done, I may not be expressing myself perfectly though. I apologize for that.
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Di, 15.03.22 10:50, Felip Moll (fe...@schedmd.com) wrote: > Another thing I have found is that if the process which created the scope > (e.g. my main process, slurmd) terminates, then the scope is stopped even > if I abandoned it and there's a pid inside. > So this makes the proposed solution not working. What am I missing? > > ● gamba11_slurmstepd.scope > Loaded: loaded (/run/systemd/transient/gamba11_slurmstepd.scope; > transient) > Transient: yes > Active: active (abandoned) since Tue 2022-03-15 10:40:34 CET; 4s ago It's shown as active, so where is the problem? Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mo, 14.03.22 23:12, Felip Moll (fe...@schedmd.com) wrote: > > But note that you can also run your main service as a service, and > > then allocate a *single* scope unit for *all* your payloads. > > The main issue is the scope needs a pid attached to it. I thought that the > scope could live without any process inside, but that's not happening. > So every time a user step/job finishes, my main process must take care of > it, and launch the scope again on the next coming job. Leave a stub process around in it. i.e something similar to "/bin/sleep infinity". > The forked process just does the dbus call, and when the scope is ready it > is moved to the corresponding cgroup (PIDFile=). Hmm? PIDFile= is a property of *services*, not *scopes*. And "scopes" cannot be moved to "cgroups". I cannot parse the above. Did you read up on scopes and services? See https://systemd.io/CGROUP_DELEGATION/, it explains the concept of "scopes". Scopes *have* cgroups, but cannot be moved to "cgroups". > Problem number one: if other processes are in the scope, the dbus call > won't work since I am using the same name all the time, e.g. > slurmstepd.scope. > So I first need to check if the scope exists and if so put the new > slurmstepd process inside. But we still have the race condition, if during > this phase all steps ends, systemd will do the cleanup. Leave a stub process around in it. > Problem number two, there's a significant delay since when creating the > scope, until it is ready and the pid attached into it. The only way it > worked was to put a 'sleep' after the dbus call and make my process wait > for the async call to dbus to be materialized. This is really > un-elegant. If you want to synchronize in the cgroup creation to complete just wait for the JobRemoved bus signal for the job returned by StartTransientUnit(). > If instead I could just ask systemd to delegate a part of the tree for my > processes, then everything would be solved. I don't follow. 
You can enable delegation on the scope. I mean, that's the reason I suggested to use a scope. > Do you have any other suggestions? Not really, except maybe: please read up on the documentation, it explains a lot of the concepts. Lennart -- Lennart Poettering, Berlin
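[Editor's sketch] The single delegated scope with a stub process inside it could be requested with arguments shaped like the tuple below. scope_request() is a hypothetical helper that only builds the data; actually sending it requires a D-Bus binding, which is not shown. The argument order (name, mode, properties, aux) and the property names PIDs and Delegate are assumed from the systemd D-Bus interface and should be checked against its documentation.

```python
def scope_request(name, pids, delegate=True):
    """Build the argument tuple for a StartTransientUnit call that creates a scope."""
    pids = [int(pid) for pid in pids]
    if not pids:
        # A scope needs at least one PID when it is created.
        raise ValueError("a scope needs at least one PID")
    properties = [
        ("PIDs", pids),                # array of unsigned integers
        ("Delegate", bool(delegate)),  # boolean: turn on cgroup delegation
    ]
    aux = []  # no auxiliary units
    return (name, "fail", properties, aux)
```

For instance, scope_request("slurmstepd.scope", [stub_pid]) describes one scope for all payloads, with the stub process keeping it populated.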
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
Another thing I have found is that if the process which created the scope (e.g. my main process, slurmd) terminates, then the scope is stopped even if I abandoned it and there's a pid inside. So the proposed solution does not work.

What am I missing?

● gamba11_slurmstepd.scope
     Loaded: loaded (/run/systemd/transient/gamba11_slurmstepd.scope; transient)
  Transient: yes
     Active: active (abandoned) since Tue 2022-03-15 10:40:34 CET; 4s ago
      Tasks: 1 (limit: 38333)
     Memory: 0B
        CPU: 0
     CGroup: /system.slice/gamba11_slurmstepd.scope
             └─system
               └─18000 /home/lipi/slurm/master/inst/sbin/slurmstepd infinity

mar 15 10:40:53 llit systemd[1]: gamba11_slurmstepd.scope: Succeeded.
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
Hi folks.

I continued with my investigation on the best way to solve my problem. As suggested I am calling the StartTransientUnit method over dbus (using libdbus) to start a new scope. Below are my impressions.

> Firing an async D-Bus packet to systemd should be hardly measurable.
>
> But note that you can also run your main service as a service, and
> then allocate a *single* scope unit for *all* your payloads.

The main issue is the scope needs a pid attached to it. I thought that the scope could live without any process inside, but that's not happening. So every time a user step/job finishes, my main process must take care of it, and launch the scope again on the next coming job.

There's also a race condition when a job is finishing and another one is starting up: at this point the scope can be destroyed but the main process may not realize it.

I also tried to leave the responsibility of setting up the scope to the forked process itself, which is much easier to code and cleaner because of how the software is designed. The forked process just does the dbus call, and when the scope is ready it is moved to the corresponding cgroup (PIDFile=).

Problem number one: if other processes are in the scope, the dbus call won't work since I am using the same name all the time, e.g. slurmstepd.scope. So I first need to check if the scope exists and if so put the new slurmstepd process inside. But we still have the race condition: if during this phase all steps end, systemd will do the cleanup.

Problem number two: there's a significant delay between creating the scope and it being ready with the pid attached to it. The only way it worked was to put a 'sleep' after the dbus call and make my process wait for the async call to dbus to be materialized. This is really inelegant.

> That way
> you can restart your main service unit independently of the scope
> unit, but you only have to issue a single request once for allocating
> the scope, and not for each of your payloads.

Yes. That is solved, I can restart slurmd now, but the other part is not true as I just explained. I need to issue new requests every time the scope is cleaned up by systemd.

> But that too means you have to issue a bus call. If you really don't
> like talking to systemd this is not going to work of course, but quite
> frankly, that's a problem you are making yourself, and I am not
> particularly sympathetic to it.

This is not a problem, but the delay of creating a scope plus it being removed all the time is unacceptable.

My only idea now is to start a scope from the main process, put a "sleep infinity" pid inside, and relieve everything else from ever creating scopes or calling dbus.

If instead I could just ask systemd to delegate a part of the tree for my processes, then everything would be solved.

Do you have any other suggestions?
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Do, 03.03.22 18:35, Felip Moll (fe...@schedmd.com) wrote:

> I have read and studied all your suggestions and I understand them.
> I also did some performance tests in which I fork+executed a systemd-run to
> launch a service for every step and I got bad performance overall.
> One of our QA tests (test 9.8 of our testsuite) shows a decrease of
> performance of 3x.

systemd-run is synchronous, and unless you specify "--scope" it will
tell systemd to fork things off instead of doing that client-side,
which I understand is what you want to do. The fact it's synchronous,
i.e. waits for completion of the whole operation (including start-up
of dependencies and whatnot), necessarily means it's slow.

> > But note that you can also run your main service as a service, and
> > then allocate a *single* scope unit for *all* your payloads. That way
> > you can restart your main service unit independently of the scope
> > unit, but you only have to issue a single request once for allocating
> > the scope, and not for each of your payloads.
>
> My questions are, where would the scope reside? Does it have an associated
> cgroup?

Yes, I explicitly pointed you to them, it's why I suggested you use
them.

My recommendation if you hack on stuff like this is reading the docs
btw, specifically:

https://systemd.io/CGROUP_DELEGATION

It pretty explicitly lists your options in the "Three Scenarios"
section. It also explains what scope units are and when to use them.

> I am also curious of what this sentence does exactly mean:
>
> "You might break systemd as a whole though (for example, add a process
> directly to a slice's cgroup and systemd will be very sad).".

If you add a process to a cgroup systemd manages that is supposed to
be an inner one in the tree, you will make creation of children fail
that way, and thus starting services and other operations will likely
start failing all over the place.

Lennart

--
Lennart Poettering, Berlin
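To make the last point above a bit more concrete, here is a minimal sketch of a check for whether a cgroup directory looks like an inner node, i.e. one that processes should not be added to directly. It is a simplification under stated assumptions: the helper name safe_to_add_process is hypothetical, the check only looks at child directories and at the contents of cgroup.subtree_control, and it ignores special cases such as the root cgroup and threaded subtrees.

```python
from pathlib import Path


def safe_to_add_process(cgroup_dir: str) -> bool:
    """Return True only if cgroup_dir looks like a leaf cgroup.

    A directory counts as an inner node when it has child cgroups
    (subdirectories) or when controllers are enabled for its children in
    cgroup.subtree_control. Adding a process to such a node conflicts with
    cgroup v2's "no internal processes" rule.
    """
    d = Path(cgroup_dir)
    has_children = any(p.is_dir() for p in d.iterdir())
    control = d / "cgroup.subtree_control"
    enabled = control.read_text().split() if control.exists() else []
    return not has_children and not enabled
```

For example, safe_to_add_process("/sys/fs/cgroup/system.slice") would be expected to return False on a typical systemd machine, since slices contain the cgroups of their units.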
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
Hi folks, I wanted to keep the case as generic as possible but I think it is important at this point to comment on what we're talking about, so let me clarify a little bit the case I am dealing with at the moment. In SchedMD, we want Slurm to support 'Cgroup v2'. As you may know Slurm is a HPC resource manager, and for the moment we're limited to Cgroup v1. We actually use the freezer, memory, cpuset, cpuacct and devices controllers in v1. We think it is already a good time to add a plugin to our software to make it capable to run on unified systems, and since systemd is widely used we want to do this integration as best as we can to coexist with systemd and not get our pids moved or make systemd mad. We have a 'slurmd' daemon running on every compute node, waiting for communications from the controller. The controller submits different kinds of RPCs to slurmd and at one point one RPC can instruct slurmd to start a new job step for a specific uid. Slurmd then forks twice; the original slurmd just ends and goes back to other work. The first fork (child) sets a bunch of pipes and prepares initialization data, then forks again generating a grandchild. The grandchild finally exec's the slurmstepd daemon which will be receiving the initialization data, prepare the cgroups, and finally fork+exec the user software. This can happen many times in a second because a user can submit a "job array" which with one single RPC call can submit thousands of steps, and at the same time thousands of other steps can be finishing at the same time, so the work that systemd would need to do starting up new scopes/services and/or stopping them + monitoring all this stuff could be considerable. After this introduction I have to say that we successfully managed to work following systemd rules by just starting a unit file for slurmd with Delegate=yes and creating our own hierarchy inside. 
Every slurmstepd would be forked and started in the delegated cgroup and would create its directory and move itself where it belongs to (always in the delegated cgroup), according to our needs. Everything ran smoothly until when I restarted slurmd and slurmstepds were still running in the cgroup, systemd was unable to start slurmd again because the cgroup was not deleted, since it was busy with directories and slurmstepds; main reason for this bug. Note that one feature of Slurm is that one can upgrade/restart slurmd without affecting running jobs (slurmstepds) in the compute node. I have read and studied all your suggestions and I understand them. I also did some performance tests in which I fork+executed a systemd-run to launch a service for every step and I got bad performance overall. One of our QA tests (test 9.8 of our testsuite) shows a decrease of performance of 3x. But, the positive thing is that we did a test to manually fork+exec one new Delegated separate service when starting up slurmd, and we moved new forked slurmstepd pids *manually* into the new cgroup associated with the new service. This service contains a 'sleep infinity' as the main pid to make the cgroup not disappear even if no slurmstepds are running. As I say, this is a dirty test, which works. After reading your last two emails, I think the most efficient way we need to go is this one: Firing an async D-Bus packet to systemd should be hardly measurable. > > But note that you can also run your main service as a service, and > then allocate a *single* scope unit for *all* your payloads. That way > you can restart your main service unit independently of the scope > unit, but you only have to issue a single request once for allocating > the scope, and not for each of your payloads. > > My questions are, where would the scope reside? Does it have an associated cgroup? If I am a new slurmstepd, can I attach myself to this scope or must I be attached by slurmd before being executed? 
> But that too means you have to issue a bus call. If you really don't > like talking to systemd this is not going to work of course, but quite > frankly, that's a problem you are making yourself, and I am not > particularly sympathetic to it. > I can study this option. It is not that I like or don't like talking to systemd, but the idea is that Slurm must work in other OSes, possibly without systemd but still with cgroup v2, and still be compatible with cgroup v1 and with no cgroup at all. It's thinking about the future, the less complexity and particularities it has, the more maintainable and flexible the software is. I think this is understandable, but if this is not possible at all we will have to adapt. > > DelegateCgroupLeaf=. If set to yes an extra directory will be > > created into the unit cgroup to place the newly spawned service process. > > This is useful for services which need to be restarted while its forked > > pids remain in the cgroup and the service cgroup is not a leaf > > anymore. > > No. Let's not add that. > I could foresee the benefits of such an option, but I can
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mo, 21.02.22 22:16, Felip Moll (lip...@gmail.com) wrote: > Silvio, > > As I commented in my previous post, creating every single job in a separate > slice is an overhead I cannot assume. > An HTC system could run thousands of jobs per second, and doing extra > fork+execs plus waiting for systemd to fill up its internal structures and > manage it all is a no-no. Firing an async D-Bus packet to systemd should be hardly measurable. But note that you can also run your main service as a service, and then allocate a *single* scope unit for *all* your payloads. That way you can restart your main service unit independently of the scope unit, but you only have to issue a single request once for allocating the scope, and not for each of your payloads. But that too means you have to issue a bus call. If you really don't like talking to systemd this is not going to work of course, but quite frankly, that's a problem you are making yourself, and I am not particularly sympathetic to it. > One other option that I am thinking about is extending the parameters of a > unit file, for example adding a DelegateCgroupLeaf=yes option. > > DelegateCgroupLeaf=. If set to yes an extra directory will be > created into the unit cgroup to place the newly spawned service process. > This is useful for services which need to be restarted while its forked > pids remain in the cgroup and the service cgroup is not a leaf > anymore. No. Let's not add that. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mo, 21.02.22 18:07, Felip Moll (lip...@gmail.com) wrote:

> > That's a bad idea typically, and generally a hack: the unit should
> > probably be split up differently, i.e. the processes that shall stick
> > around on restart should probably be in their own unit, i.e. another
> > service or scope unit.
>
> So, if I understand it correctly you are suggesting that every forked
> process must be started through a new systemd unit?

systemd has two different unit types for this: services and scopes.
Both group processes in a cgroup. But only services are where systemd
actually forks+execs (i.e. "starts a process"). If you want to fork
yourself, that's fine, then a scope unit is your thing. If you use
scope units you do everything yourself, but as part of your setup you
then tell systemd to move your process into its own scope unit.

> If that's the case it seems inconvenient because we're talking about a job
> scheduler where sometimes may have thousands of forked processes executed
> quickly, and where performance is key.
> Having to manage a unit per each process will probably not work in this
> situation in terms of performance.

You don't really have to "manage" it. You can register a scope unit
asynchronously, it's firing off one dbus message basically at the same
time you fork things off, telling systemd to put it in a new scope
unit.

> The other option I can imagine is to start a new unit from my daemon of
> Type=forking, which remains forever until I decide to clean it up even if
> it doesn't have any process inside.
> Then I could put my processes in the associated cgroup instead of inside
> the main daemon cgroup. Would that make sense?

Migrating processes wildly between cgroups is messy, because it fucks
up accounting and is restricted permission-wise. Typically you want to
create a cgroup and populate it, and then stick to that.

> The issue here is that for creating the new unit I'd need my daemon to
> depend on systemd libraries, or to do some fork-exec using systemd commands
> and parsing output.

To allocate a scope unit you'd have to fire off a D-Bus method
call. No need for any systemd libraries.

> I am trying to keep the dependencies at a minimum and I'd love to have an
> alternative.

Sorry, but if you want to rearrange processes in cgroups, or want
systemd to manage your processes orthogonal to the service concept you
have to talk to systemd.

> Yeah, I know and understand it is not supported, but I am more interested
> in the technical part of how things would break.
> I see in systemd/src/core/cgroup.c that it often differentiates a cgroup
> with delegation with one without it (!unit_cgroup_delegate(u)), but it's
> hard for me to find out how or where this exactly will mess up with any
> cgroup created outside of systemd. I'd appreciate it if you can give me
> some light on why/when/where things will break in practice, or just an
> example?

This depends highly on what precisely you do. At best systemd will
complain or just override the changes you did outside of the tree you
got delegated. You might break systemd as a whole though (for example,
add a process directly to a slice's cgroup and systemd will be very
sad).

Lennart

--
Lennart Poettering, Berlin
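As an illustration of the single D-Bus method call mentioned above, the following sketch builds the arguments of a StartTransientUnit() request for a scope that adopts an already-forked pid. Assumptions to note: busctl is used here only as a convenient way to spell out the message on a command line, and a daemon would normally send the same arguments through its own D-Bus library; the helper name scope_request_argv and the example unit name are hypothetical; the sketch only builds the argument list and does not execute anything.

```python
from typing import List


def scope_request_argv(unit_name: str, pid: int, delegate: bool = True) -> List[str]:
    """Build a busctl command line for Manager.StartTransientUnit().

    The method signature is "ssa(sv)a(sa(sv))": unit name, job mode, an array
    of (property name, variant) pairs, and an array of auxiliary units, which
    is left empty here.
    """
    # "PIDs" is an array of uint32 ("au") holding the processes to adopt.
    props = ["PIDs", "au", "1", str(pid)]
    count = 1
    if delegate:
        # Request cgroup delegation for the scope.
        props += ["Delegate", "b", "true"]
        count += 1
    return [
        "busctl", "call",
        "org.freedesktop.systemd1",
        "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager",
        "StartTransientUnit",
        "ssa(sv)a(sa(sv))",
        unit_name, "fail",      # job mode "fail": refuse on conflicting jobs
        str(count), *props,     # properties array: length, then entries
        "0",                    # no auxiliary units
    ]
```

A call such as scope_request_argv("slurmstepd.scope", 1234) yields a list that could be handed to subprocess.run() for experimentation; the returned job object path is what a client would later match against JobRemoved.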
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Monday, 21.02.2022 at 22:16 +0100, Felip Moll wrote:

> Silvio,
>
> As I commented in my previous post, creating every single job in a
> separate slice is an overhead I cannot assume.
> An HTC system could run thousands of jobs per second, and doing extra
> fork+execs plus waiting for systemd to fill up its internal
> structures and manage it all is a no-no.

And how about an xinetd-style daemon, accepting connections and spawning processes that way? So instead of sgamba1.service you would have a sgamba1@.service and a sgamba1.socket, spawning sgamba1@user1.service, sgamba1@user2.service, etc. units. So even if one user process dies, nothing else dies. And the setup overhead would only be incurred once, every time a user creates a new connection. So they can still drop their one million jobs and you still have user isolation.
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
> On 21 Feb 2022, at 21:16, Felip Moll wrote: > > > >> You could invoke a man:systemd-run for each new process. Than you can >> put every single job in a seperate .slice with its own >> man:systemd.resource-control applied. >> This would also mean that you don't need to compile against libsystemd. >> Just exec() accordingly if a systemd-system is detected. >> >> BR >> Silvio > > Silvio, > > As I commented in my previous post, creating every single job in a separate > slice is an overhead I cannot assume. > An HTC system could run thousands of jobs per second, and doing extra > fork+execs plus waiting for systemd to fill up its internal structures and > manage it all is a no-no. Are you assuming this or did you measure the cost? Barry > > One other option that I am thinking about is extending the parameters of a > unit file, for example adding a DelegateCgroupLeaf=yes option. > > DelegateCgroupLeaf=. If set to yes an extra directory will be created > into the unit cgroup to place the newly spawned service process. This is > useful for services which need to be restarted while its forked pids remain > in the cgroup and the service cgroup is not a leaf anymore. This option is > only valid when using Delegate=yes and under a system in unified mode. > > E.g. in my example, that would end up like this: > /sys/fs/cgroup/system.slices/sgamba1.service <-- This is Delegated=yes > DelegateMultiCgroups=yes > ├── sgamba1 <-- The spawned process would be always put in here by > systemd. > ├── user1_stuff > ├── user2_stuff > └── user3_stuff > > I think this idea could work for cases like the one exposed here, and I see > this would be quite useful. >
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
You could invoke a man:systemd-run for each new process. Than you can > put every single job in a seperate .slice with its own > man:systemd.resource-control applied. > This would also mean that you don't need to compile against libsystemd. > Just exec() accordingly if a systemd-system is detected. > > BR > Silvio > Silvio, As I commented in my previous post, creating every single job in a separate slice is an overhead I cannot assume. An HTC system could run thousands of jobs per second, and doing extra fork+execs plus waiting for systemd to fill up its internal structures and manage it all is a no-no. One other option that I am thinking about is extending the parameters of a unit file, for example adding a DelegateCgroupLeaf=yes option. DelegateCgroupLeaf=. If set to yes an extra directory will be created into the unit cgroup to place the newly spawned service process. This is useful for services which need to be restarted while its forked pids remain in the cgroup and the service cgroup is not a leaf anymore. This option is only valid when using Delegate=yes and under a system in unified mode. E.g. in my example, that would end up like this: /sys/fs/cgroup/system.slices/sgamba1.service <-- This is Delegated=yes DelegateMultiCgroups=yes ├── sgamba1 <-- The spawned process would be always put in here by systemd. ├── user1_stuff ├── user2_stuff └── user3_stuff I think this idea could work for cases like the one exposed here, and I see this would be quite useful.
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Monday, 21.02.2022 at 18:07 +0100, Felip Moll wrote:

> The hard requirement that my project has is that processes need to
> live even if the daemon who forked them dies.
> Roughly it is how a batch scheduler works: one controller sends a
> request to my daemon for launching a process in the name of a user,
> my daemon forks-exec it. At some point my daemon can be stopped,
> restarted, upgraded, whatever but the forked processes need to always
> be alive because they are continuing their work. We are talking here
> about the HPC world.

You could invoke a man:systemd-run for each new process. Then you can put every single job in a separate .slice with its own man:systemd.resource-control applied.
This would also mean that you don't need to compile against libsystemd. Just exec() accordingly if a systemd system is detected.

BR
Silvio
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
> > Hmm? Hard requirement of what? Not following? > > The hard requirement that my project has is that processes need to live even if the daemon who forked them dies. Roughly it is how a batch scheduler works: one controller sends a request to my daemon for launching a process in the name of a user, my daemon forks-exec it. At some point my daemon can be stopped, restarted, upgraded, whatever but the forked processes need to always be alive because they are continuing their work. We are talking here about the HPC world. > You are leaving processes around when your service dies/restarts? > Yes. > That's a bad idea typically, and a generally a hack: the unit should > probably be split up differently, i.e. the processes that shall stick > around on restart should probably be in their own unit, i.e. another > service or scope unit. > So, if I understand it correctly you are suggesting that every forked process must be started through a new systemd unit? If that's the case it seems inconvenient because we're talking about a job scheduler where sometimes may have thousands of forked processes executed quickly, and where performance is key. Having to manage a unit per each process will probably not work in this situation in terms of performance. The other option I can imagine is to start a new unit from my daemon of Type=forking, which remains forever until I decide to clean it up even if it doesn't have any process inside. Then I could put my processes in the associated cgroup instead of inside the main daemon cgroup. Would that make sense? The issue here is that for creating the new unit I'd need my daemon to depend on systemd libraries, or to do some fork-exec using systemd commands and parsing output. I am trying to keep the dependencies at a minimum and I'd love to have an alternative. > That's not supported. You may only create your own cgroups where you > turned on delegation, otherwise all bets are off. 
If you put stuff in > /sys/fs/cgroup/user-stuff its as if you placed stuff in systemd's > "-.slice" without telling it so, and things will break sooner or > later, and often in non-obvious ways. > Yeah, I know and understand it is not supported, but I am more interested in the technical part of how things would break. I see in systemd/src/core/cgroup.c that it often differentiates a cgroup with delegation with one without it (!unit_cgroup_delegate(u)), but it's hard for me to find out how or where this exactly will mess up with any cgroup created outside of systemd. I'd appreciate it if you can give me some light on why/when/where things will break in practice, or just an example? I am also aware of the single-writer policy that systemd has in its documentation, and I am aware that this is not supported, but I'd like to understand exactly what can happen. Thanks for your help & time :)
Re: [systemd-devel] unable to attach pid to service delegated directory in unified mode after restart
On Mo, 21.02.22 14:14, Felip Moll (lip...@gmail.com) wrote:

> Hello,
>
> I am creating a software which consists of one daemon which forks several
> processes from user requests.
> This is basically acting like a job scheduler.
>
> The daemon is started using a unit file and with Delegate=yes option,
> because every process must be constrained differently. I manage my cgroup
> hierarchy, create some leaves into the tree and put each pid there.
> For example, after starting up the service and receiving 3 user requests, a
> tree under /sys/fs/cgroup/system.slice/ could look like:
>
> sgamba1.service/
> ├── daemon_pid
> ├── user1_stuff
> ├── user2_stuff
> └── user3_stuff
>
> I create the hierarchy and set cgroup.subtree_control in the root directory
> (sgamba1.service in the example) and everything runs smoothly, until when I
> decide to restart my service.
>
> The service then cannot restart:
>
> feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed to attach to
> cgroup /system.slice/sgamba1.service: Device or resource busy
> feb 18 19:48:52 llit systemd[1143296]: sgamba1.service: Failed at step
> CGROUP spawning /path_to_bin/mydaemond: Device or resource busy
>
> This is because systemd tries to put the pid of the new daemon in
> sgamba1.service/cgroup.procs and this would break the "no internal process
> constrain" rule for cgroup v2, since sgamba1.service is not a leaf anymore
> because it has subtree_control enabled for the user stuff.
>
> One hard requirement is that user stuff must live even if the service is
> restarted.

Hmm? Hard requirement of what? Not following?

You are leaving processes around when your service dies/restarts?
That's a bad idea typically, and generally a hack: the unit should
probably be split up differently, i.e. the processes that shall stick
around on restart should probably be in their own unit, i.e. another
service or scope unit.

> What's the way to achieve that? I see one easy way, which is to move user
> stuff into its own cgroup and out of sgamba1.service/, but then it will run
> outside a Delegate=yes unit. What can happen then?
> Will systemd eventually migrate my processes?
> How do services workaround that issue?
> If I am moving user stuff into the root /sys/fs/cgroup/user_stuff/, will
> systemd touch my directories?

That's not supported. You may only create your own cgroups where you
turned on delegation, otherwise all bets are off. If you put stuff in
/sys/fs/cgroup/user-stuff it's as if you placed stuff in systemd's
"-.slice" without telling it so, and things will break sooner or
later, and often in non-obvious ways.

Lennart

--
Lennart Poettering, Berlin
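The rule stated above (only create cgroups below a subtree that was delegated to you) can be expressed as a small guard in code. The following is a minimal sketch, not the way Slurm or systemd implement anything: create_leaf and move_pid are hypothetical helper names, the delegated root is assumed to be known to the caller (for example the cgroup of a unit started with Delegate=yes), and error handling for the actual cgroupfs writes is left out.

```python
from pathlib import Path


def create_leaf(delegated_root: str, name: str) -> Path:
    """Create a child cgroup directory, refusing anything outside the delegated root."""
    root = Path(delegated_root).resolve()
    leaf = (root / name).resolve()
    # Reject names such as "../user_stuff" that would escape the delegation.
    if root not in leaf.parents:
        raise ValueError(f"{leaf} is outside the delegated subtree {root}")
    leaf.mkdir(exist_ok=True)
    return leaf


def move_pid(leaf: Path, pid: int) -> None:
    """Move a process into the leaf by writing its pid to cgroup.procs."""
    (leaf / "cgroup.procs").write_text(f"{pid}\n")
```

With a layout like the sgamba1.service example, a call such as create_leaf("/sys/fs/cgroup/system.slice/sgamba1.service", "user1_stuff") stays inside the delegation, while a name that resolves to /sys/fs/cgroup/user_stuff raises ValueError instead of creating a cgroup systemd does not know about.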