Is this with cgroups v1 or v2? If cgroups v1 is involved (thanks Docker), I recall it was a bit complex for systemd to get notified when the cgroup actually empties – via /sys/fs/cgroup/systemd/release_agent that specifies a helper executable that the kernel runs... I wonder if that mechanism is broken on your system.
On Wed, Oct 11, 2023 at 7:38 AM Martin Schwenke <mar...@meltin.net> wrote: > I'm seeing "systemctl stop <service>" for several services taking a > long time because it goes through the timeout process, even though all > relevant processes have exited. > > I'll give 2 examples. Both examples are running inside a privileged > Rocky Linux 8.8 Docker container on a Rocky Linux 8.8 host. The > systemd version, reported by "systemctl --version" in the container > is: > > systemd 239 (239-74.el8_8.5) > > Here is ctdb.system: > > [Unit] > Description=CTDB > Documentation=man:ctdbd(1) man:ctdb(7) > After=network-online.target time-sync.target > ConditionFileNotEmpty=/etc/ctdb/nodes > > [Service] > Type=forking > LimitCORE=infinity > LimitNOFILE=1048576 > TasksMax=4096 > PIDFile=/var/run/ctdb/ctdbd.pid > ExecStart=/usr/sbin/ctdbd > ExecStop=/usr/bin/ctdb shutdown > KillMode=control-group > Restart=no > > [Install] > WantedBy=multi-user.target > > "/usr/bin/ctdb shutdown" causes a controlled shutdown. In many cases, > starting and then stopping using systemctl works fine. However, many > times it takes >90s to stop, as per TimeoutStopSec. If I reduce that > value then the duration reduces accordingly. I can confirm using both > "ps auxfww" and "systemd-cgls" that within the container there are no > relevant processes a moment after "systemctl stop ctdb" is run. In > particular, in systemd-cgls ctdb.service is gone but "systemctl stop > ctdb" is still waiting. > > Before attempting to stop, the service is successfully started: > > Oct 11 00:56:44 rocky1 systemd[710741]: ctdb.service: Executing: > /usr/sbin/ctdbd > Oct 11 00:56:44 rocky1 ctdbd[710741]: CTDB logging to location > file:/var/log/log.ctdb > Oct 11 00:56:44 rocky1 systemd[1]: Received SIGCHLD from PID 710741 > (ctdbd). > Oct 11 00:56:44 rocky1 systemd[1]: Child 710741 (ctdbd) died > (code=exited, status=0/SUCCESS) > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Child 710741 belongs to > ctdb.service. > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Control process exited, > code=exited status=0 > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for > state start. > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: New main PID 710742 > belongs to service, we are happy. > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Main PID loaded: 710742 > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Changed start -> running > Oct 11 00:56:44 rocky1 systemd[1]: ctdb.service: Job ctdb.service/start > finished, result=done > Oct 11 00:56:44 rocky1 systemd[1]: Started CTDB. > -- Subject: Unit ctdb.service has finished start-up > -- Defined-By: systemd > -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel > -- > -- Unit ctdb.service has finished starting up. > -- > -- The start-up result is done. > > The relevant part of the log while stopping seems to be: > > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710743 > (ctdb-eventd). > Oct 11 00:56:47 rocky1 systemd[1]: Child 710742 (ctdbd) died > (code=exited, status=0/SUCCESS) > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710742 belongs to > ctdb.service. > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Can't open PID file > /var/run/ctdb/ctdbd.pid (yet?) after stop: No such file or directory > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Main process exited, > code=exited, status=0/SUCCESS > Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal > sender=org.freedesktop.systemd1 destination=n/a > path=/org/freedesktop/systemd1/unit/ctdb_2eservice > interface=org.freedesktop.DBus.Properties member=PropertiesChanged > cookie=54 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a > Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal > sender=org.freedesktop.systemd1 destination=n/a > path=/org/freedesktop/systemd1/unit/ctdb_2eservice > interface=org.freedesktop.DBus.Properties member=PropertiesChanged > cookie=55 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a > Oct 11 00:56:47 rocky1 systemd[1]: Child 710743 (ctdb-eventd) died > (code=exited, status=0/SUCCESS) > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710743 belongs to > ctdb.service. > Oct 11 00:56:47 rocky1 systemd[1]: systemd-journald.service: Received > EPOLLHUP on stored fd 18 (stored), closing. > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710860 > (ctdb). > Oct 11 00:56:47 rocky1 systemd[1]: Child 710860 (ctdb) died > (code=exited, status=0/SUCCESS) > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710860 belongs to > ctdb.service. > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Control process exited, > code=exited status=0 > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Got final SIGCHLD for > state stop. > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Changed stop -> > stop-sigterm > Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal > sender=org.freedesktop.systemd1 destination=n/a > path=/org/freedesktop/systemd1/unit/ctdb_2eservice > interface=org.freedesktop.DBus.Properties member=PropertiesChanged > cookie=56 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a > Oct 11 00:56:47 rocky1 systemd[1]: Sent message type=signal > sender=org.freedesktop.systemd1 destination=n/a > path=/org/freedesktop/systemd1/unit/ctdb_2eservice > interface=org.freedesktop.DBus.Properties member=PropertiesChanged > cookie=57 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710834 > (ctdb_lock_helpe). > Oct 11 00:56:47 rocky1 systemd[1]: Child 710834 (ctdb_lock_helpe) died > (code=exited, status=0/SUCCESS) > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710834 belongs to > ctdb.service. > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710835 > (ctdb_lock_helpe). > Oct 11 00:56:47 rocky1 systemd[1]: Child 710832 (ctdb_lock_helpe) died > (code=exited, status=0/SUCCESS) > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710832 belongs to > ctdb.service. > Oct 11 00:56:47 rocky1 systemd[1]: Child 710835 (ctdb_lock_helpe) died > (code=exited, status=0/SUCCESS) > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710835 belongs to > ctdb.service. > Oct 11 00:56:47 rocky1 systemd[1]: Child 710833 (ctdb_lock_helpe) died > (code=exited, status=0/SUCCESS) > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710833 belongs to > ctdb.service. > Oct 11 00:56:47 rocky1 systemd[1]: Child 710836 (ctdb_lock_helpe) died > (code=exited, status=0/SUCCESS) > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710836 belongs to > ctdb.service. > Oct 11 00:56:47 rocky1 systemd[1]: Child 710837 (ctdb_lock_helpe) died > (code=exited, status=0/SUCCESS) > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710837 belongs to > ctdb.service. > Oct 11 00:56:47 rocky1 systemd[1]: Child 710838 (ctdb_lock_helpe) died > (code=exited, status=0/SUCCESS) > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710838 belongs to > ctdb.service. > Oct 11 00:56:47 rocky1 systemd[1]: Child 710839 (ctdb_lock_helpe) died > (code=exited, status=0/SUCCESS) > Oct 11 00:56:47 rocky1 systemd[1]: ctdb.service: Child 710839 belongs to > ctdb.service. > Oct 11 00:56:47 rocky1 systemd[1]: Received SIGCHLD from PID 710832 > (n/a). > Oct 11 00:57:44 rocky1 systemd[1]: systemd-journald.service: Got > notification message from PID 840 (WATCHDOG=1) > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: State 'stop-sigterm' > timed out. Killing. > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Failed with result > 'timeout'. > -- Subject: Unit failed > -- Defined-By: systemd > -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel > -- > -- The unit ctdb.service has entered the 'failed' state with result > 'timeout'. > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Changed stop-sigterm -> > failed > Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal > sender=org.freedesktop.systemd1 destination=n/a > path=/org/freedesktop/systemd1 interface=org.freedesktop.DBus.Properties > member=PropertiesChanged cookie=58 reply_cookie=0 signature=sa{sv}as > error-name=n/a error-message=n/a > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Job ctdb.service/stop > finished, result=done > Oct 11 00:58:17 rocky1 systemd[1]: Stopped CTDB. > -- Subject: Unit ctdb.service has finished shutting down > -- Defined-By: systemd > -- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel > -- > -- Unit ctdb.service has finished shutting down. > Oct 11 00:58:17 rocky1 systemd[1]: Sent message type=signal > sender=org.freedesktop.systemd1 destination=n/a > path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager > member=JobRemoved cookie=59 reply_cookie=0 signature=uoss error-name=n/a > error-message=n/a > Oct 11 00:58:17 rocky1 systemd[1]: ctdb.service: Unit entered failed > state. > > It would be very useful if systemd could log what it is still waiting > for when it times out. > > Note that during start and stop, CTDB runs a lot of subprocesses, > including some that use systemctl to start and stop various services > that it, in turn, manages. > > The full debug level log, after running: > > systemd-analyze log-level debug > > is uploaded it to: > > https://meltin.net/uploads/systemd/ctdb-stop.log > > I'm happy to reply and attach it, but it is 48KB. > > The only theory I can come up with is some sort of race where > processes are created during shutdown and systemd gets confused. > > I see a similar thing for a much simpler service, winbind: > > Here is winbind.service: > > [Unit] > Description=Samba Winbind Daemon > Documentation=man:winbindd(8) man:samba(7) man:smb.conf(5) > After=network.target nmb.service > > [Service] > Type=notify > PIDFile=/var/run/winbindd.pid > EnvironmentFile=-/etc/sysconfig/samba > ExecStart=/usr/sbin/winbindd --foreground --no-process-group > $WINBINDOPTIONS > ExecReload=/bin/kill -HUP $MAINPID > LimitCORE=infinity > > [Install] > WantedBy=multi-user.target > > Yesterday I watched it do the same thing as CTDB. I could start the > service by hand but it would time out during stop, nearly every time, > even though there were no relevant processes running anymore. > winbindd sends a READY=1 notification after successfully starting. It > does not send STOPPING=1. winbindd is much simpler during shutdown. > I can get logs for this one too if necessary. > > Thanks for any help. > > peace & happiness, > martin > -- Mantas Mikulėnas