Hi!

I just realized: this may well be fixed by Shepherd commit
a5d9a41aa0bfadea9617df4411b86c6ea0c525bd (see
<https://issues.guix.gnu.org/77373>.)  Would be great to give it a try
somehow.

Anyway, some more data.

Yesterday I ran ‘herd stop nginx’ on bayfront (which was running 1.0.2),
again causing ‘herd status’ to hang afterwards.  This time I captured an
strace log of PID 1.

The /var/log/messages showing what happens after ‘herd stop nginx’

--8<---------------cut here---------------start------------->8---
2025-04-08 11:28:27 localhost shepherd[1]: Stopping service 
certbot-certificate-renewal...
2025-04-08 11:28:27 localhost shepherd[1]: Terminating timer 
'certbot-certificate-renewal' with 0 processes running.
2025-04-08 11:28:27 localhost shepherd[1]: Service certbot-certificate-renewal 
stopped.
2025-04-08 11:28:27 localhost shepherd[1]: Service certbot-certificate-renewal 
is now stopped.
2025-04-08 11:28:27 localhost shepherd[1]: Stopping service nginx...
2025-04-08 11:28:27 localhost shepherd[1]: Registering new logger for nginx.
2025-04-08 11:28:28 localhost shepherd[1]: [nginx] nginx: [alert] could not 
open error log file: open() 
"/gnu/store/w2pi23rrk1dy95rzpb9l87gcj3c4jb38-nginx-1.27.3/logs/error.log" 
failed (2: No such file or directory)
2025-04-08 11:28:29 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 
12958#0: duplicate MIME type "text/html" in 
/gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:1043
2025-04-08 11:28:29 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 
12958#0: duplicate MIME type "text/html" in 
/gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:1107
2025-04-08 11:28:29 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 
12958#0: duplicate MIME type "text/html" in 
/gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:4100
2025-04-08 11:28:32 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 
12958#0: conflicting server name "bayfront.guix.gnu.org" on 0.0.0.0:80, ignored
2025-04-08 11:28:32 localhost shepherd[1]: Service nginx might have failed to 
stop.
2025-04-08 11:28:32 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 
12958#0: conflicting server name "logs.guix.gnu.org" on 0.0.0.0:80, ignored
2025-04-08 11:28:32 localhost shepherd[1]: Service nginx is now stopped.
2025-04-08 11:28:32 localhost shepherd[1]: Accepted connection on 0.0.0.0:22 
from 101.35.215.239:37996.
2025-04-08 11:29:40 localhost shepherd[1]: 2 connections still in use after 
sshd-25913 termination.
2025-04-08 11:29:40 localhost shepherd[1]: Service sshd-25913 (PID 12257) 
exited with 3.
2025-04-08 11:29:40 localhost shepherd[1]: Service sshd-25913 has been disabled.
2025-04-08 11:32:37 localhost shepherd[1]: Accepted connection on [::]:22 from 
[2a0c:e300::58]:40428.
--8<---------------cut here---------------end--------------->8---

(The “might have failed to stop” bit is harmless and fixed in
3f4438741615aa1a1198f9415be8bd110138d83e.)

On berlin, which does not exhibit the bug, we get something similar (for
‘herd restart nginx’ here):

--8<---------------cut here---------------start------------->8---
2025-04-07 16:11:25 localhost shepherd[1]: Stopping service certbot-renewal...
2025-04-07 16:11:25 localhost shepherd[1]: Terminating timer 'certbot-renewal' 
with 0 processes running.
2025-04-07 16:11:25 localhost shepherd[1]: Service certbot-renewal stopped.
2025-04-07 16:11:25 localhost shepherd[1]: Service certbot-renewal is now 
stopped.
2025-04-07 16:11:26 localhost shepherd[1]: Stopping service nginx...
2025-04-07 16:11:26 localhost shepherd[1]: Registering new logger for nginx.
2025-04-07 16:11:26 localhost shepherd[1]: [nginx] nginx: [alert] could not 
open error log file: open() 
"/gnu/store/40mb76jc9qbm0d36qyncnpgrgsfab0db-nginx-1.27.3/logs/error.log" 
failed (2: No such file or directory)
2025-04-07 16:11:26 localhost shepherd[1]: [nginx] 2025/04/07 16:11:26 [warn] 
27359#0: duplicate MIME type "text/html" in 
/gnu/store/s2yrd68wr9cd2iirc900491qgigyfg0z-nginx.conf:3593
2025-04-07 16:11:26 localhost shepherd[1]: [nginx] 2025/04/07 16:11:26 [notice] 
27359#0: signal process started
2025-04-07 16:11:26 localhost shepherd[1]: Service nginx stopped.
2025-04-07 16:11:26 localhost shepherd[1]: Service nginx is now stopped.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning 
anonip-/var/log/anonip/bootstrappable.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning 
anonip-/var/log/anonip/issues-guix-gnu-org.https.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning 
anonip-/var/log/anonip/qualif.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning 
anonip-/var/log/anonip/disarchive.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning 
anonip-/var/log/anonip/bootstrappable.https.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning 
anonip-/var/log/anonip/dump-guix-gnu-org.https.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning 
anonip-/var/log/anonip/workflows-guix-info.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning 
anonip-/var/log/anonip/workflows-guix-info.https.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Starting service user-homes...
2025-04-07 16:11:26 localhost shepherd[1]: Service user-homes has been started.
2025-04-07 16:11:26 localhost shepherd[1]: Service user-homes started.
2025-04-07 16:11:26 localhost shepherd[1]: Service user-homes running with 
value #t.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning 
anonip-/var/log/anonip/https.access.log.
--8<---------------cut here---------------end--------------->8---

Unfortunately the strace log isn’t very helpful:

--8<---------------cut here---------------start------------->8---
11:28:27 read(54, "(shepherd-command (version 0) (action stop) (service nginx) 
(arguments ()) (directory \"/home/ludo/src/maintenance/hydra\"))", 1024) = 122 
<0.000162>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=326773810}) 
= 0 <0.000043>
11:28:27 epoll_wait(11, [], 16, 0)      = 0 <0.000040>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327037318}) 
= 0 <0.000040>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327207449}) 
= 0 <0.000040>
11:28:27 epoll_wait(11, [], 16, 0)      = 0 <0.000037>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327433802}) 
= 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327698776}) 
= 0 <0.000038>
11:28:27 epoll_wait(11, [], 16, 0)      = 0 <0.000036>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327923662}) 
= 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=328114326}) 
= 0 <0.000038>
11:28:27 epoll_wait(11, [], 16, 0)      = 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=328338724}) 
= 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=355846206}) 
= 0 <0.000069>
11:28:27 epoll_wait(11, [{events=EPOLLIN, data={u32=38, u64=38}}], 16, 0) = 1 
<0.000045>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=356163002}) 
= 0 <0.000041>
11:28:27 read(38, "G", 1)               = 1 <0.000049>
11:28:27 read(38, "E", 1)               = 1 <0.000038>
11:28:27 read(38, "T", 1)               = 1 <0.000038>
[…]
11:28:27 write(7, "Stopping service certbot-certificate-renewal...\n", 48) = 48 
<0.000052>
[…]
11:28:27 write(25, "2025-04-08 11:28:27 localhost shepherd[1]: Service 
certbot-certificate-renewal stopped.\n", 88) = 88 <0.000049>
[…]
11:28:27 write(24, "shepherd[1]: Stopping service nginx...\n", 39) = 39 
<0.000050>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=560464210}) 
= 0 <0.000045>
11:28:27 epoll_wait(11, [{events=EPOLLIN, data={u32=17, u64=17}}], 16, 0) = 1 
<0.000040>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=560715007}) 
= 0 <0.000042>
11:28:27 getuid()                       = 0 <0.000052>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=687926724}) 
= 0 <0.000040>
11:28:27 epoll_wait(11, [], 16, 0)      = 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=688178499}) 
= 0 <0.000037>
11:28:27 rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) 
= 0 <0.000055>
11:28:27 pipe2([56, 59], O_CLOEXEC)     = 0 <0.000060>
11:28:27 fcntl(56, F_GETFL)             = 0 (flags O_RDONLY) <0.000038>
11:28:27 lseek(56, 0, SEEK_CUR)         = -1 ESPIPE (Illegal seek) <0.000037>
11:28:27 fcntl(59, F_GETFL)             = 0x1 (flags O_WRONLY) <0.000036>
11:28:27 lseek(59, 0, SEEK_CUR)         = -1 ESPIPE (Illegal seek) <0.000036>
11:28:27 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000035>
11:28:27 clone(child_stack=NULL, 
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
child_tidptr=0x7f01e6e99850) = 12958 <0.008960>
11:28:27 fcntl(56, F_GETFL)             = 0 (flags O_RDONLY) <0.000040>
11:28:27 fcntl(56, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000039>
11:28:27 close(59)                      = 0 <0.000039>
11:28:27 rt_sigprocmask(SIG_SETMASK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 
8) = 0 <0.000038>
11:28:27 recvfrom(17, "shepherd[1]: Stopping service nginx...\n", 1024, 0, 
0x7ffc10c01010, [112 => 0]) = 39 <0.000060>
[…]
11:28:27 read(56, 0x7f01d9a634d0, 1)    = -1 EAGAIN (Resource temporarily 
unavailable) <0.000040>
11:28:27 epoll_ctl(11, EPOLL_CTL_MOD, 56, 
{events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=56, u64=56}}) = -1 ENOENT 
(No such file or directory) <0.000041>
11:28:27 epoll_ctl(11, EPOLL_CTL_ADD, 56, 
{events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=56, u64=56}}) = 0 <0.000043>
11:28:27 write(7, "Registering new logger for nginx.\n", 34) = 34 <0.000048>
[…]
11:28:28 read(56, "n", 1)               = 1 <0.000049>
11:28:28 read(56, "g", 1)               = 1 <0.000046>
11:28:28 read(56, "i", 1)               = 1 <0.000042>
11:28:28 read(56, "n", 1)               = 1 <0.000040>
11:28:28 read(56, "x", 1)               = 1 <0.000039>
11:28:28 read(56, ":", 1)               = 1 <0.000039>
11:28:28 read(56, " ", 1)               = 1 <0.000062>
11:28:28 read(56, "[", 1)               = 1 <0.000039>
11:28:28 read(56, "a", 1)               = 1 <0.000042>
11:28:28 read(56, "l", 1)               = 1 <0.000042>
11:28:28 read(56, "e", 1)               = 1 <0.000041>
11:28:28 read(56, "r", 1)               = 1 <0.000044>
11:28:28 read(56, "t", 1)               = 1 <0.000064>
11:28:28 read(56, "]", 1)               = 1 <0.000043>
[…]
11:28:29 write(7, "[", 1)               = 1 <0.000048>
11:28:29 write(7, "nginx", 5)           = 5 <0.000035>
11:28:29 write(7, "] ", 2)              = 2 <0.000034>
11:28:29 write(7, "2025/04/08 11:28:28 [warn] 12958#0: duplicate MIME type 
\"text/html\" in /gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:1043", 
130) = 130 <0.000036>
11:28:29 write(7, "\n", 1)              = 1 <0.000036>
[…]
11:28:32 accept4(57, {sa_family=AF_INET, sin_port=htons(37996), 
sin_addr=inet_addr("101.35.215.239")}, [112 => 16], SOCK_CLOEXEC) = 54 
<0.000058>
11:28:32 write(7, "Accepted connection on 0.0.0.0:22 from 
101.35.215.239:37996.\n", 61) = 61 <0.000050>
11:28:32 fcntl(54, F_GETFL)             = 0x2 (flags O_RDWR) <0.000037>
11:28:32 fcntl(54, F_SETFL, O_RDWR)     = 0 <0.000037>
11:28:32 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104512, tv_nsec=512507676}) 
= 0 <0.000037>
11:28:32 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000035>
11:28:32 epoll_wait(11, [{events=EPOLLIN, data={u32=6, u64=6}}], 16, 273) = 1 
<0.000039>
11:28:32 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104512, tv_nsec=512933004}) 
= 0 <0.000037>
[…]
11:28:52 accept4(16, 0x7ffc10c01010, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 
EAGAIN (Resource temporarily unavailable) <0.000057>
11:28:52 epoll_ctl(11, EPOLL_CTL_MOD, 16, 
{events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=16, u64=16}}) = 0 <0.000042>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=448197354}) 
= 0 <0.000037>
11:28:52 epoll_wait(11, [], 16, 0)      = 0 <0.000038>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=448434462}) 
= 0 <0.000036>
11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, 
st_size=2962, ...}, 0) = 0 <0.000046>
11:28:52 write(41, "2025-04-08 11:28:52      39:4 17 (call-with-sigint _ _)\n", 
56) = 56 <0.000105>
11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, 
st_size=2962, ...}, 0) = 0 <0.000045>
11:28:52 write(41, "2025-04-08 11:28:52 In fibers.scm:\n", 35) = 35 <0.000058>
11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, 
st_size=2962, ...}, 0) = 0 <0.000045>
11:28:52 write(19, "2025-04-08 11:28:52 localhost sudo:     ludo : TTY=pts/9 ; 
PWD=/home/ludo/src/maintenance/hydra ; USER=root ; 
COMMAND=/run/current-system/profile/bin/herd status nginx\n", 168) = 168 
<0.000062>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=450211551}) 
= 0 <0.000039>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=450387060}) 
= 0 <0.000039>
11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, 
st_size=2962, ...}, 0) = 0 <0.000045>
11:28:52 write(19, "2025-04-08 11:28:52 localhost sudo: pam_unix(sudo:session): 
session opened for user root(uid=0) by (uid=30012)\n", 111) = 111 <0.000051>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=450934120}) 
= 0 <0.000038>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=451064163}) 
= 0 <0.000059>
11:28:52 read(59, "(shepherd-command (version 0) (action status) (service 
nginx) (arguments ()) (directory \"/home/ludo/src/maintenance/hydra\"))", 1024) 
= 124 <0.000084>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=452217926}) 
= 0 <0.000040>
11:28:52 epoll_wait(11, [], 16, 0)      = 0 <0.000044>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=452460901}) 
= 0 <0.000039>
--8<---------------cut here---------------end--------------->8---

At that point it was still accepting connections on
/var/run/shepherd/socket and reading them, reading process output and
logging it, but not much else.

For example, the “Accepted connection on …:22” messages do not have a
corresponding ‘clone’ call and not even “Starting service sshd-1234...”
as we’d normally see right after.

Ludo’.



Reply via email to