bug#62619: Shepherd desertion upon service canonical name change?

2023-04-27 Thread Ludovic Courtès
Hi Bruno,

Bruno Victal  skribis:

> Upon a guix system reconfigure, if a running service has undergone a
> canonical name change since the previous generation, then shutdown or
> reboot commands fail, with shepherd indicating itself (root service)
> as stopped.

Oh, fun.

> mirai@guix-nuc ~$ sudo reboot

[…]

> Service xvnc has been stopped.
> Service git-daemon has been stopped.
> Assertion (eq? (canonical-name new) (canonical-name old)) failed.
> assertion-failed()
> Service root has been stopped.

I think this assertion failure is the root cause.  It comes from
‘register-services’ in 0.9.3, which reads:

--8<---cut here---start->8---
(define (register-services . new-services)
  "Add NEW-SERVICES to the list of known services.  If a service has already
been registered, arrange to have it replaced when it is next stopped.  If it
is currently stopped, replace it immediately."
  (define (register-single-service new)
    ;; Sanity-checks first.
    (assert (list-of-symbols? (provided-by new)))
    (assert (list-of-symbols? (required-by new)))
    (assert (boolean? (respawn? new)))

    ;; FIXME: Just because we have a unique canonical name now doesn't mean it
    ;; will remain unique as other services are added. Whenever a service is
    ;; added it should check that it's not conflicting with any already
    ;; registered canonical names.
    (match (lookup-services (canonical-name new))
      (() ;; empty, so we can safely add ourselves
       (for-each (lambda (name)
                   (let ((old (lookup-services name)))
                     (hashq-set! %services name (cons new old))))
                 (provided-by new)))
      ((old . rest) ;; one service registered, it may be an old version of us
       (assert (null? rest))
       (assert (eq? (canonical-name new) (canonical-name old)))
       (if (running? old)
           (slot-set! old 'replacement new)
           (replace-service old new)))))

  (for-each register-single-service new-services))
--8<---cut here---end--->8---
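
To illustrate how a lookup by canonical name can trip that assertion, here is a minimal Guile sketch.  It does not use the Shepherd's own code: the `<svc>` record, `%registry`, `register!` and the service names are all hypothetical stand-ins, and the scenario is not necessarily the exact sequence of events from the report.  It only shows that when the new service's canonical name is already registered as a secondary name of a differently named service, the check fails.

```scheme
(use-modules (srfi srfi-9)      ; define-record-type
             (ice-9 match))

;; Hypothetical stand-in for a service: just the list of names it provides.
;; By convention the first name is the canonical one.
(define-record-type <svc>
  (make-svc provides)
  svc?
  (provides svc-provides))

(define (svc-canonical-name svc)
  (car (svc-provides svc)))

;; Toy registry mapping each provided name to a list of services.
(define %registry (make-hash-table))

(define (lookup name)
  (hashq-ref %registry name '()))

(define (register! new)
  "Toy version of the logic above: look up by NEW's canonical name only."
  (match (lookup (svc-canonical-name new))
    (()                                   ; nothing registered under that name
     (for-each (lambda (name)
                 (hashq-set! %registry name (cons new (lookup name))))
               (svc-provides new))
     'added)
    ((old . _)                            ; something is already there
     (unless (eq? (svc-canonical-name new) (svc-canonical-name old))
       (throw 'assertion-failed))
     'replacement-scheduled)))

;; The old generation provides two names; the new one was renamed so that
;; the former secondary name is now its canonical name.
(define old-service (make-svc '(old-name shared-name)))
(define new-service (make-svc '(shared-name)))

(register! old-service)

(define result
  (catch 'assertion-failed
    (lambda () (register! new-service))
    (lambda (key . args) key)))

(display result)                          ; prints "assertion-failed"
(newline)
```

Looking up only the canonical name of the new service finds the old one through its secondary name, and the two canonical names then differ.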

‘register-services’ was called from ‘replace-service’, itself called
from ‘stop’ (right after ‘networking’ had been actually stopped):

--8<---cut here---start->8---
;; SERVICE is no longer running.
(slot-set! service 'running #f)

;; Reset the list of respawns.
(slot-set! service 'last-respawns '())

;; Replace the service with its replacement, if it has one
(let ((replacement (slot-ref service 'replacement)))
  (when replacement
    (replace-service service replacement))) ;<- here
--8<---cut here---end--->8---

The assertion failure was thrown here.  ‘stop’ calls itself
recursively but it doesn’t catch exceptions from recursive calls:

--8<---cut here---start->8---
  (fold-services (lambda (other acc)
                   (if (and (running? other)
                            (required-by? service other))
                       (append (stop other) acc)   ;<- here
                       acc))
                 '())
--8<---cut here---end--->8---

The problem is that ‘root-service’ marks itself as stopped before it has
actually shut down the services:

--8<---cut here---start->8---
#:stop (lambda (unused . args)
         (local-output (l10n "Exiting shepherd..."))
         ;; Prevent that we try to stop ourself again.
         (slot-set! root-service 'running #f)
         (shutdown-services)
         (quit))
--8<---cut here---end--->8---

So what happened is that ‘shutdown-services’ threw, the exception wasn’t
caught, and thus it never called ‘quit’.  QED.
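
The sequence described above can be reproduced in isolation with a small Guile sketch.  Everything in it is a hypothetical stand-in rather than Shepherd code: `root-running?` plays the role of the 'running slot, `shutdown-services` simply throws, `fake-quit` records that it was called instead of exiting, and the outer `catch` stands for whatever reports the failed action and carries on.

```scheme
;; Hypothetical state standing in for the 'running slot of the root service.
(define root-running? #t)
(define quit-called? #f)

(define (shutdown-services)
  ;; Stand-in that fails the way the assertion did.
  (throw 'assertion-failed))

(define (fake-quit)
  ;; Stand-in for 'quit': record the call instead of exiting.
  (set! quit-called? #t))

(define (stop-root)
  (set! root-running? #f)   ; marked as stopped first...
  (shutdown-services)       ; ...then this throws...
  (fake-quit))              ; ...so this line is never reached.

;; The caller reports the error and keeps going.
(catch #t
  stop-root
  (lambda (key . args)
    (format #t "~a~%" key)))              ; prints "assertion-failed"

(format #t "running? ~a, quit called? ~a~%" root-running? quit-called?)
;; prints "running? #f, quit called? #f"
```

The process is still alive, yet its root service is recorded as not running, which matches the "Service root is not running." reply seen on the second attempt.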


The service registry in the soon-to-be-released 0.10.0 no longer has
that assertion failure (it vanished in
08510a2a2aaab388c90dd402bd7506d33014454f).  Instead, it registers a
replacement for the first service found to have one of the names of the
new service.
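
As a rough illustration of that behavior (a sketch under assumptions, not the actual 0.10.0 code), the lookup can consider every name of the new service and attach the replacement to the first match.  The `<svc>` record, `%registry`, `register!` and the service names below are hypothetical.

```scheme
(use-modules (srfi srfi-1)      ; any
             (srfi srfi-9))     ; define-record-type

;; Hypothetical stand-in for a service with a mutable replacement field.
(define-record-type <svc>
  (make-svc provides replacement)
  svc?
  (provides svc-provides)
  (replacement svc-replacement set-svc-replacement!))

;; Toy registry mapping each provided name to a single service.
(define %registry (make-hash-table))

(define (register! new)
  ;; Look for an already registered service under any of NEW's names.
  (let ((old (any (lambda (name) (hashq-ref %registry name #f))
                  (svc-provides new))))
    (if old
        (begin
          (set-svc-replacement! old new)  ; no canonical-name check here
          'replacement-registered)
        (begin
          (for-each (lambda (name) (hashq-set! %registry name new))
                    (svc-provides new))
          'added))))

(define old-service (make-svc '(old-name shared-name) #f))
(define new-service (make-svc '(new-name shared-name) #f))

(register! old-service)
(define result (register! new-service))

(display result)                          ; prints "replacement-registered"
(newline)
```

With this approach a canonical name change no longer raises an error, since any shared name is enough to pair the new service with the old one.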

The problem of the #:stop method of ‘root-service’ still exists: if
‘shutdown-services’ throws, the ‘root’ service won’t terminate and it’ll
remain in ‘stopping’ status.  We could wrap the ‘shutdown-services’ call
in ‘false-if-exception’, but I’d lean towards not doing it: it’s not
supposed to throw, so maybe it’s best not to swallow the exception.
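
For reference, the ‘false-if-exception’ variant mentioned above would behave as in the following Guile sketch.  `shutdown-services` and `fake-quit` are hypothetical stand-ins, not Shepherd procedures; only ‘false-if-exception’ itself is the real Guile macro.

```scheme
(define quit-called? #f)

(define (shutdown-services)
  ;; Stand-in that always fails.
  (throw 'assertion-failed))

(define (fake-quit)
  ;; Stand-in for 'quit': record the call instead of exiting.
  (set! quit-called? #t))

(define (stop-root/lenient)
  ;; The exception is swallowed and the expression evaluates to #f...
  (false-if-exception (shutdown-services))
  ;; ...so control always reaches this point.
  (fake-quit))

(stop-root/lenient)

(format #t "quit called? ~a~%" quit-called?)   ; prints "quit called? #t"
```

The trade-off is the one stated above: termination is guaranteed, but the failure in `shutdown-services` goes unreported.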

To summarize, I believe the problem is solved in 0.10.

Thanks,
Ludo’.





bug#62619: Shepherd desertion upon service canonical name change?

2023-04-02 Thread Bruno Victal

Upon a guix system reconfigure, if a running service has undergone a
canonical name change since the previous generation, then shutdown or
reboot commands fail, with shepherd indicating itself (root service)
as stopped.


Suspected commit range with the change that triggered this:
from: f01b5299db6031174f05124b843c936388cd872a
  to: 380faf265b0c3b231ab8b69597d161be5e704e18


Console messages:

--8<---cut here---start->8---
mirai@guix-nuc ~$ sudo guix system reconfigure config.scm
…
building /gnu/store/72d7db3ckjswjdfv2w0a6wk2yspqakgc-upgrade-shepherd-services.scm.drv...
shepherd: Service host-name has been started.
shepherd: Service user-homes has been started.
shepherd: Service sysctl has been started.
shepherd: Service host-name has been started.
shepherd: Service term-console could not be started.
shepherd: Service x11-socket-directory has been started.
shepherd: Service NetworkManager conflicts with running services (networking).
To complete the upgrade, run 'herd restart SERVICE' to stop,
upgrade, and restart each service that was not automatically restarted.
Run 'herd status' to view the list of services on your system.
mirai@guix-nuc ~$ sudo reboot
Exiting shepherd...
Service nftables has been stopped.
Service console-font-tty2 has been stopped.
Service term-tty2 has been stopped.
Service avahi-daemon has been stopped.
Service xorg-server has been stopped.
Service nscd has been stopped.
Service console-font-tty1 has been stopped.
Service term-tty1 has been stopped.
Service mcron has been stopped.
Service console-font-tty5 has been stopped.
Service term-tty5 has been stopped.
Service ntpd has been stopped.
Service tor has been stopped.
wrong-type-arg("for-each" "Wrong type argument: ~S" (4332) ())
Service xvnc has been stopped.
Service git-daemon has been stopped.
Assertion (eq? (canonical-name new) (canonical-name old)) failed.
assertion-failed()
Service root has been stopped.
mirai@guix-nuc ~$ sudo reboot
mirai@guix-nuc ~$ sudo shutdown
Service root is not running.
mirai@guix-nuc ~$
--8<---cut here---end--->8---


/var/log/messages excerpt:

--8<---cut here---start->8---
…
Apr  2 15:12:03 localhost elogind[421]: New session 1 of user root.
Apr  2 15:12:33 localhost nscd: 11685 monitored file `/etc/hosts` was deleted, removing watch
Apr  2 15:12:33 localhost nscd: 11685 monitored file `/etc/hosts` was created, adding watch
Apr  2 15:12:33 localhost nscd: 11685 monitored file `/etc/hosts` was written to
Apr  2 15:12:33 localhost nscd: 11685 monitored file `/etc/nsswitch.conf` was deleted, removing watch
Apr  2 15:12:33 localhost nscd: 11685 monitored file `/etc/nsswitch.conf` was deleted, removing watch
Apr  2 15:12:33 localhost nscd: 11685 monitored file `/etc/nsswitch.conf` was created, adding watch
Apr  2 15:12:33 localhost nscd: 11685 monitored file `/etc/nsswitch.conf` was created, adding watch
Apr  2 15:12:33 localhost nscd: 11685 monitored file `/etc/nsswitch.conf` was written to
Apr  2 15:12:33 localhost nscd: 11685 monitored file `/etc/services` was deleted, removing watch
Apr  2 15:12:33 localhost nscd: 11685 monitored file `/etc/services` was created, adding watch
Apr  2 15:12:33 localhost nscd: 11685 monitored file `/etc/services` was written to
Apr  2 15:12:44 localhost shepherd[1]: Evaluating user expression (and (defined? (quote transient?)) (map (# ?) ?)).
Apr  2 15:12:44 localhost shepherd[1]: Evaluating user expression (let ((services (map primitive-load (# # (?)).
Apr  2 15:12:44 localhost shepherd[1]: Service host-name has been started.
Apr  2 15:12:44 localhost shepherd[1]: Service user-homes has been started.
Apr  2 15:12:44 localhost shepherd[1]: [sysctl] net.ipv6.conf.all.temp_valid_lft = 5400
Apr  2 15:12:44 localhost shepherd[1]: [sysctl] net.ipv6.conf.all.temp_prefered_lft = 2700
Apr  2 15:12:44 localhost shepherd[1]: [sysctl] fs.protected_hardlinks = 1
Apr  2 15:12:44 localhost shepherd[1]: [sysctl] fs.protected_symlinks = 1
Apr  2 15:12:44 localhost shepherd[1]: Service sysctl has been started.
Apr  2 15:12:44 localhost shepherd[1]: Service host-name has been started.
Apr  2 15:12:44 localhost shepherd[1]: Service term-console could not be started.
Apr  2 15:12:44 localhost shepherd[1]: Service x11-socket-directory has been started.
Apr  2 15:12:44 localhost shepherd[1]: Service NetworkManager conflicts with running services (networking).
Apr  2 15:12:44 localhost elogind[421]: Removed session 1.
Apr  2 15:13:07 localhost elogind[421]: New session 1 of user root.
Apr  2 15:13:07 localhost shepherd[1]: Exiting shepherd...
Apr  2 15:13:07 localhost shepherd[1]: Service nftables has been stopped.
Apr  2 15:13:07 localhost shepherd[1]: Service console-font-tty2 has been stopped.
Apr  2 15:13:07 localhost shepherd[1]: Service term-tty2 has been stopped.
Apr  2 15:13:07