Hi!

Maxim Cournoyer <maxim.courno...@gmail.com> wrote:

> Fixes <https://issues.guix.gnu.org/41625>.
>
> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it
> takes a single machine object, to allow for retrying a single machine. Handle
> the case where the checks raised an exception due to the connection to the
> build machine having been lost, and retry up to 3 times. Ensure the cleanup
> code is run in all situations.
> (check-machines-availability): New procedure. Call
> CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about
> twice as fast with 4 build machines, from ~30 s to ~15 s).
> * guix/inferior.scm (&inferior-connection-lost): New condition type.
> (read-repl-response): Raise a condition of the above type when reading EOF
> from the build machine's port.

[...]

> +(define-condition-type &inferior-connection-lost &error
> +  inferior-connection-lost?)
> +
>  (define* (read-repl-response port #:optional inferior)
>    "Read a (guix repl) response from PORT and return it as a Scheme object.
>  Raise '&inferior-exception' when an exception is read from PORT."
> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
>    (match (read port)
>      (('values objects ...)
>       (apply values (map sexp->object objects)))
> +    ;; Unexpectedly read EOF from the port. This can happen for example when
> +    ;; the underlying connection for PORT was lost with Guile-SSH.
> +    (? eof-object?
> +       (raise (condition (&inferior-connection-lost))))

The match clause syntax is incorrect; should be:

  ((? eof-object?)
   (raise …))

> +  (info (G_ "Testing ~a build machines defined in '~a'...~%")
>          (length machines) machine-file)
> -  (let* ((names    (map build-machine-name machines))
> -         (sockets  (map build-machine-daemon-socket machines))
> -         (sessions (map (cut open-ssh-session <> %short-timeout) machines))
> -         (nodes    (map remote-inferior sessions)))
> -    (for-each assert-node-has-guix nodes names)
> -    (for-each assert-node-repl nodes names)
> -    (for-each assert-node-can-import sessions nodes names sockets)
> -    (for-each assert-node-can-export sessions nodes names sockets)
> -    (for-each close-inferior nodes)
> -    (for-each disconnect! sessions))))
> +  (par-for-each check-machine-availability machines)))

Why not! IMO this should go in a separate patch, though, since it’s not
related.

> +(define (check-machine-availability machine)
> +  "Check whether MACHINE is available. Exit with an error upon failure."
> +  ;; Sometimes, the machine remote port may return EOF, presumably because the
> +  ;; connection was lost. Retry up to 3 times.
> +  (let loop ((retries 3))
> +    (guard (c ((inferior-connection-lost? c)
> +               (let ((retries-left (1- retries)))
> +                 (if (> retries-left 0)
> +                     (begin
> +                       (format (current-error-port)
> +                               (G_ "connection to machine ~s lost; retrying~%")
> +                               (build-machine-name machine))
> +                       (loop (retries-left)))
> +                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
> +                            (build-machine-name machine))))))

I’m afraid we’re papering over problems here.

Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
berlin enough to reproduce the issue? If so, we could monitor/strace
sshd on overdrive1 to get a better understanding of what’s going on.

WDYT?

Thanks,
Ludo’.