Hi Ludovic,

Ludovic Courtès <l...@gnu.org> writes:

[...]

> I see.  So I’d say it’s a prerequisite (a patch that must come before)
> but not entirely the same thing.  I’m nitpicking!

Eh, it's okay :-).  Splitting changes into the right unit is a problem
that is akin to naming things; it's hard!  I welcome your suggestion.

> We should make sure it doesn’t trigger thread-safety issues in libssh or
> anything like that (running it repeatedly on a large machines.scm should
> give us some confidence).

It seems fine so far, but I've only tested in a loop with 4 build
machines.  When it nears completion I'll give it a shot on berlin.

[...]

> Yes, but note that this is just for ‘guix offload test’.  The actual
> code run while offloading will still fail badly.

Ah, thanks for pointing that out; I somehow thought that this machine
status checking code was a prelude to every offloaded build.

[...]

>> I don't have a password set for my user on overdrive1, so can't attach
>> strace to sshd, but yeah, we could try to capture it and see if we can
>> understand what's going on.
>
> OK.

I'd be happy to try strace when you are available.  You can ping me on
the chat.  It's been more than 8 hours since I tried, so I should be
able to trigger the problem :-).

[...]

> Perhaps worth adding an ‘inferior’ and/or ‘port’ field.  That would
> allow the handler to present more information as to which inferior is
> failing.
>
> Maybe ‘premature-eof’ would be more accurate than ‘connection-lost’.

Good suggestions.  I'll implement them.

>> +        (format (current-error-port)
>> +                (G_ "connection to machine '~a' lost; retrying~%")
>> +                (build-machine-name machine))
>
> You can use ‘info’ instead of ‘format’.

That also.  Thanks!

On another note, I was able to 'exercise' the fix.  The exception is
raised, but instead of the check being retried, something fails with
the following backtrace:

--8<---------------cut here---------------start------------->8---
guix offload: Testing 1 build machines defined in '/etc/guix/machines.scm'...
connection to machine 'overdrive1.guix.gnu.org' lost; retrying
Backtrace:
In ice-9/boot-9.scm:
  1752:10 10 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In unknown file:
           9 (apply-smob/0 #<thunk 7f915c028f60>)
In ice-9/boot-9.scm:
    724:2  8 (call-with-prompt _ _ #<procedure default-prompt-handler (k proc)>)
In ice-9/eval.scm:
    619:8  7 (_ #(#(#<directory (guile-user) 7f915c022c80>)))
In guix/ui.scm:
  2161:12  6 (run-guix-command _ . _)
In ice-9/boot-9.scm:
  1752:10  5 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
  1747:15  4 (with-exception-handler #<procedure 7f91576bf0c0 at ice-9/boot-9.scm:1831:7 (exn)> _ # _ # …)
In srfi/srfi-1.scm:
    634:9  3 (for-each #<procedure check-machine-availability (a)> (#<<build-machine> name: "overdriv…>))
In ice-9/eval.scm:
   191:35  2 (_ #(#(#(#<directory (guix scripts offload) 7f9159852780> 3 #<<build-machine> na…> …) …) …))
Exception thrown while printing backtrace:
In procedure frame-local-ref: Argument 2 out of range: 1

ice-9/boot-9.scm:1685:16: In procedure raise-exception:
Wrong type to apply: 2
--8<---------------cut here---------------end--------------->8---

I haven't been able to pinpoint what fails yet.  Notice that in the
above code I've replaced par-for-each with plain for-each, suspecting it
might have something to do with it, but it appears unrelated.

Thanks,

Maxim