Hi Ludovic,

executive summary: it is (was) a "network architecture" mistake on my
side, since I was mixing a device with static networking defined via
Guix with a bridge defined via libvirt... and this is not good.  The
more I think about it, the more I'm convinced that trying to add a route
for device "swws-bridge" (see below) in the "eno1" [1] static-networking
declaration is simply a... mistake.

Julien, I'm adding you in Cc: only because you develop guile-netlink and
maybe you could see if it's possible to improve netlink-related error
messages.

Ludovic Courtès <l...@gnu.org> writes:

> Giovanni Biscuolo <g...@xelera.eu> skribis:
>
>> after a reboot on a running remote host (it was running since several
>> guix system generations ago... but with no reboots meanwhile) I get a
>> failing networking service and consequently the ssh service (et al)
>> refuses to start :-(
>>
>> Sorry I've no text to show you but a screenshot (see attachment below)
>> because I'm connecting with a remote KVM console appliance.

In a follow-up message I was then able to copy the actual error message:

--8<---------------cut here---------------start------------->8---
Jun 14 11:28:32 localhost vmunix: [ 6.258520] shepherd[1]: Starting service networking...
Jun 14 11:28:32 localhost vmunix: [ 6.472949] shepherd[1]: Service networking failed to start.
Jun 14 11:28:32 localhost vmunix: [ 6.474842] shepherd[1]: Exception caught while starting networking: (no-such-device "swws-bridge")
Jun 14 11:28:32 localhost vmunix: [ 6.492344] shepherd[1]: Starting service networking...
Jun 14 11:28:32 localhost vmunix: [ 6.509652] shepherd[1]: Exception caught while starting networking: (%exception #<&netlink-response-error errno: 17>)
Jun 14 11:28:32 localhost vmunix: [ 6.510034] shepherd[1]: Service networking failed to start.
--8<---------------cut here---------------end--------------->8---

Then (in the same message) I described how I was able to solve my issue;
this is the "core" of my configuration _mistake:_

--8<---------------cut here---------------start------------->8---
(service static-networking-service-type
         (list
          (static-networking
           (addresses
            (list
             (network-address
              (device ane-wan-device)
              (value (string-append ane-wan-ip4 "/24")))))
           (routes
            (list
             (network-route
              (destination "default")
              (gateway ane-wan-gateway))))
             ;; ip route add 10.1.2.0/24 dev swws-bridge via 192.168.133.12
             ;; (network-route
             ;;  (destination "10.1.2.0/24") ;; lxcbr0 net
             ;;  (device swws-bridge-name)
             ;;  (gateway "192.168.133.12")))) ;; on node002
           (name-servers '("185.12.64.1" "185.12.64.1")))))
--8<---------------cut here---------------end--------------->8---

I commented out the second network-route definition, the one using
"swws-bridge" [1] as the device to route to 10.1.2.0/24 via
192.168.133.12.

When I used that code, AFAIU the first time shepherd tried to start the
networking service it failed because "swws-bridge" was missing, with
(guile-)netlink raising "no-such-device"; then it tried again and failed
because the very same route was already defined (but not functional).

A failing networking service (although the interface is up and running)
means that ssh (et al) fails to start, because networking is a
requirement of the ssh service.

> 17 = EEXIST, which is netlink’s way of saying that the device/route/link
> it’s trying to add already exists.

Ah thanks!  I was not able to find that error code.
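
For reference, a numeric errno like the one in the log can be decoded
with the platform's own errno table.  A minimal Python sketch, only as
an illustration: it is not part of Guix or guile-netlink, and
`describe_errno` is a hypothetical helper name.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return a readable description of a numeric errno value."""
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "unknown errno")
    # os.strerror returns the C library's message for the code.
    return f"{name} ({os.strerror(code)})"


if __name__ == "__main__":
    # 17 is the value reported in the netlink-response-error above.
    print(describe_errno(17))
```

On a GNU/Linux system this should print "EEXIST (File exists)".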

When run on the command line I get:

--8<---------------cut here---------------start------------->8---
g@ane ~$ sudo ip route add 10.1.2.0/24 dev swws-bridge via 192.168.133.12
RTNETLINK answers: File exists
--8<---------------cut here---------------end--------------->8---

Is it possible to have the same error and/or a little bit of context in
syslog when this happens with 'network-set-up/linux'?

Anyway, I think that "ip route" should just be idempotent... but maybe
I'm missing something.  (And this is obviously not a downstream issue.)

> The problem here is that static networking adds devices, routes, and
> links (see ‘network-set-up/linux’ in the code).  If it fails in the
> middle, then it may have added devices without adding routes, so you end
> up with half-configured networking.  Ideally this would be
> transactional.

Well, actually it would be a pity to fail a whole static-networking
"just" for a failing /secondary/ route, no?

But as I said in the "executive summary", how could I /dare/ to
declaratively add (with Guix System) a similar route for "swws-bridge"
when "swws-bridge" is managed by libvirt?  I should simply use libvirt
to add that!  :-)

https://libvirt.org/formatnetwork.html#static-routes

> When that happens, you need to check the logs and use the ‘ip’ command
> to figure out which part failed exactly.  In your case, the root problem
> seems to be that “swws-bridge” did not exist.

Yes, I can confirm this.

> Then you can (1) manually fix it with ‘ip’, and (2) adjust your Guix
> System config to fix the problems you found.
>
> This is inconvenient at best.  I would be interested in hearing
> suggestions on how to improve on this.

Oh well, for my use-case I don't think there is anything to improve: I
just have to keep the "eno1" device configuration _separate_ from the
"swws-bridge" one (even if "swws-bridge" were defined via
static-networking and not libvirt).

The only suggestion I have is to add more "user friendly" error messages
in syslog for netlink-related errors: it would have helped me more to
read "adding route, RTNETLINK answers: File exists" than
"netlink-response-error errno: 17".

Thank you and... happy hacking!  Gio'

[1] swws-bridge-name is defined as "swws-bridge"
    ane-wan-device is defined as "eno1"

-- 
Giovanni Biscuolo

Xelera IT Infrastructures