** Attachment added: "Networkctl output"
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/2099676/+attachment/5859200/+files/networkctl.txt
** Description changed:
# Our problem
We are running multiple K8s clusters on Ubuntu 24.04.1 LTS nodes.
On one of these clusters, we have noticed at least twice that most of the
nodes (~5 out of 8) went offline without any action on our side.
To restore connectivity, we tried ifdown/ifup, disconnecting and reconnecting the
network from the hypervisor, and restarting the networking service, but nothing
helped; we had to reboot the nodes from the console.
After some investigation, we were able to correlate this outage with the
`apt-daily-upgrade` service run triggered by the `apt-daily-upgrade` timer.
- Somehow, the `apt-daily-upgrade` service updated a package which triggered a
`systemctl daemon-reexec`, cutting network connectivity in the process.
+ Somehow, the `apt-daily-upgrade` service updated a package which triggered a
`systemctl daemon-reexec`, cutting network connectivity in the process.
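
For anyone who wants to check their own nodes for the same correlation, here is a minimal sketch that scans journal lines for the message systemd logs when a client requests a re-exec, and reports which unit asked for it. The names `REEXEC_RE`, `ReexecRequest` and `find_reexec_requests` are ours and purely illustrative, and the pattern is based only on the message format seen in our journal, so it may need adjusting for other systemd versions.

```python
import re
from typing import Iterable, List, NamedTuple

# Pattern for the message PID 1 logs when a client requests a re-exec, e.g.
#   Reexecuting requested from client PID 2711048 ('systemctl') (unit apt-daily-upgrade.service)...
# Assumption: the format is taken from the journal excerpt in this report only.
REEXEC_RE = re.compile(
    r"Reexecuting requested from client PID (?P<pid>\d+) "
    r"\('(?P<comm>[^']*)'\) \(unit (?P<unit>[^)]+)\)"
)


class ReexecRequest(NamedTuple):
    """One re-exec request found in the journal (hypothetical helper type)."""
    pid: int   # PID of the client that asked for the re-exec
    comm: str  # command name of that client, e.g. 'systemctl'
    unit: str  # unit the client was running in


def find_reexec_requests(lines: Iterable[str]) -> List[ReexecRequest]:
    """Return every re-exec request found in the given journal lines."""
    found = []
    for line in lines:
        match = REEXEC_RE.search(line)
        if match:
            found.append(
                ReexecRequest(int(match["pid"]), match["comm"], match["unit"])
            )
    return found
```

The function takes any iterable of unwrapped journal lines, for example the output of `journalctl -b -1` for the boot preceding the forced reboot. A request whose `unit` is `apt-daily-upgrade.service` matches what we observed.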
# Symptoms
Node is flagged as `NotReady` by K8s
SSH connection to node is not working
From the node, we can't ping the gateway
The journal output for `systemctl daemon-reexec` is far more verbose
than usual:
```
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Reexecuting requested from
client PID 2711048 ('systemctl') (unit apt-daily-upgrade.service)...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Reexecuting.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: systemd 255.4-1ubuntu8.5
running in system mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP
+GCRYPT -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC
+KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY +P11KIT +QRENCODE
+TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -BPF_FRAMEWORK -XKBCOMMON +UTMP
+SYSVINIT default-hierarchy=unified)
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Detected virtualization vmware.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Detected architecture x86-64.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Starting man-db.service - Daily
man-db regeneration...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping containerd.service -
containerd container runtime...
févr. 21 06:06:55 lylux0634kdp004 ntpd[1106]: ERR: ntpd exiting on signal 15
(Terminated)
févr. 21 06:06:55 lylux0634kdp004 ntpd[1106]: PROTO: 172.16.10.254 unlink
local addr 172.16.34.4 ->
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping ntpsec.service -
Network Time Service...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping open-vm-tools.service
- Service for virtual machines hosted on VMware...
févr. 21 06:06:55 lylux0634kdp004 systemd-journald[504]: Journal stopped
févr. 21 06:06:55 lylux0634kdp004 systemd-journald[504]: Received SIGTERM
from PID 1 (systemd).
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopping
systemd-journald.service - Journal Service...
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: ntpsec.service: Deactivated
successfully.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopped ntpsec.service -
Network Time Service.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: ntpsec.service: Consumed 1min
12.819s CPU time, 12.4M memory peak, 0B memory swap peak.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Deactivated
successfully.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit
process 3374 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit
process 3375 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit
process 3475 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit
process 3512 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit
process 3545 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit
process 3618 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Unit
process 2574706 (containerd-shim) remains running after unit stopped.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: Stopped containerd.service -
containerd container runtime.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Consumed
9min 54.298s CPU time, 3.4G memory peak, 0B memory swap peak.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: Found
left-over process 3374 (containerd-shim) in control group while starting unit.
Ignoring.
févr. 21 06:06:55 lylux0634kdp004 systemd[1]: containerd.service: This
usually indicates unclean termination of a previous run, or service
implementation deficiencies.
févr. 21 06:06:55