Verification done on focal (full steps in comment #9).

The issue is not reproducible with the package in -proposed, and is reproducible with the 2 switches for the old behavior (DHCP_FD_FLAGS_POKE=0 or dhcp.fd_flags_poke=0), as expected.
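For orientation, here is a rough sketch of how the two switches could be interpreted. It is an illustration in Python, not the patch's actual C code. Only the names DHCP_FD_FLAGS_POKE and dhcp.fd_flags_poke come from this bug; the helper name `fd_flags_poke_enabled`, the rule that only the literal value `0` selects the old behavior, and the idea that either source alone is sufficient are assumptions.

```python
import os

ENV_SWITCH = "DHCP_FD_FLAGS_POKE"       # name taken from the bug report
CMDLINE_SWITCH = "dhcp.fd_flags_poke"   # name taken from the bug report


def fd_flags_poke_enabled(environ=None, cmdline=None):
    """Return False if either switch asks for the old behavior.

    Hypothetical helper: assumes only the literal value "0" disables
    the new behavior, and that either source alone is sufficient.
    """
    if environ is None:
        environ = os.environ
    if cmdline is None:
        try:
            with open("/proc/cmdline") as f:
                cmdline = f.read()
        except OSError:
            cmdline = ""

    # Per-process switch: environment variable.
    if environ.get(ENV_SWITCH) == "0":
        return False

    # System-wide switch: whitespace-separated kernel cmdline token.
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == CMDLINE_SWITCH and sep and value == "0":
            return False

    return True
```

Passing `environ` and `cmdline` explicitly makes the sketch easy to try without touching the real environment or bind-mounting over /proc/cmdline, e.g. `fd_flags_poke_enabled({}, "ro dhcp.fd_flags_poke=0")`.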
...

# add-apt-repository -y 'deb http://archive.ubuntu.com/ubuntu focal-proposed main'

# apt policy isc-dhcp-client
isc-dhcp-client:
  Installed: 4.4.1-2.1ubuntu5.20.04.4
  Candidate: 4.4.1-2.1ubuntu5.20.04.5
  Version table:
     4.4.1-2.1ubuntu5.20.04.5 500
        500 http://archive.ubuntu.com/ubuntu focal-proposed/main amd64 Packages
...

# wget https://launchpad.net/ubuntu/+archive/primary/+files/isc-dhcp-client-dbgsym_4.4.1-2.1ubuntu5.20.04.5_amd64.ddeb

# apt install -y isc-dhcp-client ./isc-dhcp-client-dbgsym_4.4.1-2.1ubuntu5.20.04.5_amd64.ddeb gdb

Source code line numbers (for breakpoint):

233 isc_result_t omapi_register_io_object (omapi_object_t *h,
...
312         status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
...
333         for (p = omapi_io_states.next;

# ip netns exec ns1 \
  gdb -ex 'set target-async on' -ex 'set non-stop on' -ex 'set pagination off' -ex 'set confirm off' -q dhclient

(gdb) break omapip/dispatch.c:333
(gdb) commands
shell sleep 0.2
continue
end

(gdb) run -d -v veth1
...
Thread 1 "dhclient" hit Breakpoint 1, omapi_register_io_object (h=0x561afb034940, readfd=0x561afad13630 <if_readsocket>, writefd=writefd@entry=0x0, reader=0x561afad30fb0 <fallback_discard>, writer=writer@entry=0x0, reaper=reaper@entry=0x0) at dispatch.c:337
337     in dispatch.c
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0x72ac8a14)
DHCPOFFER of 192.168.42.100 from 192.168.42.1
DHCPREQUEST for 192.168.42.100 on veth1 to 255.255.255.255 port 67 (xid=0x148aac72)
DHCPACK of 192.168.42.100 from 192.168.42.1 (xid=0x72ac8a14)
[Detaching after fork from child process 1037683]
bound to 192.168.42.100 -- renewal in 290 seconds.
^C
Thread 1 "dhclient" received signal SIGINT, Interrupt.
...

(gdb) run -d -v veth1 -r
...
DHCPRELEASE of 192.168.42.100 on veth1 to 192.168.42.1 port 67 (xid=0x20449570)
...

<<< WORKS 10/10 >>>

...

(gdb) set environment DHCP_FD_FLAGS_POKE 0
(gdb) run -d -v veth1
...
Thread 1 "dhclient" hit Breakpoint 1, omapi_register_io_object (h=0x557c0d5d1350, readfd=0x557c0beb8630 <if_readsocket>, writefd=0x0, reader=0x557c0bed5fb0 <fallback_discard>, writer=0x0, reaper=0x0) at dispatch.c:337
337     in dispatch.c
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0xa5a2783d)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 6 (xid=0xa5a2783d)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 8 (xid=0xa5a2783d)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 10 (xid=0xa5a2783d)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 21 (xid=0xa5a2783d)
^C
Thread 1 "dhclient" received signal SIGINT, Interrupt.
...
(gdb) kill

<<< FAILS 3/3 >>>

(gdb) unset environment DHCP_FD_FLAGS_POKE

...

(gdb) shell echo "$(cat /proc/cmdline) dhcp.fd_flags_poke=0" >/tmp/cmdline
(gdb) shell mount --bind /tmp/cmdline /proc/cmdline
(gdb) shell cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.4.0-1084-kvm root=PARTUUID=7a9ea63e-b971-413c-9238-d59509520a9e ro console=tty1 console=ttyS0 dhcp.fd_flags_poke=0

(gdb) run -d -v veth1
...
Thread 1 "dhclient" hit Breakpoint 1, omapi_register_io_object (h=0x5604afcaac00, readfd=0x5604ae2d3630 <if_readsocket>, writefd=0x0, reader=0x5604ae2f0fb0 <fallback_discard>, writer=0x0, reaper=0x0) at dispatch.c:337
337     in dispatch.c
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 3 (xid=0x6095ca20)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 7 (xid=0x6095ca20)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 18 (xid=0x6095ca20)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 7 (xid=0x6095ca20)
DHCPDISCOVER on veth1 to 255.255.255.255 port 67 interval 14 (xid=0x6095ca20)
^C
Thread 1 "dhclient" received signal SIGINT, Interrupt.
...
(gdb) kill

<<< FAILS 3/3 >>>

(gdb) shell umount /proc/cmdline

https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  Fix Committed
Status in isc-dhcp source package in Jammy:
  Fix Committed

Bug description:
  [Impact]

   * Occasionally, during instance boot or machine start-up, dhclient will attempt to acquire a DHCP lease and fail, leaving the instance with no IP address and making it unreachable.

   * This happens about once every 100 reboots on bare metal, and affects between ~0.3% and 2% of deployments on Azure (comment #2).

   * Azure uses dhclient called from cloud-init instead of systemd-networkd, and this is causing issues with larger deployments.

   * The logs of an affected dhclient show the following:

     Listening on LPF/enp1s0/52:54:00:1c:d7:00
     Sending on   LPF/enp1s0/52:54:00:1c:d7:00
     Sending on   Socket/fallback
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     ...
     (omitting 20 similar lines)
     ...
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     No DHCPOFFERS received.
     No working leases in persistent database - sleeping.

   * This only impacts Focal and Jammy, where bind9-libs are multi-threaded (Bionic/earlier and Kinetic/later are single-threaded).

   * The actual problem is a thread concurrency race condition in dhclient: when the race occurs, the read socket is incorrectly/prematurely unwatched because required structures are not yet consistent, thus dhclient does not read any DHCPOFFER replies.

   * Detailed analysis of the issue is in comment #17.

  [Fix]

   * Prevent the race condition by starting to watch the read socket after required structures are consistent.

   * The fix has been tested in Azure w/ 13500 instances, and no errors have been observed (previously: 0.4%).

   * Anyway, in case regressions are observed, the patch introduces 2 switches to revert to the previous behavior, which can be applied per-process or system-wide:
     - DHCP_FD_FLAGS_POKE=0 environment variable
     - dhcp.fd_flags_poke=0 kernel cmdline option

   * (Previous approaches/discussions included reverting bind9-libs to single-threaded, but we concluded it would have more regression risk than expected [some bits in comment #8, and some internal chat], and would remove exported symbols (apparently unused, but). We also considered a mutex/spinlock approach, but later found a simpler way w/ isc lib; comment #13.)

  [Test Plan]

   * A synthetic reproducer with GDB to force the race condition, and DHCP server/client/noise injection, is described in comment #9.

   * Test with the original package (problem occurs).

   * Test with the modified package (problem fixed).
     - Set DHCP_FD_FLAGS_POKE=0 (problem occurs).
     - Set dhcp.fd_flags_poke=0 (problem occurs).

  [Regression Potential]

   * 1) dhclient failing to acquire DHCP leases.

   * 2) dhcpd is also affected by the code changes, thus failures to handle DHCP lease requests are also a potential regression.

   * 3) The functional change added by the fix, if a regression were to occur, would likely be an issue only under some (unknown) race condition as well, thus expected to be rare.

   * Note: this potentially affects Focal/Jammy on Azure as a whole, per usage of dhclient in cloud-init instead of systemd-networkd. Azure provided extensive testing for all 3 approaches (mostly internal communications, and some bug comments), with ~13k instances. No issues were observed (previously: 0.4%).

   * Such testing scale seems to indicate that there are no regressions for dhclient to acquire DHCP leases (1), nor another race condition that hits the fix/new behavior (3). With that, apparently (2) should be OK too.

   * Also, to mitigate the regression risk as much as possible, there is a very detailed analysis provided here (comments #17, #18), and more information about the fix in its patch file's comment.
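
The race described under [Impact] and the ordering change described under [Fix] can be illustrated with a toy model. This is a simplified Python sketch, not the isc-dhcp or bind9-libs code: the `IoRegistry` class, the single lock, and the function names are hypothetical, and the "packet arrives inside the window" step is forced deterministically, much as the GDB breakpoint with `sleep 0.2` forces it in the reproducer.

```python
import threading


class IoRegistry:
    """Toy model of the state shared by the main and watcher threads."""

    def __init__(self):
        self.lock = threading.Lock()
        self.registered = set()   # stands in for the I/O object list
        self.watched = set()      # fds the watcher thread still polls

    def on_readable(self, fd):
        """Watcher-thread callback: fd has data to read."""
        with self.lock:
            if fd not in self.registered:
                # Structures not consistent yet: stop watching fd.
                self.watched.discard(fd)
                return False
            return True


def packet_arrives(reg, fd):
    """Deliver one 'readable' event from a separate watcher thread."""
    result = []
    t = threading.Thread(target=lambda: result.append(reg.on_readable(fd)))
    t.start()
    t.join()
    return result[0]


def register_old(reg, fd):
    """Old order: watch first, finish registration afterwards."""
    with reg.lock:
        reg.watched.add(fd)
    packet_arrives(reg, fd)       # event lands inside the race window
    with reg.lock:
        reg.registered.add(fd)


def register_fixed(reg, fd):
    """Fixed order: finish registration, then start watching."""
    with reg.lock:
        reg.registered.add(fd)
        reg.watched.add(fd)
    packet_arrives(reg, fd)       # event can only arrive after both steps


old, fixed = IoRegistry(), IoRegistry()
register_old(old, 5)
register_fixed(fixed, 5)
print("old order still watching fd:", 5 in old.watched)
print("fixed order still watching fd:", 5 in fixed.watched)
```

In the old order the fd ends up registered but no longer watched, which corresponds to dhclient sending DHCPDISCOVER repeatedly without ever reading a DHCPOFFER. In the fixed order there is no window in which the watcher can see a watched fd that is not yet registered.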