From the 'Other Info' section.
Detailed analysis of the issue and more.

[Other Info]

Full log: https://paste.ubuntu.com/p/8yBfw2KR5h/
Log of a working run: https://paste.ubuntu.com/p/N3ZgqrxyQD/

When you run tcpdump on the interface dhclient is using, we see every
DHCPDISCOVER packet being replied to with a DHCPOFFER packet, but the
got_one() callback is never called: dhclient does not read these
DHCPOFFER packets and continues sending DHCPDISCOVER packets. Once it
has sent 25 DHCPDISCOVER packets, it gives up.

This behaviour led several bug reporters to believe it was a kernel
issue, with the kernel not delivering DHCPOFFER packets to dhclient.
This is not the case: the actual problem is a thread concurrency race
condition in dhclient. When the race occurs, the read socket is closed
prematurely, and dhclient never reads any of the DHCPOFFER replies.

tcpdump: 'Screenshot of Wireshark' attached.

...

I was reading around the upstream issue trackers, and found the
following two bug reports:

https://gitlab.isc.org/isc-projects/dhcp/-/issues/264
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=996356

The ISC upstream report was actually quite detailed, and it has the
same symptoms as what we are experiencing.

Let's have a look at the root cause. The code I am using is isc-dhcp
4.4.1-2.1ubuntu5.20.04.4 from Focal.

common/discover.c

 567 void
 568 discover_interfaces(int state) {
...
1002         case AF_INET:
1003         default:
1004             status = omapi_register_io_object((omapi_object_t *)tmp,
1005                               if_readsocket,
1006                               0, got_one, 0, 0);
1007             break;
1008         }
...

In discover.c, we call discover_interfaces() to iterate over the
interfaces and attempt to register a raw socket against each one. We do
this by calling omapi_register_io_object(), which sets the socket up for
reading and wires in the elusive got_one() callback, the one you
instrumented to see whether it was being called or not.

omapip/dispatch.c

196 /* Register an I/O handle so that we can do asynchronous I/O on it. */
197
198 isc_result_t omapi_register_io_object (omapi_object_t *h,
199                        int (*readfd) (omapi_object_t *),
200                        int (*writefd) (omapi_object_t *),
201                        isc_result_t (*reader)
202                         (omapi_object_t *),
203                        isc_result_t (*writer)
204                         (omapi_object_t *),
205                        isc_result_t (*reaper)
206                         (omapi_object_t *))
207 {
...
241     /*
242      * Attach the I/O object to the isc socket library via the
243      * fdwatch function.  This allows the socket library to watch
244      * over a socket that we built.  If there are both a read and
245      * a write socket we asssume they are the same socket.
246      */
247
248     if (readfd) {
249         fd_flags |= ISC_SOCKFDWATCH_READ;
250         fd = readfd(h);
251     }
...
257
258     if (fd_flags != 0) {
259         status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
260                           fd, fd_flags,
261                           omapi_iscsock_cb,
262                           obj,
263                           dhcp_gbl_ctx.task,
264                           &obj->fd);
...
275     }
276
277
278     /* Find the last I/O state, if there are any. */
279     for (p = omapi_io_states.next;
280          p && p -> next; p = p -> next)
281         ;
282     if (p)
283         omapi_io_reference (&p -> next, obj, MDL);
284     else
285         omapi_io_reference (&omapi_io_states.next, obj, MDL);
...

omapi_register_io_object() is called for each socket created, in this
case for if_readsocket from discover_interfaces(). fd_flags is assigned
ISC_SOCKFDWATCH_READ, the file descriptor is obtained from readfd(), and
we enter the if statement.

The if statement calls isc_socket_fdwatchcreate(), which registers the
socket with the ISC socket manager and sets up omapi_iscsock_cb() as the
callback the socket library will invoke.

Once that has been done, we iterate over the omapi_io_states linked
list, which is a global list of registered sockets. We get to the end of
the list (or the beginning, if the list is empty), and add the socket to
the list.

Now, the bug happens between the call to isc_socket_fdwatchcreate(),
which registers the socket with the socket manager, and the addition of
the socket to the global linked list.

Sometimes, the callback omapi_iscsock_cb() is called in between.

omapip/dispatch.c

101 /*
102  * Callback routine to connect the omapi I/O object and socket with
103  * the isc socket code.  The isc socket code will call this routine
104  * which will then call the correct local routine to process the bytes.
105  *
106  * Currently we are always willing to read more data, this should be modified
107  * so that on connections we don't read more if we already have enough.
108  *
109  * If we have more bytes to write we ask the library to call us when
110  * we can write more.  If we indicate we don't have more to write we need
111  * to poke the library via isc_socket_fdwatchpoke.
112  */
...
118 #define SOCKDELETE 1
119 int
120 omapi_iscsock_cb(isc_task_t   *task,
121          isc_socket_t *socket,
122          void         *cbarg,
123          int           flags)
124 {
...
132 #if SOCKDELETE
133     /*
134      * walk through the io states list, if our object is on there
135      * service it.  if not ignore it.
136      */
137     for (obj = omapi_io_states.next; obj != NULL; obj = obj->next) {
138         if (obj == cbarg)
139             break;
140     }
141
142     if (obj == NULL) {
143         return(0);
144     }
145 #else
...
164     if ((flags == ISC_SOCKFDWATCH_READ) &&
165         (obj->reader != NULL) &&
166         (obj->inner != NULL)) {
167         status = obj->reader(obj->inner);
168         /*
169          * If we are shutting down (basically tried to
170          * read and got no bytes) we don't need to try
171          * again.
172          */
173         if (status == ISC_R_SHUTTINGDOWN)
174             return (0);
175         /* Otherwise We always ask for more when reading */
176         return (1);
...
188     /*
189      * We get here if we either had an error (inconsistent
190      * structures etc) or no more to write, tell the socket
191      * lib we don't have more to do right now.
192      */
193     return (0);
194 }

When omapi_iscsock_cb() is called, we iterate over that same
omapi_io_states global linked list. Since the raw socket we are about to
add would be the very first item placed on the list, omapi_io_states.next
is still NULL, so obj ends up NULL. Well, what happens when obj is NULL?
We return 0.

142     if (obj == NULL) {
143         return(0);
144     }

If you look a little further down the code, we see that when we are
finished with a socket, such as when shutting down or when an error
happens, we return 0 to indicate that we are done with it and will not
be using it again.

If we return 1, we ask for more data, and omapi_iscsock_cb() will be
called again in the future.

So, by returning 0, we will never read from that socket again, and thus
we simply ignore and never read any DHCPOFFER packets.

We have two sockets open, one for reading and one for writing, since
one is bound to port 67 and the other to port 68, so we keep sending out
DHCPDISCOVER packets, not knowing that we are ignoring all DHCPOFFERs
because the read socket has effectively been closed and
omapi_iscsock_cb() is never called for it again.

So, as explained on the upstream bug, what is needed for this race
condition to occur?

1) There needs to be data that can be read immediately from the socket.
This is how omapi_iscsock_cb() gets called: when there is data waiting.
If there is no data to read, the race can't happen.
2) Thread scheduling must occur in a particular order.

There are three threads:

a. The main thread, which is registering the socket
b. The socket manager coordinator thread
c. A socket manager worker thread

For the race to occur, we need to switch from a) to b) to schedule the
callback, then from b) to c) to run the callback and return 0, all
before control returns from c) to a) to add the socket to the
omapi_io_states linked list.
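
To make the ordering concrete, below is a small self-contained C program
(my own toy sketch, not isc-dhcp code; the names only roughly mirror the
real ones) that mimics the pattern: a "socket manager" thread fires a
callback that walks a global list, while the main thread only appends
the object to that list after handing it over. The sleep() simply widens
the race window so the bad ordering happens every time.

/* toy_race.c - build with: gcc -o toy_race toy_race.c -lpthread
 * Hypothetical sketch of the ordering problem, not isc-dhcp code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct io_obj { struct io_obj *next; };

static struct io_obj io_states;         /* like omapi_io_states */

/* Like the SOCKDELETE path in omapi_iscsock_cb(): only service the
 * object if it is already on the global list, otherwise give up. */
static int callback(struct io_obj *cbarg)
{
        struct io_obj *obj;

        for (obj = io_states.next; obj != NULL; obj = obj->next)
                if (obj == cbarg)
                        break;

        if (obj == NULL) {
                printf("callback: not on list yet, return 0 (watch dropped)\n");
                return 0;
        }
        printf("callback: found on list, return 1 (keep watching)\n");
        return 1;
}

/* The "socket manager" worker: data is ready immediately, so it calls
 * the callback as soon as it knows about the object. */
static void *socket_manager(void *arg)
{
        callback((struct io_obj *)arg);
        return NULL;
}

int main(void)
{
        static struct io_obj obj;
        pthread_t tid;

        /* 1) hand the object to the "socket manager"
         *    (the isc_socket_fdwatchcreate() step) */
        pthread_create(&tid, NULL, socket_manager, &obj);

        sleep(1);       /* widen the race window for demonstration */

        /* 2) append it to the global list - too late, the callback
         *    has already run and returned 0 */
        io_states.next = &obj;

        pthread_join(tid, NULL);
        return 0;
}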

To verify this, I added some print statements to omapi_iscsock_cb() and
omapi_register_io_object() to see what is happening:

https://paste.ubuntu.com/p/hGsssrVyG4/

I instrumented entry to the callback, as well as if obj is NULL, and
before and after the socket is registered to the socket manager, and
before and after the socket is added to the global linked list.
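
The paste has the full diff; the instrumentation is nothing more than
debug prints around those two steps, roughly along these lines (a sketch
of the idea only, not the exact patch; log_info() is the logging helper
already used throughout the tree):

/* in omapi_register_io_object(): */
        log_info ("dbg: registering fd %d with the socket manager", fd);
        status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
                                          fd, fd_flags,
                                          omapi_iscsock_cb,
                                          obj,
                                          dhcp_gbl_ctx.task,
                                          &obj->fd);
        log_info ("dbg: registered fd %d with the socket manager", fd);
        ...
        log_info ("dbg: adding object to omapi_io_states");
        /* (the linked-list walk and omapi_io_reference() quoted above) */
        log_info ("dbg: added object to omapi_io_states");

/* in omapi_iscsock_cb(): */
        log_info ("dbg: entered omapi_iscsock_cb(), flags = %d", flags);
        ...
        if (obj == NULL) {
                log_info ("dbg: obj not on omapi_io_states, returning 0");
                return(0);
        }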

I built the new dhclient and ran test-parallel.sh, since it is by far
the best at reproducing the issue. I concatenated all 20 /tmp/dhclient-*
logs to read:

https://paste.ubuntu.com/p/VddddSmdMV/

If we look at one that works and gets the DHCPOFFER, we see:

https://paste.ubuntu.com/p/dfdvNM3RDd/

We have two sockets, one that reads and one that writes. For both
sockets, one after the other, dhclient registers with the socket manager
and adds the socket to the linked list; then the DHCPDISCOVER is sent,
and then the callbacks are called.

Now, when the issue reproduces:

https://paste.ubuntu.com/p/77BNNhDwdh/

We see that we register the read socket with the socket manager, and
immediately get a callback to omapi_iscsock_cb(), so there is likely a
packet to read already. obj is NULL, since the socket is not on the
linked list yet, so we return 0, and strangely this happens twice when
we have only entered omapi_iscsock_cb() once. We then add the socket to
the linked list, showing we have hit the race condition. The writer
socket is also registered with the socket manager, but does not race, as
it is added to the linked list before its callback happens. From then on
the callback only seems to be called for the writer socket, and we
ignore all DHCPOFFER packets, only sending DHCPDISCOVER packets.

I ran the test-parallel.sh script a few times, and each and every time
the issue reproduced, the linked list was NULL and we returned 0 from
omapi_iscsock_cb(). So I am confident we have found the root cause.

Now, I had the idea of adding a sleep between registering with the
socket manager and adding the socket to the linked list, to try to get
every single agent to fail 100% of the time, hoping that the callback
would always be called first.

The change is something along the lines of the following:

https://paste.ubuntu.com/p/8zJ2FQK99X/
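
For reference, the shape of the experiment is roughly the following (a
sketch only; the actual diff is in the paste above): a sleep() dropped
into omapi_register_io_object() between the fdwatch registration and the
linked-list insertion.

        if (fd_flags != 0) {
                status = isc_socket_fdwatchcreate(dhcp_gbl_ctx.socketmgr,
                                                  fd, fd_flags,
                                                  omapi_iscsock_cb,
                                                  obj,
                                                  dhcp_gbl_ctx.task,
                                                  &obj->fd);
                ...
        }

        sleep(2);       /* added: give the callback time to fire first */

        /* Find the last I/O state, if there are any. */
        for (p = omapi_io_states.next;
             p && p -> next; p = p -> next)
                ;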

But after I built it and ran it a few times, I couldn't reproduce the
issue at all. A typical run is below:

https://paste.ubuntu.com/p/k6wRhx4RCM/

We can see that everything happens in the correct order, and the issue
is not reproduced. I think this is because each dhclient now has to wait
2 seconds before sending a DHCPDISCOVER, so the network is quiet and
there is no packet to read immediately (neighbouring agents are also
quiet), the callback is not called early, and the socket is added to the
linked list before the network gets busy.

The code itself to add the socket to the linked list was added in the
below commit, 22 years ago:

commit 61b844bfd7641a62d681a1f70d3b6dbc485ce4b6
From: Ted Lemon <sou...@isc.org>
Date: Thu, 2 Sep 1999 00:32:56 +0000
Subject: Omapi library, initial checkin
Link: 
https://gitlab.isc.org/isc-projects/dhcp/-/commit/61b844bfd7641a62d681a1f70d3b6dbc485ce4b6

Later on, the registration with the socket manager and the callback
were added 13 years ago:

commit 98bf16077d22f28e288a18e184a9d1f97cb5f4f7
From: Shawn Routhier <s...@isc.org>
Date: Wed, 28 Oct 2009 04:12:30 +0000
Subject: Support for asynchronous ddns per ticket 19216 - convert to
 using isclib and dnslib.
Link: 
https://gitlab.isc.org/isc-projects/dhcp/-/commit/98bf16077d22f28e288a18e184a9d1f97cb5f4f7

Finally, the null linked list check in the callback that returns 0 was
also added 13 years ago:

commit 8fa0112dd11ee301046b42fe463074b067e61a35
From: Shawn Routhier <s...@isc.org>
Date: Thu, 19 Nov 2009 23:49:57 +0000
Subject: Ticket 20540 - clean ups for DHCP-DDNS work, add an change
 dropped in the merge, arrange to handle callbacks from the socket code after
 we've deleted the socket and deal with the quantum issue in the tasks.  This
 last will be removed shortly as a better fix is now available.
Link: 
https://gitlab.isc.org/isc-projects/dhcp/-/commit/8fa0112dd11ee301046b42fe463074b067e61a35

This code and thus the bug is present in all versions of Ubuntu from
10.04 LTS onward.

However, I could not reproduce on Bionic, Kinetic or Lunar, so I looked
into why this only seems to affect Focal and Jammy, when all versions
share the same buggy code.

It appears that on Focal and Jammy, dhclient is multi-threaded:

$ sudo ps -T -p 19512
    PID    SPID TTY          TIME CMD
  19512   19512 ?        00:00:00 dhclient
  19512   19519 ?        00:00:00 isc-worker0000
  19512   19521 ?        00:00:00 isc-socket
  19512   19522 ?        00:00:00 isc-timer

While on Bionic, Kinetic and Lunar, it presents only a single thread:

$ sudo ps -T -p 23894
  PID  SPID TTY          TIME CMD
23894 23894 ?        00:00:00 dhclient

You can't have thread concurrency issues if you run as a single thread.
The race simply could not happen, since everything had to execute in
order, which is why this bug has existed for 13 years in its current
form.

Chris Patterson, in comment #2, mentions that he built dhclient from the
latest Debian source, available here:

https://salsa.debian.org/debian/isc-dhcp/-/commits/master/debian

Chris mentions that it solves the problem, and that it uses in-tree bind
libraries. Looking into this, we see the tree has bind 9.11.36:

https://salsa.debian.org/debian/isc-dhcp/-/blob/master/bind/version.tmp

and its makefile explicitly shows threads being disabled:

https://salsa.debian.org/debian/isc-dhcp/-/blob/master/bind/Makefile.in

bindconfig =
...
--disable-threads
...

bind 9.11.36 still supports disabling threads. This is why the upstream
Debian build did not reproduce the issue on Focal: threading was
disabled, so dhclient ran as a single thread.

On Kinetic and Lunar, the isc-dhcp package tracks the Debian upstream
version, with the very same in-tree bind libraries built with
--disable-threads, which is why the issue does not occur on those
versions.

So, we have two options for a fix for Focal and Jammy:

1) We disable threading for dhclient.
2) We add in a mutex to resolve this particular concurrency issue.

Looking at the source code, there are no mutexes or semaphores. pthread
isn't even a dependency, and it is not linked in. There are no thread
synchronisation primitives at all.

Initially Matthew proposed 1) as the solution.

It seems isc-dhcp on both Focal and Jammy is incompatible with the
native bind9 packages, and instead uses the libraries from bind9-libs, a
separate source package that ships bind 9.11.16 on Focal and 9.11.19 on
Jammy, to work around the incompatibility.

https://launchpad.net/ubuntu/+source/bind9-libs

Looking at the dependencies for isc-dhcp in Focal:

$ apt depends isc-dhcp-client
isc-dhcp-client
  Depends: libc6 (>= 2.15)
  Depends: libdns-export1109
  Depends: libisc-export1105
  Depends: debianutils (>= 2.8.2)
  Depends: iproute2
  Recommends: isc-dhcp-common
  Suggests: resolvconf
    openresolv
  Suggests: avahi-autoipd
  Suggests: isc-dhcp-client-ddns

We see we depend on libdns-export1109 and libisc-export1105 from
bind9-libs.

Looking at those reverse dependencies:

$ apt rdepends libisc-export1105
libisc-export1105
Reverse Depends:
  Depends: libbind-export-dev (= 1:9.11.16+dfsg-3~ubuntu1)
  Depends: isc-dhcp-client
  Depends: isc-dhcp-server-ldap
  Depends: isc-dhcp-relay
  Depends: isc-dhcp-client-ddns
  Depends: libisccfg-export163
  Depends: libisccc-export161
  Depends: libirs-export161
  Depends: libdns-export1109
  Depends: libbind-export-dev (= 1:9.11.16+dfsg-3~build1)
  Depends: isc-dhcp-server
  Depends: isc-dhcp-client
  Depends: isc-dhcp-server-ldap
  Depends: isc-dhcp-relay
  Depends: isc-dhcp-client-ddns
  Depends: libisccfg-export163
  Depends: libisccc-export161
  Depends: libirs-export161
  Depends: libdns-export1109
  Depends: isc-dhcp-client
  Depends: isc-dhcp-server

It seems that, apart from the other bind9-libs binary packages
themselves, isc-dhcp is the only reverse dependency.

So it appears perfectly safe to change bind9-libs from --enable-threads
to --disable-threads.

Looking at their debian/rules files:

https://git.launchpad.net/ubuntu/+source/bind9-libs/tree/debian/rules?h=applied/ubuntu/focal-proposed
https://git.launchpad.net/ubuntu/+source/bind9-libs/tree/debian/rules?h=applied/ubuntu/jammy-devel

We see:

CONFIGURE_COMMON := \
...
 --disable-threads \
...
 --enable-threads \
...

It appears a mistake was made when setting up the configure options:
both --disable-threads and --enable-threads are set. Because of the way
configure is set up, --enable-threads will always win.

The fix would be a one line change, to remove --enable-threads.

However, this carries an unacceptably high regression risk, because
building bind9-libs with --disable-threads removes publicly exported
symbols from the libraries:

https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641515/+files/lp1926139_focal.debdiff
https://bugs.launchpad.net/ubuntu/+source/isc-dhcp/+bug/1926139/+attachment/5641516/+files/lp1926139_jammy.debdiff

As an alternative, Mauricio developed 2), which is to add a mutex that
restricts access to the global socket list, ensuring that sockets are
added to the list before a callback is allowed to walk the list and read
its entries.

https://launchpadlibrarian.net/646801520/lp1926139_focal_isc-dhcp.debdiff

Mauricio's solution is elegant, as it reuses an unused variable,
obj->closed, as a mutex, together with a compiler built-in memory
barrier, __sync_synchronize(), meaning there is no need to link
libpthread or add any extra thread synchronisation primitives.
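
In general terms, the guard works like the sketch below. This reuses the
toy program from earlier and only illustrates the flag-plus-barrier
idea; it is not Mauricio's actual patch (which reuses obj->closed inside
the real omapi code; see the debdiff for the real thing). The point is
that the callback is not allowed to walk the list until registration has
published the object, and __sync_synchronize() provides the ordering
without any libpthread primitives.

/* toy_race_fixed.c - gcc -o toy_race_fixed toy_race_fixed.c -lpthread
 * Illustration of the guard idea on the toy sketch from earlier;
 * NOT the actual isc-dhcp patch. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct io_obj {
        struct io_obj *next;
        int            listed;  /* guard flag (the role obj->closed plays) */
};

static struct io_obj io_states;

static int callback(struct io_obj *cbarg)
{
        struct io_obj *obj;

        /* Do not walk the list until registration has published the
         * object.  __sync_synchronize() is a full barrier, so both the
         * flag and the list links are re-read on every iteration. */
        while (!cbarg->listed)
                __sync_synchronize();
        __sync_synchronize();

        for (obj = io_states.next; obj != NULL; obj = obj->next)
                if (obj == cbarg)
                        break;

        if (obj == NULL)
                return 0;       /* can no longer happen in this toy */
        printf("callback: found on list, keep watching\n");
        return 1;
}

static void *socket_manager(void *arg)
{
        callback((struct io_obj *)arg);
        return NULL;
}

int main(void)
{
        static struct io_obj obj;
        pthread_t tid;

        pthread_create(&tid, NULL, socket_manager, &obj);

        sleep(1);                       /* same artificial race window */

        io_states.next = &obj;          /* append to the global list... */
        __sync_synchronize();           /* ...make the links visible...  */
        obj.listed = 1;                 /* ...then raise the guard flag  */

        pthread_join(tid, NULL);
        return 0;
}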

This has been tested with 13k VM deployments on Microsoft Azure, and has
been found to work as expected with no failures, meaning the risk of
additional race conditions we are not aware of is low.

The reason this patch was not forwarded upstream is that isc-dhcp is now
officially End of Life and has effectively been abandoned by upstream.
You can read about it in these notices:

https://lists.isc.org/pipermail/dhcp-users/2022-October/022786.html
https://www.isc.org/blogs/isc-dhcp-eol/

Upstream won't fix any more bugs, make any new releases, or even accept
any new commits. They are putting their efforts into isc-kea now.

...

(Another fix approach is now in comment #13).

https://bugs.launchpad.net/bugs/1926139

Title:
  dhclient: thread concurrency race leads to DHCPOFFER packets not being
  received

Status in bind9-libs package in Ubuntu:
  Won't Fix
Status in isc-dhcp package in Ubuntu:
  Invalid
Status in isc-dhcp source package in Focal:
  In Progress
Status in isc-dhcp source package in Jammy:
  In Progress

Bug description:
  [Impact]

   * Occasionally, during instance boot or machine start-up,
     dhclient will attempt to acquire a dhcp lease and fail,
     leaving the instance with no IP address and making it
     unreachable.

   * This happens about once every 100 reboots on bare metal,
     and affects between ~0.3% and 2% of deployments on Azure
     (comment #2).
     
   * Azure uses dhclient called from cloud-init instead of
     systemd-networkd, and this is causing issues with larger
     deployments.

   * The logs of an affected dhclient produce the following:

     Listening on LPF/enp1s0/52:54:00:1c:d7:00
     Sending on   LPF/enp1s0/52:54:00:1c:d7:00
     Sending on   Socket/fallback
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     ...
     (omitting 20 similar lines)
     ...
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     DHCPDISCOVER on enp1s0 to 255.255.255.255 port 67 ...
     No DHCPOFFERS received.
     No working leases in persistent database - sleeping.

   * This only impacts Focal and Jammy, where bind9-libs
     are multi-threaded (Bionic/earlier and Kinetic/later
     are single-threaded).

   * The actual problem is dhclient containing a thread
     concurrency race condition, and when the race occurs,
     the read socket is incorrectly/prematurely unwatched
     because required structures are not yet consistent,
     thus dhclient does not read any DHCPOFFER replies.

   * Detailed analysis of the issue is in comment #17.

  [Fix]

   * Prevent the race condition by starting to watch the
     read socket after required structures are consistent.
   
   * The fix has been tested in Azure w/ 13500 instances,
     and no errors have been observed (previously: 0.4%).
     
   * In any case, should regressions be observed, the patch
     introduces 2 switches to revert to the previous behavior,
     which can be applied per-process or system-wide:
     - DHCP_FD_FLAGS_POKE=0 environment variable
     - dhcp.fd_flags_poke=0 kernel cmdline option 
   
   * (Previous approaches/discussions included reverting
      bind9-libs to single-threaded, but we concluded it
      would carry more regression risk than expected
      [some bits in comment #8, and some internal chat],
      and it would remove exported symbols (apparently
      unused, but still). We also considered a
      mutex/spinlock approach, but later found a simpler
      way with the isc lib; comment #13.)
      
  [Test Plan]

   * Synthetic reproducer with GDB to force the race
     condition, and DHCP server/client/noise injection
     is described in comment #9.
   
   * Test with the original package (problem occurs).
   
   * Test with the modified package (problem fixed).
     - Set DHCP_FD_FLAGS_POKE=0 (problem occurs).
     - Set dhcp.fd_flags_poke=0 (problem occurs).

  [Regression Potential]

   * 1) dhclient failing to acquire DHCP leases.
      
   * 2) dhcpd is also affected by code changes,
     thus failures to handle DHCP lease requests
     also have potential for regressions.
     
   * 3) the functional change added by the fix,
     if a regression were to occur, would likely
     be an issue only under some (unknown) race
     condition as well, thus expected to be rare.

   * Note: this potentially affects Focal/Jammy 
     on Azure as a whole, per usage of dhclient
     in cloud-init instead of systemd-networkd.
     
     Azure provided extensive testing for all 3
     approaches (mostly internal communications,
     and some bug comments), with ~13k instances.
     
     No issues were observed (previously: 0.4%).
     
   * Such testing scale seems to indicate that
     there are no regressions in dhclient acquiring
     DHCP leases (1), nor another race condition
     hitting the fix/new behavior (3).

     With that, apparently (2) should be OK too.
     
   * Also, so to mitigate the regression risk
     as much as possible, there's very detailed
     analysis provided here (comments #17, #18)
     and more information about the fix in its
     patch file's comment.
