[Bug 2152692] Re: apcupsd NIS server deadlock: write_nbytes() has no send timeout, mutex held across network I/O → FD exhaustion under broken TCP

Thomas Lakofski Fri, 15 May 2026 23:50:37 -0700

════════════════════════════════════════════════════════════════
ADDENDUM — 2026-05-16: SECOND DEADLOCK: write_lock(ups) HELD
ACROSS USB/MODBUS I/O → CLOSE-WAIT ACCUMULATION WHEN A
USB TRANSACTION STALLS
════════════════════════════════════════════════════════════════


BACKGROUND

The fix described above was deployed on 2026-05-15 10:02 UTC. At
15:59 UTC the same day (5 hours 57 minutes later), a remote apcupsd
instance monitoring the NIS port declared comms lost and sent an alert.
Inspection revealed a qualitatively different failure mode, also caused
by blocking I/O under a mutex — but in the driver layer rather than the
NIS layer.

SYMPTOM

  • Munin netstat_established gauge climbed steadily at ~3/min.
    NOTE: Linux's CurrEstab (read by the munin netstat plugin from
    /proc/net/snmp) includes both ESTABLISHED and CLOSE-WAIT sockets,
    per RFC 4022. The gauge therefore reflects the new bug accurately
    even though the connection state is CLOSE-WAIT, not ESTABLISHED.

  • ss -tan showed:
      2581 CLOSE-WAIT   ← all on local port 3551 (NIS), Recv-Q=9 each
         7 ESTAB
        48 LISTEN
        15 TIME-WAIT

    Recv-Q=9 on every CLOSE-WAIT socket means the remote end closed
    after sending its 9-byte NIS request (2-byte length prefix + 7-byte
    "status\n"), but apcupsd had not read even the first byte — the
    connection threads were blocked before reaching net_recv().

  • apcupsd thread count: 2825. All NIS handler threads were in
    futex_do_wait (confirmed via /proc/<pid>/task/<tid>/stack).

  • All 2825 threads were waiting on the SAME futex address:
      syscall 202 (futex) arg1=0x6194fc09a2e0 op=FUTEX_WAIT_PRIVATE val=2
    The address falls in the process heap (confirmed via /proc/<pid>/maps),
    identifying it as ups->mutex (a pthread_mutex allocated in new_ups()).

  • The main thread (TID = PID) was not in futex_wait. Its kernel stack:
      reap_as+0xb4/0x120
      usbdev_do_ioctl+0x3e9/0x1510
      usbdev_ioctl+0xe/0x20
      __x64_sys_ioctl+0xa3/0x100
    Its syscall: 16 (ioctl) fd=7 request=0x4008550c.
    Decoding: _IOW('U', 12, void *) = USBDEVFS_REAPURB.
    FD 7 = /dev/bus/usb/005/003 = the APC UPS (lsusb: 051d:0003
    American Power Conversion UPS).

  The main thread was permanently blocked waiting for a USB URB to
  complete, while holding ups->mutex. Every NIS connection thread was
  blocked acquiring that mutex.

ROOT CAUSE

The UPS in question is a Smart-UPS 1500 communicating via the apcupsd
MODBUS driver (DRIVER: MODBUS UPS Driver / src/drivers/modbus/).
apcupsd also links libusb-0.1 for the generic USB HID driver; both
drivers exhibit the same structural flaw.

─── Flaw 1: UPS I/O under write_lock ──────────────────────────────

read_volatile_data() and read_static_data() in src/drivers/usb/usb.c
(and their equivalents in src/drivers/modbus/modbus.cpp) hold
write_lock(_ups) — which is P(ups->mutex) — across every USB or MODBUS
I/O call:

  MODBUS (modbus.cpp):
    bool ModbusUpsDriver::read_volatile_data() {
        write_lock(_ups);           ← ups->mutex acquired
        bool ret = UpdateCis(true); ← calls _comm->ReadRegister() per CI
        _ups->poll_time = time(NULL);
        write_unlock(_ups);
    }

  USB/libusb (usb.c):
    bool UsbUpsDriver::read_volatile_data() {
        write_lock(_ups);           ← ups->mutex acquired
        ...
        usb_update_value(ci);       ← calls pusb_get_value() → GetReport()
        ...
        write_unlock(_ups);
    }

  Additionally, GenericUsbUpsDriver::Open() in
  src/drivers/usb/generic/generic-usb.c held write_lock(_ups) across
  _hidups.Open(), which enumerates the USB bus and opens the device.

─── Flaw 2: libusb-0.1 timeout mechanism fails in multi-threaded
            processes ─────────────────────────────────────────────

HidUps::GetReport() (src/libusbhid/HidUps.cpp) calls usb_control_msg()
with a 1000 ms timeout:

    actlen = usb_control_msg(_fd, ..., 1000);

libusb-0.1 implements this timeout via SIGALRM. In a multi-threaded
process, SIGALRM is delivered to an arbitrary thread — not necessarily
the one blocked in USBDEVFS_REAPURB. The ioctl therefore never receives
EINTR, and the 1-second timeout silently misfires. When the UPS USB
interface stalls (firmware hiccup or USB bus error), USBDEVFS_REAPURB
blocks indefinitely.

This is a known limitation of libusb-0.1; libusb-1.0 addresses it with
a proper async API.

─── Combined effect ────────────────────────────────────────────────

  1. A USB/MODBUS transaction fails to complete (cause unknown —
     UPS, USB controller, or bus).
  2. Main thread blocks in USBDEVFS_REAPURB while holding ups->mutex.
  3. Every incoming NIS connection spawns a thread that immediately
     blocks at P(ups->mutex) inside attach_ups().
  4. Each blocked thread has accepted a socket FD. The remote end
     eventually closes (sending FIN), transitioning the socket to
     CLOSE-WAIT. The thread never calls net_recv() or net_close()
     because it is blocked before that point.
  5. CLOSE-WAIT sockets accumulate without bound. With RLIMIT_NOFILE
     raised to 65536 (from the first fix), accumulation continued
     undetected for ~14.5 hours before the daemon was restarted.

RELATIONSHIP TO THE FIRST FIX

The first fix resolved the NIS mutex deadlock and, as a necessary
consequence, RLIMIT_NOFILE was raised from 1024 to 65536.

This second bug existed prior to the first fix, but its effect was
masked: a USB stall would produce FD exhaustion (indistinguishable in
symptom from the write_nbytes bug) at 1024 FDs, much sooner. With the
NIS mutex bug fixed and the FD limit raised, the second bug became the
dominant failure mode and accumulated long enough to be diagnosed
cleanly.

The trigger itself is an independent event whose precise cause is
unknown. It is not caused by the first fix.

TIMELINE (all times UTC)

  2026-05-15 10:02:32  apcupsd restarted with first fix deployed
  2026-05-15 ~15:00    USB/MODBUS transaction stalls (cause unknown)
  2026-05-15 15:59:36  Remote monitor declares NIS comms lost; alert sent
                       (seeder.siblings.net, 192.168.122.33 — the same
                       host visible in all CLOSE-WAIT connections)
  2026-05-16 06:27:30  apcupsd restarted with second fix deployed;
                       2581 CLOSE-WAIT sockets cleared immediately

FIX

The same principle as the first fix: do not hold the mutex across
blocking I/O. Take write_lock only for the brief struct update after
I/O completes.

  Fix A — src/drivers/modbus/modbus.cpp

    UpdateCi(): acquire write_lock after _comm->ReadRegister() returns
    and the data is fully parsed into local variables; release after the
    switch statement that updates the _ups struct. Callers
    read_volatile_data() and read_static_data() remove their outer
    write_lock/_unlock wrappers entirely, since UpdateCi() now owns the
    lock. read_volatile_data() takes a brief write_lock only to update
    _ups->poll_time after the I/O loop completes.

  Fix B — src/drivers/usb/usb.c

    usb_update_value(): acquire write_lock after usb_get_value() (which
    calls the driver's pusb_get_value() and therefore any USB I/O)
    returns; release after usb_process_value() updates the _ups struct.
    read_volatile_data() and read_static_data() remove their outer
    write_lock/_unlock; read_volatile_data() takes a brief write_lock
    for the initial _ups->poll_time and Status bookkeeping, then
    releases before the I/O loop.

  Fix C — src/drivers/usb/generic/generic-usb.c

    Open(): move _hidups.Open() (USB bus enumeration and device open)
    outside write_lock. Retain write_lock only for the two _ups struct
    field assignments (_ups->fd, _ups->clear_slave()) that follow.

── Patch ──────────────────────────────────────────────────────────

--- src/drivers/modbus/modbus.cpp.orig
+++ src/drivers/modbus/modbus.cpp
@@ UpdateCi() — after "delete [] data;" @@
+   /* MODBUS I/O complete; hold write_lock only for the brief struct
+    * update below. Callers must not hold write_lock on entry. */
+   write_lock(_ups);
+
    astring tmpstr;
    struct tm tmp;
    time_t date;
    switch (info->ci)
    {
    ...all cases unchanged...
    }
+   write_unlock(_ups);
    return true;

@@ read_static_data() @@
 bool ModbusUpsDriver::read_static_data()
 {
-   write_lock(_ups);
-   bool ret = UpdateCis(false);
-   write_unlock(_ups);
-   return ret;
+   /* UpdateCi() takes write_lock only for each struct update. */
+   return UpdateCis(false);
 }

@@ read_volatile_data() @@
 bool ModbusUpsDriver::read_volatile_data()
 {
-   write_lock(_ups);
-   bool ret = UpdateCis(true);
-   if (ret)
-   {
-      _ups->poll_time = time(NULL);
-   }
-   write_unlock(_ups);
-   return ret;
+   bool ret = UpdateCis(true);
+   if (ret)
+   {
+      write_lock(_ups);
+      _ups->poll_time = time(NULL);
+      write_unlock(_ups);
+   }
+   return ret;
 }

--- src/drivers/usb/usb.c.orig
+++ src/drivers/usb/usb.c
@@ usb_update_value() @@
 bool UsbUpsDriver::usb_update_value(int ci)
 {
    USB_VALUE uval;
-   if (!usb_get_value(ci, &uval))
-      return false;
-   usb_process_value(ci, &uval);
-   return true;
+   /* USB I/O without write_lock: libusb-0.1's SIGALRM-based timeout
+    * is unreliable in multi-threaded processes; usb_control_msg() can
+    * block indefinitely if the UPS stalls, deadlocking NIS threads. */
+   if (!usb_get_value(ci, &uval))
+      return false;
+   write_lock(_ups);
+   usb_process_value(ci, &uval);
+   write_unlock(_ups);
+   return true;
 }

@@ read_volatile_data() @@
-   write_lock(_ups);
-   _ups->poll_time = now;
-   _ups->Status &= ~0xFF;
-   for (int i=0; _known_info[i].usage_code; i++) {
-      if (_known_info[i].isvolatile && _known_info[i].ci != CI_NONE)
-         usb_update_value(_known_info[i].ci);
-   }
-   write_unlock(_ups);
+   /* Hold write_lock only for initial bookkeeping; usb_update_value()
+    * takes it briefly for each struct update after USB I/O completes. */
+   write_lock(_ups);
+   _ups->poll_time = now;
+   _ups->Status &= ~0xFF;
+   write_unlock(_ups);
+   for (int i=0; _known_info[i].usage_code; i++) {
+      if (_known_info[i].isvolatile && _known_info[i].ci != CI_NONE)
+         usb_update_value(_known_info[i].ci);
+   }
    return 1;

@@ read_static_data() @@
 bool UsbUpsDriver::read_static_data()
 {
-   write_lock(_ups);
-   for (int i=0; _known_info[i].usage_code; i++) {
-      if (!_known_info[i].isvolatile && _known_info[i].ci != CI_NONE)
-         usb_update_value(_known_info[i].ci);
-   }
-   write_unlock(_ups);
+   /* usb_update_value() handles its own write_lock. */
+   for (int i=0; _known_info[i].usage_code; i++) {
+      if (!_known_info[i].isvolatile && _known_info[i].ci != CI_NONE)
+         usb_update_value(_known_info[i].ci);
+   }
    return 1;
 }

--- src/drivers/usb/generic/generic-usb.c.orig
+++ src/drivers/usb/generic/generic-usb.c
@@ Open() @@
 bool GenericUsbUpsDriver::Open()
 {
-   write_lock(_ups);
-   bool rc = _hidups.Open(_ups->device);
-   _ups->fd = 1;
-   _ups->clear_slave();
-   write_unlock(_ups);
+   /* USB device enumeration can block; no lock during I/O. */
+   bool rc = _hidups.Open(_ups->device);
+   write_lock(_ups);
+   _ups->fd = 1;
+   _ups->clear_slave();
+   write_unlock(_ups);
    return rc;
 }

NOTES

• The comment in src/lib/apcstatus.c (the output_status() function)
  explicitly acknowledges the long-lock problem from the NIS reader's
  side ("Note that this creates the potential for a misbehaving client
  to create long lock hold times, essentially locking the driver out")
  and even suggests the correct fix ("making a local copy of UPSINFO
  under the lock, then releasing the lock and feeding the client from
  the copy"). The driver side has the symmetric problem, with no
  analogous comment.

• check_state() in both the linux and generic USB drivers already
  performs USB interrupt reads (select() + read()) without holding
  write_lock, acquiring it only briefly when a changed value is found.
  The polling path (read_volatile_data) should follow the same pattern,
  and now does.

• check_state() in modbus.cpp calls UpdateCi(CI_STATUS) without holding
  write_lock (a pre-existing omission). After this fix, UpdateCi()
  takes write_lock internally, so that call is now also protected.

• The trigger is a USB/MODBUS transaction that fails to complete —
  whether due to UPS firmware, USB controller behaviour, or some other
  cause is unknown. It is an independent event, not reproducible on
  demand in the way the first bug was. The fix is verified by code
  inspection and by the clean disappearance of CLOSE-WAIT accumulation
  after the patched binary was started. A recurrence would require
  another such event.

• This analysis, test script, and patch were developed with the assistance
  of Claude Code (Anthropic), model claude-sonnet-4-6, on 2026-05-16.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2152692

Title:
  apcupsd NIS server deadlock: write_nbytes() has no send timeout, mutex
  held across network I/O → FD exhaustion under broken TCP

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/apcupsd/+bug/2152692/+subscriptions


-- 
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 2152692] Re: apcupsd NIS server deadlock: write_nbytes() has no send timeout, mutex held across network I/O → FD exhaustion under broken TCP

Reply via email to