Hi list,

I attached a fix for a master CLI connection slot leak
in master-worker mode.

The master CLI proxy (mworker_proxy) has a hardcoded maxconn of 10.
When a client connects to the master CLI socket and sends a command
that gets forwarded to an unresponsive worker, the connection hangs.
If the client then disconnects, the connection slot is never released:
neither the request analyzer nor the response analyzer cleans it up.
Once 10 slots have leaked, the master socket becomes completely
unreachable and new connections fail with "Resource temporarily
unavailable".

We hit this in Kubernetes deployments where readiness probes query the
master CLI socket. When workers become temporarily unresponsive under
load, probe connections time out and leak slots, eventually triggering
pod restarts.

This is tracked as GH issue #3351:
https://github.com/haproxy/haproxy/issues/3351

The patch works like this:

  1. In pcli_wait_for_request(), when the response analyzer is active
     and the frontend stream connector shows a client disconnect
     (SC_FL_EOS or SC_FL_ABRT_DONE on scf), explicitly call
     sc_abort(s->scb) to propagate the disconnect to the backend.

  2. In pcli_wait_for_response(), extend the error condition to also
     check for SC_FL_ABRT_DONE on scb. This flag is only set by the
     explicit sc_abort() above, so normal one-shot CLI tools that
     close their TCP connection after receiving a response are not
     affected.

A regression test is included.
This should be backported to all stable branches.

Cheers,
Alexander

Attachment: 0001-BUG-MEDIUM-cli-fix-master-CLI-connection-slot-leak-o.patch
