On Thu, Apr 15, 2021 at 07:13:53AM +0000, Robin H. Johnson wrote:
> Thanks; I will need to catch it faster or automate this, because the
> watchdog does a MUCH better job restarting it than before, less than 30
> seconds of 100% CPU before the watchdog reliably kills it.

I see. Then collecting the watchdog outputs can be instructive to see
if it always happens at the same place or not. And the core dumps will
indicate what all threads were doing (and if some were competing on a
lock for example).
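
One way to check whether the watchdog always fires at the same place is to reduce each collected dump to a short signature and count how often each signature occurs. The sketch below is only an illustration: it assumes each dump is plain text with one frame per line, it does not reproduce the actual layout of haproxy's watchdog output, and the names `signature` and `count_signatures` are hypothetical.

```python
import re
from collections import Counter

# Assumption: hex values (addresses, offsets) differ between runs, so they
# are masked. This groups dumps by function name, not by exact instruction.
_HEX = re.compile(r"0x[0-9a-fA-F]+")


def signature(dump, depth=3):
    """Return the first `depth` non-empty lines of a dump, hex values masked."""
    lines = [_HEX.sub("0x?", ln.strip()) for ln in dump.splitlines() if ln.strip()]
    return tuple(lines[:depth])


def count_signatures(dumps, depth=3):
    """Count how many dumps share each signature."""
    return Counter(signature(d, depth) for d in dumps)
```

If one signature dominates the counts, the stalls probably share a location; an even spread points at something more general, such as CPU contention.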

> >   - please also check if your machine is not swapping, as this can have
> >     terrible consequences and could explain why it only happens on certain
> >     machines dealing with certain workloads. I remember having spent several
> >     weeks many years ago chasing a response time issue happening only in the
> >     morning, which was in fact caused by the log upload batch having swapped
> >     and left a good part of the unused memory in the swap, making it very
> >     difficult for the network stack to allocate buffers during send() and
> >     recv(), thus causing losses and retransmits as the load grew. This was
> >     never reproducible in the lab because we were not syncing logs :-)
> 512GiB RAM and no swap configured on the system at all.

:-)

> Varnish runs on the same host and is used to cache some of the backends.
> Plenty of free memory at the moment.

I'm now thinking about something. Do you have at least as many CPUs as the
total number of threads used by haproxy and varnish ? Otherwise there will
be some competition and migrations will happen. If neither is bound to
specific CPUs, you can even end up with two haproxy threads forced to run on
the same CPU,
which is the worst situation as one could be scheduled out with a lock
held and the other one spinning waiting for this lock.
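
The comparison between available CPUs and total threads can be sketched as below. This is a minimal illustration that assumes Python 3 on Linux; the helper names and the idea of passing the thread counts in by hand are assumptions, since the real numbers come from the haproxy and varnish configurations.

```python
import os


def available_cpus():
    """CPUs this process may run on (Linux), else the machine's CPU count."""
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:  # sched_getaffinity is not available everywhere
        return os.cpu_count() or 1


def is_oversubscribed(cpus, thread_counts):
    """True when the threads of all listed processes outnumber the CPUs."""
    return sum(thread_counts.values()) > cpus
```

For example, `is_oversubscribed(available_cpus(), {"haproxy": 16, "varnish": 32})` reports whether those two (hypothetical) thread counts exceed what the machine offers.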

With half a TB of RAM I guess you have multiple sockets. Could you at
least pin haproxy to the CPUs of a single socket (that's the bare minimum
to do to preserve performance, as atomics and locks over UPI/QPI are a
disaster), and ideally pin varnish to another socket ?
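
To find out which CPUs belong to which socket, Linux exposes a `physical_package_id` file per CPU under sysfs. The following sketch, assuming Linux and Python 3, groups CPU numbers by that id and formats them as a compact list; the function names are hypothetical and the code only reads the topology, it pins nothing.

```python
import glob
import os
from collections import defaultdict


def read_package_ids(root="/sys/devices/system/cpu"):
    """Map CPU number -> physical package (socket) id, read from Linux sysfs."""
    ids = {}
    pattern = os.path.join(root, "cpu[0-9]*", "topology", "physical_package_id")
    for path in glob.glob(pattern):
        cpu_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as f:
            ids[int(cpu_dir[3:])] = int(f.read().strip())
    return ids


def group_by_socket(package_ids):
    """Turn {cpu: socket} into {socket: sorted list of cpus}."""
    sockets = defaultdict(list)
    for cpu, pkg in package_ids.items():
        sockets[pkg].append(cpu)
    return {pkg: sorted(cpus) for pkg, cpus in sockets.items()}


def to_cpu_list(cpus):
    """Format CPU numbers as a compact list such as "0-3,8-11"."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu  # extend the current run
        else:
            ranges.append([cpu, cpu])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)
```

The string produced by `to_cpu_list(group_by_socket(read_package_ids())[0])` is in the list format accepted by `taskset -c`, which is one possible way to apply the pinning.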

Or maybe just enable fewer threads for haproxy if you don't need that many
and make sure the CPUs it's bound to are not used by varnish ?
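
Whether two running processes can still land on the same CPUs can be verified from their affinity masks. The sketch below assumes Linux and Python 3; it walks `/proc/<pid>/task` because affinity is set per thread, and the function names are hypothetical.

```python
import os


def process_cpus(pid):
    """Union of the CPUs that any thread of `pid` is allowed to run on."""
    cpus = set()
    for tid in os.listdir(f"/proc/{pid}/task"):
        cpus |= os.sched_getaffinity(int(tid))
    return cpus


def shared_cpus(cpus_a, cpus_b):
    """CPUs present in both sets; an empty list means no overlap."""
    return sorted(set(cpus_a) & set(cpus_b))
```

Calling `shared_cpus(process_cpus(haproxy_pid), process_cpus(varnish_pid))`, with the two PIDs supplied by the caller, should return an empty list once the processes are properly separated.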

In such a setup combining several high-performance processes, it's really
important to reserve resources for them, and you must also account for the
number of CPUs needed to deal with network interrupts (and likely for disk
if varnish uses it). Once your resources are cleanly reserved, you'll get
the maximum performance with the lowest latency.
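
The reservation can be written down as a simple budget before touching any configuration. The sketch below hands out disjoint CPU sets in order and refuses a plan that asks for more CPUs than exist; the consumer names and counts used with it are purely illustrative, and `reserve` is a hypothetical helper.

```python
def reserve(cpus, plan):
    """Hand out disjoint CPU sets in the order given by `plan`.

    `plan` is a list of (name, count) pairs. Raises ValueError when the
    requests add up to more CPUs than are available.
    """
    pool = sorted(cpus)
    needed = sum(count for _, count in plan)
    if needed > len(pool):
        raise ValueError(f"need {needed} CPUs, only {len(pool)} available")
    reserved, start = {}, 0
    for name, count in plan:
        reserved[name] = pool[start:start + count]
        start += count
    return reserved
```

For instance, `reserve(range(8), [("net-irq", 2), ("haproxy", 4), ("varnish", 2)])` splits eight CPUs three ways. The function knows nothing about sockets, so the caller would pass it the CPUs of a single socket when locality matters.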

Willy
