On Thu, Apr 15, 2021 at 08:59:35AM +0200, Willy Tarreau wrote:
> On Wed, Apr 14, 2021 at 01:53:06PM +0200, Christopher Faulet wrote:
> > > nbthread=64, nbproc=1 on both 1.8/2.x
> > 
> > It is thus surprising, if it is really a contention issue, that you never
> > observed slow down on the 1.8. There is no watchdog, but the thread
> > implementation is a bit awkward on the 1.8. 2.X are better on this point,
> > the best being the 2.4.
> 
> Agreed, I'd even say that 64 threads in 1.8 should be wayyyy slower than
> a single thread.
> 
> A few things are worth having a look at, Robin:
>   - please run "perf top" when this happens, and when you see a function
>     reporting a high usage, do not hesitate to navigate through it, pressing
>     enter, and "annotate function <foobar>". Then scrolling through it will
>     reveal some percentage of time certain points were met. It's very possible
>     you'll find that 100% of the CPU is used in 1, 2 or 3 functions and
>     that there is a logic error somewhere. Usually if you find a single one
>     you'll end up spotting a lock.
Thanks; I will need to catch it faster or automate this, because the
watchdog does a MUCH better job restarting it than before, less than 30
seconds of 100% CPU before the watchdog reliably kills it.
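
A rough sketch of how the capture could be automated, assuming Linux /proc
and that "perf" is installed: sample the process CPU time, and once usage
crosses a threshold, start a short "perf record" before the watchdog fires.
The function names, the threshold and the output path are all hypothetical
and would need tuning for a 64-thread process.

```python
import os
import subprocess
import time


def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s=100):
    """CPU usage between two utime+stime samples, in percent of one core."""
    return 100.0 * (ticks_after - ticks_before) / (ticks_per_s * interval_s)


def read_cpu_ticks(pid):
    """Return utime+stime (in clock ticks) from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The comm field may contain spaces, so split after its closing ")".
    fields = data.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def perf_command(pid, seconds, out_path):
    """Build a 'perf record' command that samples pid with call graphs."""
    return ["perf", "record", "-g", "-p", str(pid), "-o", out_path,
            "--", "sleep", str(seconds)]


def watch(pid, threshold_pct, interval_s=1.0, record_s=10):
    """Poll pid and record one perf profile when usage exceeds the threshold.

    threshold_pct is relative to one core, so a process saturating
    64 threads would show roughly 6400.
    """
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    before = read_cpu_ticks(pid)
    while True:
        time.sleep(interval_s)
        after = read_cpu_ticks(pid)
        if cpu_percent(before, after, interval_s, ticks_per_s) >= threshold_pct:
            out_path = f"/tmp/perf-{pid}-{int(time.time())}.data"
            subprocess.run(perf_command(pid, record_s, out_path), check=False)
            return out_path
        before = after
```

The resulting file can then be inspected offline with "perf report -i <file>",
which allows the same per-function annotation as "perf top".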

>   - please also check if your machine is not swapping, as this can have
>     terrible consequences and could explain why it only happens on certain
>     machines dealing with certain workloads. I remember having spent several
>     weeks many years ago chasing a response time issue happening only in the
>     morning, which was in fact caused by the log upload batch having swapped
>     and left a good part of the unused memory in the swap, making it very
>     difficult for the network stack to allocate buffers during send() and
>     recv(), thus causing losses and retransmits as the load grew. This was
>     never reproducible in the lab because we were not syncing logs :-)
512GiB RAM and no swap configured on the system at all.
Varnish runs on the same host and is used to cache some of the backends.
Plenty of free memory at the moment.
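
For the other machines, the swap check can be scripted. A minimal sketch,
assuming the usual Linux /proc/meminfo layout (the helper names are
hypothetical):

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integers (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def swap_used_kb(info):
    """Swap currently in use, in kB; 0 when no swap is configured."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


def read_swap_used_kb(path="/proc/meminfo"):
    """Read the live value from the running system."""
    with open(path) as f:
        return swap_used_kb(parse_meminfo(f.read()))
```

This only reports how much swap is occupied. Watching the pswpin and pswpout
counters in /proc/vmstat over time would show whether the machine is actively
swapping.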

-- 
Robin Hugh Johnson
GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
