On Wed, Apr 14, 2021 at 01:53:06PM +0200, Christopher Faulet wrote:
> > nbthread=64, nbproc=1 on both 1.8/2.x
>
> It is thus surprising, if it is really a contention issue, that you never
> observed slow down on the 1.8. There is no watchdog, but the thread
> implementation is a bit awkward on the 1.8. 2.X are better on this point,
> the best being the 2.4.

Agreed, I'd even say that 64 threads in 1.8 should be wayyyy slower than
a single thread.

A few things are worth having a look at, Robin:

  - please run "perf top" when this happens, and when you see a function
    reporting a high usage, do not hesitate to navigate through it, pressing
    enter, and "annotate function <foobar>". Then scrolling through it will
    reveal the percentage of time spent at certain points. It's very
    possible you'll find that 100% of the CPU is used in 1, 2 or 3
    functions and that there is a logic error somewhere. Usually if you
    find a single one you'll end up spotting a lock.

  - please also check that your machine is not swapping, as this can have
    terrible consequences and could explain why it only happens on certain
    machines dealing with certain workloads. I remember having spent
    several weeks many years ago chasing a response time issue happening
    only in the morning, which was in fact caused by the log upload batch
    having forced swapping and left a good part of the unused memory in the
    swap, making it very difficult for the network stack to allocate
    buffers during send() and recv(), thus causing losses and retransmits
    as the load grew. This was never reproducible in the lab because we
    were not syncing logs :-)

Willy
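
For the swap check above, here is a minimal sketch of one way to see whether a
machine is actively swapping. It assumes Linux and relies on the pswpin and
pswpout counters in /proc/vmstat (pages swapped in and out since boot). The
function names and the 5 second interval are only illustrative and are not
taken from this thread.

```python
import time


def parse_vmstat(text):
    """Parse "name value" lines (the /proc/vmstat layout) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Return (pages swapped in, pages swapped out) between two samples."""
    swapped_in = after.get("pswpin", 0) - before.get("pswpin", 0)
    swapped_out = after.get("pswpout", 0) - before.get("pswpout", 0)
    return swapped_in, swapped_out


def sample_swap_activity(path="/proc/vmstat", interval=5.0):
    """Read the counters twice, `interval` seconds apart, and return the delta.

    The default path and interval are assumptions for illustration.
    """
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return swap_delta(before, after)
```

Calling sample_swap_activity() while the slowdown is happening and getting
non-zero values back would suggest the machine is swapping at that moment.
The "si" and "so" columns of "vmstat 1" report the same kind of activity.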