On Wed, Apr 14, 2021 at 01:53:06PM +0200, Christopher Faulet wrote:
> > nbthread=64, nbproc=1 on both 1.8/2.x
> 
> It is thus surprising, if it is really a contention issue, that you never
> observed slow down on the 1.8. There is no watchdog, but the thread
> implementation is a bit awkward on the 1.8. 2.X are better on this point,
> the best being the 2.4.

Agreed, I'd even say that 64 threads in 1.8 should be wayyyy slower than
a single thread.

A few things are worth having a look at, Robin:
  - please run "perf top" when this happens, and when you see a function
    reporting a high usage, do not hesitate to navigate into it by pressing
    enter and selecting "annotate function <foobar>". Scrolling through the
    annotated code will then show what percentage of the samples landed on
    each instruction. It's very possible you'll find that 100% of the CPU
    is used in 1, 2 or 3 functions and that there is a logic error
    somewhere. Usually if you find a single one you'll end up spotting a
    lock.

  - please also check whether your machine is swapping, as this can have
    terrible consequences and could explain why it only happens on certain
    machines dealing with certain workloads. I remember having spent several
    weeks many years ago chasing a response time issue happening only in the
    morning, which was in fact caused by the log upload batch having swapped
    and left a good part of the unused memory in the swap, making it very
    difficult for the network stack to allocate buffers during send() and
    recv(), thus causing losses and retransmits as the load grew. This was
    never reproducible in the lab because we were not syncing logs :-)
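
For the swap check, here is a minimal sketch of one way to watch for swap
activity, assuming a Linux machine where /proc/vmstat exposes the "pswpin"
and "pswpout" counters (pages swapped in and out since boot). The function
names, the sample count and the one-second interval are only illustrative
choices, not anything taken from haproxy. Non-zero deltas while the problem
is occurring would suggest the machine is actively swapping.

```python
import time


def parse_vmstat(text):
    """Extract the pswpin/pswpout counters (pages since boot) from vmstat text."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Pages swapped in and out between two samples."""
    return {key: after.get(key, 0) - before.get(key, 0)
            for key in ("pswpin", "pswpout")}


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the kernel counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())


def watch(samples=10, interval=1.0):
    """Print the swap activity seen during each interval (hypothetical helper)."""
    prev = read_vmstat()
    for _ in range(samples):
        time.sleep(interval)
        cur = read_vmstat()
        delta = swap_delta(prev, cur)
        print("swapped in: %d pages, swapped out: %d pages"
              % (delta["pswpin"], delta["pswpout"]))
        prev = cur
```

Calling watch() while the slowdown is happening prints one line per
interval. Lines that stay at zero suggest swapping is not involved at that
moment, though pages left in swap by an earlier batch can still hurt later.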

Willy
