On 09/04/2021 at 19:26, Robin H. Johnson wrote:
Hi,

Maciej had said they were going to create a new thread, but I didn't see
one yet.

I want to start by noting that the problem was much worse on 2.2.8 &
2.2.9, and that 2.2.13 & 2.3.9 don't get entirely hung at 100% anymore:
a big thanks for that initial work in fixing the issue.

As I mentioned in my other mail asking for a 1.8.30 release, we're
experiencing this problem in DigitalOcean's HAProxy instances used to
run the Spaces product.

I've been trying to dig out deeper detail as well with a debug threads
version, but I have the baseline error output from 2.3.9 here to share,
after passing redaction review. This dump was generated with vbernat's
PPA version of 2.3.9, not any internal builds.

We have struggled to reproduce the problem in testing environments: it
only turns up at the biggest regions, and plotting occurrences of the
issue over the time dimension suggests that it might have some partial
correlation w/ a weird workload input.

The dumps do suggest Lua is implicated as well, and we've got some
extensive Lua code, so it's impossible to rule it out as contributing to
the problem (We have been discussing plans to move it to SPOA instead).

The Lua code in question hasn't changed significantly in nearly 6
months, and it was problem-free on the 1.8 series (having a test suite
for the Lua code has been invaluable).


Hi,

It seems you have a blocking call in one of your Lua scripts. The thread dump shows many threads blocked in hlua_ctx_init. Many others are executing Lua. Unfortunately, for an unknown reason, there is no stack traceback.

In 2.3 and earlier, Lua scripts are executed under a global lock. Blocking calls in a Lua script are therefore awful, because they block not just one thread but all the others too. I guess the same issue exists on 1.8, but there is no watchdog in that version. So, from time to time, HAProxy hangs and may report huge latencies, but in the end it recovers and continues to process data. That is exactly the purpose of the watchdog: reporting hidden bugs related to spinning loops and deadlocks.

However, I may be wrong. It may just be a contention problem, because you are executing Lua with 64 threads and a huge workload. In that case, you may want to try 2.4 (under development). It offers a way to have a separate Lua context for each thread, by loading the scripts with the "lua-load-per-thread" directive. Out of curiosity, on 1.8, are you running HAProxy with several threads, or are you spawning several processes?
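
To make the suggestion concrete, here is a minimal sketch of what the global section could look like on 2.4. The script path is purely hypothetical, and the thread count is only taken from the 64 threads mentioned above; adapt both to your setup.

```
global
    # Thread count as mentioned above; adjust to your deployment.
    nbthread 64

    # Hypothetical path. With lua-load-per-thread, each thread loads the
    # script into its own Lua context instead of sharing the single
    # context that "lua-load" creates.
    lua-load-per-thread /etc/haproxy/lua/example.lua
```

Keep in mind that with one context per thread, Lua global variables are no longer shared between threads, so a script relying on shared global state would need to be reviewed before switching.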

--
Christopher Faulet
