Le 16/03/2021 à 13:46, Maciej Zdeb a écrit :
Sorry for spam. In the last message I said that the old process (after reload) is consuming cpu for lua processing and that's not true, it is processing other things also.

I'll take a break. ;) Then I'll verify if the issue exists on 2.3 and maybe 2.4 branch. For each version I need a week or two to be sure the issue does not occur. :(

If 2.3 and 2.4 behaves the same way the 2.2 does, I'll try to confirm if there is any relation between infinite loops and custom configuration:
- lua scripts (mainly used for header generation/manipulation),
- spoe (used for sending metadata about each request to external service),
- peers (we have a cluster of 12 HAProxy servers connected to each other).

Hi Maciej,

I've read more carefully your backtraces, and indeed, it seems to be related to lua processing. I don't know if the watchdog is triggered because of the lua or if it is just a side-effect. But the lua execution is interrupted inside the memory allocator. And malloc/realloc are not async-signal-safe. Unfortunately, when the lua stack is dumped, the same allocator is also used. At this stage, because a lock was not properly released, HAProxy enter in a deadlock.

On other threads, we loop in the watchdog, waiting for the hand to dump the thread information and that explains the 100% CPU usage you observe.

So, to prevent this situation, the lua stack must not be dumped if it was interrupted inside an unsafe part. It is the easiest way we found to workaround this bug. And because it is pretty rare, it should be good enough.

However, I'm unable to reproduce the bug. Could you test attached patches please ? I attached patched for the 2.4, 2.3 and 2.2. Because you experienced this bug on the 2.2, it is probably easier to test patches for this version.

If fixed, it could be good to figure out why the watchdog is triggered on your old processes.

--
Christopher Faulet
>From a61789a1d62fd71c751189faf5371740dd375f33 Mon Sep 17 00:00:00 2001
From: Christopher Faulet <cfau...@haproxy.com>
Date: Fri, 19 Mar 2021 15:16:28 +0100
Subject: [PATCH 1/2] MEDIUM: lua: Use a per-thread counter to track some
 non-reentrant parts of lua

Some parts of the Lua are non-reentrant. We must be sure to carefully track
these parts to not dump the lua stack when it is interrupted inside such
parts. For now, we only identified the custom lua allocator. If the thread
is interrupted during the memory allocation, we must not try to print the
lua stack wich also allocate memory. Indeed, realloc() is not
async-signal-safe.

In this patch we introduce a thread-local counter. It is incremented before
entering in a non-reentrant part and decremented when exiting. It is only
performed in hlua_alloc() for now.
---
 include/haproxy/hlua.h |  1 +
 src/hlua.c             | 13 +++++++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/haproxy/hlua.h b/include/haproxy/hlua.h
index 6aca17f39..36629e637 100644
--- a/include/haproxy/hlua.h
+++ b/include/haproxy/hlua.h
@@ -50,6 +50,7 @@ void hlua_applet_tcp_fct(struct appctx *ctx);
 void hlua_applet_http_fct(struct appctx *ctx);
 struct task *hlua_process_task(struct task *task, void *context, unsigned int state);
 
+extern THREAD_LOCAL unsigned int hlua_not_dumpable;
 #else /* USE_LUA */
 
 /************************ For use when Lua is disabled ********************/
diff --git a/src/hlua.c b/src/hlua.c
index 42d5be2a5..962195a60 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -274,6 +274,9 @@ struct hlua_mem_allocator {
 
 static struct hlua_mem_allocator hlua_global_allocator THREAD_ALIGNED(64);
 
+ /* > 0 if lua is in a non-rentrant part, thus with a non-dumpable stack */
+THREAD_LOCAL unsigned int hlua_not_dumpable = 0;
+
 /* These functions converts types between HAProxy internal args or
  * sample and LUA types. Another function permits to check if the
  * LUA stack contains arguments according with an required ARG_T
@@ -8634,8 +8637,12 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 	/* a limit of ~0 means unlimited and boot complete, so there's no need
 	 * for accounting anymore.
 	 */
-	if (likely(~zone->limit == 0))
-		return realloc(ptr, nsize);
+	if (likely(~zone->limit == 0)) {
+		hlua_not_dumpable++;
+		ptr = realloc(ptr, nsize);
+		hlua_not_dumpable--;
+		return ptr;
+	}
 
 	if (!ptr)
 		osize = 0;
@@ -8649,7 +8656,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 			return NULL;
 	} while (!_HA_ATOMIC_CAS(&zone->allocated, &old, new));
 
+	hlua_not_dumpable++;
 	ptr = realloc(ptr, nsize);
+	hlua_not_dumpable--;
 
 	if (unlikely(!ptr && nsize)) // failed
 		_HA_ATOMIC_SUB(&zone->allocated, nsize - osize);
-- 
2.30.2

>From 83926a04febe94f332c6f45c98773286678346d3 Mon Sep 17 00:00:00 2001
From: Christopher Faulet <cfau...@haproxy.com>
Date: Fri, 19 Mar 2021 15:41:08 +0100
Subject: [PATCH 2/2] BUG/MEDIUM: debug/lua: Don't dump the lua stack if not
 dumpable

When we try to dump the stack of a lua context, if it is not dumpable,
nothing is performed and a message is emitted instead. This happens when a
lua execution was interrupted inside a non-reentrant part.

This patch depends on following commit :

 * MEDIUM: lua: Use a per-thread counter to track some non-reentrant parts of lua

Thanks to this patch, we avoid a possible deadllock if the lua is
interrupted by the watchdog in the lua memory allocator, because realloc()
is not async-signal-safe.

Both patches must be backported as far as 2.0.
---
 src/debug.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/debug.c b/src/debug.c
index f86d05490..934682857 100644
--- a/src/debug.c
+++ b/src/debug.c
@@ -264,9 +264,13 @@ void ha_task_dump(struct buffer *buf, const struct task *task, const char *pfx)
 	}
 
 	if (hlua && hlua->T) {
-		luaL_traceback(hlua->T, hlua->T, NULL, 0);
-		if (!append_prefixed_str(buf, lua_tostring(hlua->T, -1), pfx, '\n', 1))
-			b_putchr(buf, '\n');
+		if (hlua_not_dumpable == 0) {
+			luaL_traceback(hlua->T, hlua->T, NULL, 0);
+			if (!append_prefixed_str(buf, lua_tostring(hlua->T, -1), pfx, '\n', 1))
+				b_putchr(buf, '\n');
+		}
+		else
+			chunk_appendf(buf, "Inside non-rentrant part, Stack traceback not available\n");
 	}
 	else
 		b_putchr(buf, '\n');
-- 
2.30.2

>From 7e5881ad8a5829ed8d162c527ab91e9a7ae45eae Mon Sep 17 00:00:00 2001
From: Christopher Faulet <cfau...@haproxy.com>
Date: Fri, 19 Mar 2021 15:16:28 +0100
Subject: [PATCH 1/2] MEDIUM: lua: Use a per-thread counter to track some
 non-reentrant parts of lua

Some parts of the Lua are non-reentrant. We must be sure to carefully track
these parts to not dump the lua stack when it is interrupted inside such
parts. For now, we only identified the custom lua allocator. If the thread
is interrupted during the memory allocation, we must not try to print the
lua stack wich also allocate memory. Indeed, realloc() is not
async-signal-safe.

In this patch we introduce a thread-local counter. It is incremented before
entering in a non-reentrant part and decremented when exiting. It is only
performed in hlua_alloc() for now.
---
 include/haproxy/hlua.h | 1 +
 src/hlua.c             | 9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/include/haproxy/hlua.h b/include/haproxy/hlua.h
index 5d1153bb4..29656258e 100644
--- a/include/haproxy/hlua.h
+++ b/include/haproxy/hlua.h
@@ -50,6 +50,7 @@ void hlua_applet_tcp_fct(struct appctx *ctx);
 void hlua_applet_http_fct(struct appctx *ctx);
 struct task *hlua_process_task(struct task *task, void *context, unsigned short state);
 
+extern THREAD_LOCAL unsigned int hlua_not_dumpable;
 #else /* USE_LUA */
 
 /************************ For use when Lua is disabled ********************/
diff --git a/src/hlua.c b/src/hlua.c
index 82445930e..990080b8c 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -237,6 +237,9 @@ struct hlua_mem_allocator {
 
 static struct hlua_mem_allocator hlua_global_allocator;
 
+ /* > 0 if lua is in a non-rentrant part, thus with a non-dumpable stack */
+THREAD_LOCAL unsigned int hlua_not_dumpable = 0;
+
 /* These functions converts types between HAProxy internal args or
  * sample and LUA types. Another function permits to check if the
  * LUA stack contains arguments according with an required ARG_T
@@ -8241,7 +8244,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 		/* it's a free */
 		if (ptr)
 			zone->allocated -= osize;
+		hlua_not_dumpable++;
 		free(ptr);
+		hlua_not_dumpable--;
 		return NULL;
 	}
 
@@ -8250,7 +8255,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 		if (zone->limit && zone->allocated + nsize > zone->limit)
 			return NULL;
 
+		hlua_not_dumpable++;
 		ptr = malloc(nsize);
+		hlua_not_dumpable--;
 		if (ptr)
 			zone->allocated += nsize;
 		return ptr;
@@ -8260,7 +8267,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 	if (zone->limit && zone->allocated + nsize - osize > zone->limit)
 		return NULL;
 
+	hlua_not_dumpable++;
 	ptr = realloc(ptr, nsize);
+	hlua_not_dumpable--;
 	if (ptr)
 		zone->allocated += nsize - osize;
 	return ptr;
-- 
2.30.2

>From 063f1235402bcc01e19b79c229ac14177bfa4319 Mon Sep 17 00:00:00 2001
From: Christopher Faulet <cfau...@haproxy.com>
Date: Fri, 19 Mar 2021 15:41:08 +0100
Subject: [PATCH 2/2] BUG/MEDIUM: debug/lua: Don't dump the lua stack if not
 dumpable

When we try to dump the stack of a lua context, if it is not dumpable,
nothing is performed and a message is emitted instead. This happens when a
lua execution was interrupted inside a non-reentrant part.

This patch depends on following commit :

 * MEDIUM: lua: Use a per-thread counter to track some non-reentrant parts of lua

Thanks to this patch, we avoid a possible deadllock if the lua is
interrupted by the watchdog in the lua memory allocator, because realloc()
is not async-signal-safe.

Both patches must be backported as far as 2.0.
---
 src/debug.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/debug.c b/src/debug.c
index 30e155964..a47fe1000 100644
--- a/src/debug.c
+++ b/src/debug.c
@@ -223,9 +223,13 @@ void ha_task_dump(struct buffer *buf, const struct task *task, const char *pfx)
 	}
 
 	if (hlua && hlua->T) {
-		luaL_traceback(hlua->T, hlua->T, NULL, 0);
-		if (!append_prefixed_str(buf, lua_tostring(hlua->T, -1), pfx, '\n', 1))
-			b_putchr(buf, '\n');
+		if (hlua_not_dumpable == 0) {
+			luaL_traceback(hlua->T, hlua->T, NULL, 0);
+			if (!append_prefixed_str(buf, lua_tostring(hlua->T, -1), pfx, '\n', 1))
+				b_putchr(buf, '\n');
+		}
+		else
+			chunk_appendf(buf, "Inside non-rentrant part, Stack traceback not available\n");
 	}
 	else
 		b_putchr(buf, '\n');
-- 
2.30.2

>From b8ebd1cb39a84c9ca0bc7b426ac20a5b079170f9 Mon Sep 17 00:00:00 2001
From: Christopher Faulet <cfau...@haproxy.com>
Date: Fri, 19 Mar 2021 15:16:28 +0100
Subject: [PATCH 1/2] MEDIUM: lua: Use a per-thread counter to track some
 non-reentrant parts of lua

Some parts of the Lua are non-reentrant. We must be sure to carefully track
these parts to not dump the lua stack when it is interrupted inside such
parts. For now, we only identified the custom lua allocator. If the thread
is interrupted during the memory allocation, we must not try to print the
lua stack wich also allocate memory. Indeed, realloc() is not
async-signal-safe.

In this patch we introduce a thread-local counter. It is incremented before
entering in a non-reentrant part and decremented when exiting. It is only
performed in hlua_alloc() for now.
---
 include/haproxy/hlua.h | 1 +
 src/hlua.c             | 9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/include/haproxy/hlua.h b/include/haproxy/hlua.h
index 5d1153bb4c..29656258e5 100644
--- a/include/haproxy/hlua.h
+++ b/include/haproxy/hlua.h
@@ -50,6 +50,7 @@ void hlua_applet_tcp_fct(struct appctx *ctx);
 void hlua_applet_http_fct(struct appctx *ctx);
 struct task *hlua_process_task(struct task *task, void *context, unsigned short state);
 
+extern THREAD_LOCAL unsigned int hlua_not_dumpable;
 #else /* USE_LUA */
 
 /************************ For use when Lua is disabled ********************/
diff --git a/src/hlua.c b/src/hlua.c
index 8656860d5f..5bef67a48b 100644
--- a/src/hlua.c
+++ b/src/hlua.c
@@ -236,6 +236,9 @@ struct hlua_mem_allocator {
 
 static struct hlua_mem_allocator hlua_global_allocator;
 
+ /* > 0 if lua is in a non-rentrant part, thus with a non-dumpable stack */
+THREAD_LOCAL unsigned int hlua_not_dumpable = 0;
+
 /* These functions converts types between HAProxy internal args or
  * sample and LUA types. Another function permits to check if the
  * LUA stack contains arguments according with an required ARG_T
@@ -8192,7 +8195,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 		/* it's a free */
 		if (ptr)
 			zone->allocated -= osize;
+		hlua_not_dumpable++;
 		free(ptr);
+		hlua_not_dumpable--;
 		return NULL;
 	}
 
@@ -8201,7 +8206,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 		if (zone->limit && zone->allocated + nsize > zone->limit)
 			return NULL;
 
+		hlua_not_dumpable++;
 		ptr = malloc(nsize);
+		hlua_not_dumpable--;
 		if (ptr)
 			zone->allocated += nsize;
 		return ptr;
@@ -8211,7 +8218,9 @@ static void *hlua_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
 	if (zone->limit && zone->allocated + nsize - osize > zone->limit)
 		return NULL;
 
+	hlua_not_dumpable++;
 	ptr = realloc(ptr, nsize);
+	hlua_not_dumpable--;
 	if (ptr)
 		zone->allocated += nsize - osize;
 	return ptr;
-- 
2.30.2

>From 0f5b7fa916c032b4e9a193d96bf24ef3c3289bb3 Mon Sep 17 00:00:00 2001
From: Christopher Faulet <cfau...@haproxy.com>
Date: Fri, 19 Mar 2021 15:41:08 +0100
Subject: [PATCH 2/2] BUG/MEDIUM: debug/lua: Don't dump the lua stack if not
 dumpable

When we try to dump the stack of a lua context, if it is not dumpable,
nothing is performed and a message is emitted instead. This happens when a
lua execution was interrupted inside a non-reentrant part.

This patch depends on following commit :

 * MEDIUM: lua: Use a per-thread counter to track some non-reentrant parts of lua

Thanks to this patch, we avoid a possible deadllock if the lua is
interrupted by the watchdog in the lua memory allocator, because realloc()
is not async-signal-safe.

Both patches must be backported as far as 2.0.
---
 src/debug.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/debug.c b/src/debug.c
index b2a4c6bacb..028271a2e3 100644
--- a/src/debug.c
+++ b/src/debug.c
@@ -223,9 +223,13 @@ void ha_task_dump(struct buffer *buf, const struct task *task, const char *pfx)
 	}
 
 	if (hlua && hlua->T) {
-		luaL_traceback(hlua->T, hlua->T, NULL, 0);
-		if (!append_prefixed_str(buf, lua_tostring(hlua->T, -1), pfx, '\n', 1))
-			b_putchr(buf, '\n');
+		if (hlua_not_dumpable == 0) {
+			luaL_traceback(hlua->T, hlua->T, NULL, 0);
+			if (!append_prefixed_str(buf, lua_tostring(hlua->T, -1), pfx, '\n', 1))
+				b_putchr(buf, '\n');
+		}
+		else
+			chunk_appendf(buf, "Inside non-rentrant part, Stack traceback not available\n");
 	}
 	else
 		b_putchr(buf, '\n');
-- 
2.30.2

Reply via email to