Re: [2.0.17] crash with coredump

2020-11-13 Thread Christopher Faulet

On 11/11/2020 at 12:43, Maciej Zdeb wrote:
Wow! Yes, I can confirm that a crash does not occur now. :) I checked 2.0 and 
2.2 branches. I'll keep testing it for a couple days just to be sure.


So the stack trace I shared before (in the spoe_release_appctx function) was very 
lucky... Do you think that it'd be possible to find the bug without the 
replication procedure?


Christopher & Willy many thanks for your hard work! I'm always impressed how 
fast you're able to narrow the bug when you finally get proper input from a 
reporter. :)




This patch is now merged and backported as far as 2.0. So many thanks, Maciej!

--
Christopher Faulet



Re: [2.0.17] crash with coredump

2020-11-11 Thread Maciej Zdeb
On Wed, 11 Nov 2020 at 12:53, Willy Tarreau wrote:

> Two months of chasing a non reproducible
> memory corruption with zero initial info is quite an achievement, many
> thanks for doing that!
>

Initially it crashed (once every few hours) only on our most critical
HAProxy servers and with an SPOA from an external vendor, then on less critical
but still production servers. It took a lot of time to analyze our config
and not break anything. I was almost sure it was something specific to our
configuration, especially since nobody else had reported similar problems with SPOE.

Your patch with additional checks was a game changer: it made the crash easier
to trigger and thus easier to replicate! :)


Re: [2.0.17] crash with coredump

2020-11-11 Thread Willy Tarreau
On Wed, Nov 11, 2020 at 12:43:50PM +0100, Maciej Zdeb wrote:
> Wow! Yes, I can confirm that a crash does not occur now. :) I checked 2.0
> and 2.2 branches. I'll keep testing it for a couple days just to be sure.
> 
> So the stack trace I shared before (in the spoe_release_appctx function) was
> very lucky... Do you think that it'd be possible to find the bug without
> the replication procedure?

Very unlikely. I actually continued from your indications and noticed that
each time I had a crash, a pointer that was supposed to be aligned had
regressed by one. This reminded me of the NULL pointer that became -1.
I thought it was related to the pools since it often crashed there, and
in parallel Christopher looked for decrements in the SPOE part and found
that some resets to NULL were missing there on aborts.

> Christopher & Willy many thanks for your hard work!

Let me return you the compliment! Two months of chasing a non reproducible
memory corruption with zero initial info is quite an achievement, many
thanks for doing that!

> I'm always impressed
> how fast you're able to narrow the bug when you finally get proper input
> from a reporter. :)

It's very simple, the code is huge and any piece could be responsible for
any problem. Sometimes you have a good nose and manage to narrow down the
issue in an area. Sometimes you just read a piece of code and figure it
can do something nasty. Sometimes other reports come in and help rule out
other hypotheses. But when there's nothing logical, most often it's a memory
corruption and then there's no other solution than being able to observe it
live and heavily instrument the code to go back in time from the crash to
the cause. In your case we were lucky, threads were not involved, otherwise
this adds another dimension, and very often the instrumentation code changes
the timings and makes the issue disappear :-)

Cheers,
Willy



Re: [2.0.17] crash with coredump

2020-11-11 Thread Maciej Zdeb
Wow! Yes, I can confirm that a crash does not occur now. :) I checked 2.0
and 2.2 branches. I'll keep testing it for a couple days just to be sure.

So the stack trace I shared before (in the spoe_release_appctx function) was
very lucky... Do you think that it'd be possible to find the bug without
the replication procedure?

Christopher & Willy many thanks for your hard work! I'm always impressed
how fast you're able to narrow the bug when you finally get proper input
from a reporter. :)

On Tue, 10 Nov 2020 at 22:30, Willy Tarreau wrote:

> Hi Christopher,
>
> On Tue, Nov 10, 2020 at 09:17:15PM +0100, Christopher Faulet wrote:
> > On 10/11/2020 at 18:12, Maciej Zdeb wrote:
> > > Hi,
> > >
> > > I'm so happy you're able to replicate it! :)
> > >
> > > With that patch that disabled pool_flush I can still reproduce it on my r
> > > server and in production, just at different crash locations:
> > >
> >
> > Hi Maciej,
> >
> > Could you test the following patch, please? For now I don't know if it fully
> > fixes the bug, but it is a step forward. I must do a deeper review to be sure
> > it covers all cases.
>
> Looks like you got it right this time: not only does it not crash anymore
> in my tests, but the suspiciously wrong cur_fap values that were going negative
> very quickly do not happen anymore either! This is very good news! Looking
> forward to reading about Maciej's tests.
>
> Cheers,
> Willy
>


Re: [2.0.17] crash with coredump

2020-11-10 Thread Willy Tarreau
Hi Christopher,

On Tue, Nov 10, 2020 at 09:17:15PM +0100, Christopher Faulet wrote:
> On 10/11/2020 at 18:12, Maciej Zdeb wrote:
> > Hi,
> > 
> > I'm so happy you're able to replicate it! :)
> > 
> > With that patch that disabled pool_flush I can still reproduce it on my r
> > server and in production, just at different crash locations:
> > 
> 
> Hi Maciej,
> 
> Could you test the following patch, please? For now I don't know if it fully
> fixes the bug, but it is a step forward. I must do a deeper review to be sure
> it covers all cases.

Looks like you got it right this time: not only does it not crash anymore
in my tests, but the suspiciously wrong cur_fap values that were going negative
very quickly do not happen anymore either! This is very good news! Looking
forward to reading about Maciej's tests.

Cheers,
Willy



Re: [2.0.17] crash with coredump

2020-11-10 Thread Christopher Faulet

On 10/11/2020 at 18:12, Maciej Zdeb wrote:

Hi,

I'm so happy you're able to replicate it! :)

With that patch that disabled pool_flush I can still reproduce it on my r server
and in production, just at different crash locations:




Hi Maciej,

Could you test the following patch, please? For now I don't know if it fully
fixes the bug, but it is a step forward. I must do a deeper review to be sure it
covers all cases.


Thanks !

--
Christopher Faulet
>From 7b1996335f8bd33fc3180003dfb57c4d55fa6a60 Mon Sep 17 00:00:00 2001
From: Christopher Faulet 
Date: Tue, 10 Nov 2020 18:45:34 +0100
Subject: [PATCH] BUG/MEDIUM: spoe: Be sure to remove all references on a
 released spoe applet

When a SPOE applet is used to send a frame, a reference to this applet is saved
in the spoe context of the offloaded stream. But if the applet is released
before receiving the corresponding ack, we must be sure to remove this
reference. This was performed for fragmented frames only. But it must also be
performed for the spoe contexts in the applet waiting_queue and in the thread
waiting queue (used in async mode).

This patch must be backported to all versions where the spoe is supported (>=
1.7).
---
 src/flt_spoe.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/flt_spoe.c b/src/flt_spoe.c
index a91906e105..33b312688e 100644
--- a/src/flt_spoe.c
+++ b/src/flt_spoe.c
@@ -1253,6 +1253,7 @@ spoe_release_appctx(struct appctx *appctx)
 		LIST_INIT(&ctx->list);
 		_HA_ATOMIC_SUB(&agent->counters.nb_waiting, 1);
 		spoe_update_stat_time(&ctx->stats.tv_wait, &ctx->stats.t_waiting);
+		ctx->spoe_appctx = NULL;
 		ctx->state = SPOE_CTX_ST_ERROR;
 		ctx->status_code = (spoe_appctx->status_code + 0x100);
 		TEST_STRM(ctx->strm);
@@ -1270,8 +1271,13 @@ spoe_release_appctx(struct appctx *appctx)
 		task_wakeup(ctx->strm->task, TASK_WOKEN_MSG);
 	}
 
-	if (!LIST_ISEMPTY(&agent->rt[tid].applets))
+	if (!LIST_ISEMPTY(&agent->rt[tid].applets)) {
+		list_for_each_entry_safe(ctx, back, &agent->rt[tid].waiting_queue, list) {
+			if (ctx->spoe_appctx == spoe_appctx)
+				ctx->spoe_appctx = NULL;
+		}
 		goto end;
+	}
 
 	/* If this was the last running applet, notify all waiting streams */
 	list_for_each_entry_safe(ctx, back, &agent->rt[tid].sending_queue, list) {
@@ -1279,6 +1285,7 @@ spoe_release_appctx(struct appctx *appctx)
 		LIST_INIT(&ctx->list);
 		_HA_ATOMIC_SUB(&agent->counters.nb_sending, 1);
 		spoe_update_stat_time(&ctx->stats.tv_queue, &ctx->stats.t_queue);
+		ctx->spoe_appctx = NULL;
 		ctx->state = SPOE_CTX_ST_ERROR;
 		ctx->status_code = (spoe_appctx->status_code + 0x100);
 		TEST_STRM(ctx->strm);
@@ -1289,6 +1296,7 @@ spoe_release_appctx(struct appctx *appctx)
 		LIST_INIT(&ctx->list);
 		_HA_ATOMIC_SUB(&agent->counters.nb_waiting, 1);
 		spoe_update_stat_time(&ctx->stats.tv_wait, &ctx->stats.t_waiting);
+		ctx->spoe_appctx = NULL;
 		ctx->state = SPOE_CTX_ST_ERROR;
 		ctx->status_code = (spoe_appctx->status_code + 0x100);
 		TEST_STRM(ctx->strm);
-- 
2.26.2
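
To make the failure mode concrete, here is a much-simplified sketch
(hypothetical types, not the real HAProxy structures) of the dangling
reference the patch closes: if the applet is freed while a stream's spoe
context still points at it, the later stream-side cleanup dereferences freed
memory, which matches the crash previously seen in spoe_stop_processing().

#include <stdlib.h>

/* Hypothetical, reduced types for illustration only. */
struct spoe_appctx_x { int status_code; };
struct spoe_ctx_x    { struct spoe_appctx_x *spoe_appctx; };

/* Applet teardown: before the fix, contexts parked in the waiting queues kept
 * their spoe_appctx pointer, which went stale once the applet was freed. */
void release_applet(struct spoe_appctx_x *sa, struct spoe_ctx_x *waiting_ctx)
{
	waiting_ctx->spoe_appctx = NULL;   /* the fix: drop every reference first */
	free(sa);
}

/* Later, stream-side cleanup (think spoe_stop_processing()): without the reset
 * above, ctx->spoe_appctx still points at freed memory and reading any of its
 * fields is a use-after-free. */
void stop_processing(struct spoe_ctx_x *ctx)
{
	struct spoe_appctx_x *sa = ctx->spoe_appctx;

	if (sa)
		(void)sa->status_code;
}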



Re: [2.0.17] crash with coredump

2020-11-10 Thread Maciej Zdeb
Hi,

I'm so happy you're able to replicate it! :)

With that patch that disabled pool_flush I can still reproduce it on my r
server and in production, just at different crash locations:

on r:
(gdb) bt
#0  tasklet_wakeup (tl=0xd720c300a000) at include/haproxy/task.h:328
#1  h2s_notify_recv (h2s=h2s@entry=0x55d720c2d500) at src/mux_h2.c:1037
#2  0x55d71f44d3a0 in h2s_notify_recv (h2s=0x55d720c2d500) at
include/haproxy/trace.h:150
#3  h2s_close (h2s=0x55d720c2d500) at src/mux_h2.c:1236
#4  0x55d71f450c26 in h2s_frt_make_resp_headers (htx=0x55d720ae4c90,
h2s=0x55d720c2d500) at src/mux_h2.c:4795
#5  h2_snd_buf (cs=0x55d720c31000, buf=0x55d720c2d888, count=182,
flags=) at src/mux_h2.c:5888
#6  0x55d71f4fb9fa in si_cs_send (cs=0x55d720c31000) at
src/stream_interface.c:737
#7  0x55d71f4fc2c0 in si_sync_send (si=si@entry=0x55d720c2db48) at
src/stream_interface.c:914
#8  0x55d71f49ea91 in process_stream (t=,
context=0x55d720c2d810, state=) at src/stream.c:2245
#9  0x55d71f55cfe9 in run_tasks_from_list (list=list@entry=0x55d71f96cb40
, max=max@entry=149) at src/task.c:371
#10 0x55d71f55d7ca in process_runnable_tasks () at src/task.c:519
#11 0x55d71f517c15 in run_poll_loop () at src/haproxy.c:2900
#12 0x55d71f517fc9 in run_thread_poll_loop (data=) at
src/haproxy.c:3065
#13 0x55d71f3ef87e in main (argc=, argv=0x7fff7a4ef218)
at src/haproxy.c:3767

on production:
#0  0x557070df32d6 in h2s_notify_recv (h2s=0x7ff0fc696670) at
src/mux_h2.c:1035
#1  h2s_close (h2s=0x7ff0fc696670) at src/mux_h2.c:1236
#2  0x557070df7922 in h2s_frt_make_resp_data (count=,
buf=0x7ff0ec87ee78, h2s=0x7ff0fc696670) at src/mux_h2.c:5466
#3  h2_snd_buf (cs=0x7ff118af9790, buf=0x7ff0ec87ee78, count=3287,
flags=) at src/mux_h2.c:5903
#4  0x557070ea19fa in si_cs_send (cs=cs@entry=0x7ff118af9790) at
src/stream_interface.c:737
#5  0x557070ea1c5b in stream_int_chk_snd_conn (si=0x7ff0ec87f138) at
src/stream_interface.c:1121
#6  0x557070e9f112 in si_chk_snd (si=0x7ff0ec87f138) at
include/haproxy/stream_interface.h:488
#7  stream_int_notify (si=si@entry=0x7ff0ec87f190) at
src/stream_interface.c:490
#8  0x557070ea1f48 in si_cs_process (cs=cs@entry=0x7ff0fc93d9d0) at
src/stream_interface.c:624
#9  0x557070ea31fa in si_cs_io_cb (t=,
ctx=0x7ff0ec87f190, state=) at src/stream_interface.c:792
#10 0x557070f030ed in run_tasks_from_list (list=list@entry=0x557071312c50
, max=) at src/task.c:348
#11 0x557070f037da in process_runnable_tasks () at src/task.c:523
#12 0x557070ebdc15 in run_poll_loop () at src/haproxy.c:2900
#13 0x557070ebdfc9 in run_thread_poll_loop (data=) at
src/haproxy.c:3065
#14 0x7ff1cc2f16db in start_thread (arg=0x7ff11f7da700) at
pthread_create.c:463
#15 0x7ff1cb287a3f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

On Tue, 10 Nov 2020 at 17:19, Willy Tarreau wrote:

> On Tue, Nov 10, 2020 at 04:14:52PM +0100, Willy Tarreau wrote:
> > Seems like we're getting closer. Will continue digging now.
>
> I found that among the 5 crashes I got, 3 were under pool_flush()
> that is precisely called during the soft stopping. I tried to
> disable that function with the patch below and I can't reproduce
> the problem anymore, it would be nice if you could test it. I'm
> suspecting that either it copes badly with the lockless pools,
> or that pool_gc() itself, called from the signal handler, could
> possibly damage some of the pools and cause some loose objects to
> be used, returned and reused once reallocated. I see no reason
> for the relation with SPOE like this, but maybe it just helps
> trigger the complex condition.
>
> diff --git a/src/pool.c b/src/pool.c
> index 321f8bc67..5e2f41fe9 100644
> --- a/src/pool.c
> +++ b/src/pool.c
> @@ -246,7 +246,7 @@ void pool_flush(struct pool_head *pool)
> void **next, *temp;
> int removed = 0;
>
> -   if (!pool)
> +   //if (!pool)
> return;
> HA_SPIN_LOCK(POOL_LOCK, &pool->lock);
> do {
>
> I'm continuing to investigate.
>
> Willy
>


Re: [2.0.17] crash with coredump

2020-11-10 Thread Willy Tarreau
On Tue, Nov 10, 2020 at 04:14:52PM +0100, Willy Tarreau wrote:
> Seems like we're getting closer. Will continue digging now.

I found that among the 5 crashes I got, 3 were under pool_flush()
that is precisely called during the soft stopping. I tried to
disable that function with the patch below and I can't reproduce
the problem anymore, it would be nice if you could test it. I'm
suspecting that either it copes badly with the lockless pools,
or that pool_gc() itself, called from the signal handler, could
possibly damage some of the pools and cause some loose objects to
be used, returned and reused once reallocated. I see no reason
for the relation with SPOE like this, but maybe it just helps
trigger the complex condition.

diff --git a/src/pool.c b/src/pool.c
index 321f8bc67..5e2f41fe9 100644
--- a/src/pool.c
+++ b/src/pool.c
@@ -246,7 +246,7 @@ void pool_flush(struct pool_head *pool)
 	void **next, *temp;
 	int removed = 0;
 
-	if (!pool)
+	//if (!pool)
 		return;
 	HA_SPIN_LOCK(POOL_LOCK, &pool->lock);
 	do {
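
For readers unfamiliar with the lockless pools: the structure involved is a
LIFO free list updated with compare-and-swap, roughly like the sketch below
(simplified, not HAProxy's actual implementation). The classic way such a list
gets damaged is an ABA-style race, where elements are freed and their addresses
reused between the load of the head and the CAS; this is only meant to show the
general family of problems suspected here, not the actual fault.

#include <stdatomic.h>
#include <stddef.h>

/* Very reduced sketch of a lock-free LIFO free list, for illustration only. */
struct pool_item { struct pool_item *next; };

struct lockless_pool {
	_Atomic(struct pool_item *) free_list;
};

struct pool_item *pool_get(struct lockless_pool *p)
{
	struct pool_item *head = atomic_load(&p->free_list);
	struct pool_item *next;

	do {
		if (!head)
			return NULL;
		next = head->next;
		/* ABA hazard: suppose head is A and next is B. If, between the
		 * loads above and the CAS below, other code pops A and B and then
		 * pushes A back (or a signal handler runs a gc doing the
		 * equivalent), the CAS still sees A on top and succeeds, putting
		 * B back at the head of the list even though B is no longer free. */
	} while (!atomic_compare_exchange_weak(&p->free_list, &head, next));

	return head;
}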

I'm continuing to investigate.

Willy



Re: [2.0.17] crash with coredump

2020-11-10 Thread Willy Tarreau
Hi Maciej,

On Tue, Nov 10, 2020 at 03:21:45PM +0100, Maciej Zdeb wrote:
> Hi,
> 
> I'm very sorry that my skills in gdb and knowledge of HAProxy and C are not
> sufficient for this debugging process.

Quite frankly, you don't have to be sorry for anything :-)

I could reproduce the crash on 2.2 with your procedure, but it happened
not on our canaries but here:

Core was generated by `../haproxy-2.2/haproxy -D -f maciej/haproxy2.cfg -sf 
27670'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  spoe_stop_processing (ctx=0x45def60, agent=0x12059c0) at src/flt_spoe.c:2576
2576            if (sa->frag_ctx.ctx == ctx) {
(gdb) p sa
$1 = (struct spoe_appctx *) 0x6b9e220
(gdb) p *sa
Cannot access memory at address 0x6b9e220

(gdb) bt
#0  spoe_stop_processing (ctx=0x45def60, agent=0x12059c0) at src/flt_spoe.c:2576
#1  spoe_process_messages (s=s@entry=0x2822f80, ctx=ctx@entry=0x45def60, 
messages=, dir=dir@entry=0, type=type@entry=1) at 
src/flt_spoe.c:2706
#2  0x005491e3 in spoe_process_event (s=0x2822f80, ctx=0x45def60, 
ev=ev@entry=SPOE_EV_ON_HTTP_REQ_FE) at src/flt_spoe.c:2780
#3  0x0054940a in spoe_chn_pre_analyze (s=, 
filter=, chn=0x2822f90, an_bit=) at 
src/flt_spoe.c:3249
#4  0x00551bd7 in flt_pre_analyze (s=s@entry=0x2822f80, 
chn=chn@entry=0x2822f90, an_bit=an_bit@entry=16) at src/filters.c:718
#5  0x004f1176 in process_stream (t=, context=0x2822f80, 
state=) at src/stream.c:1796
#6  0x005b0533 in run_tasks_from_lists 
(budgets=budgets@entry=0x7ffcc02fa2d4) at src/task.c:479
#7  0x005b101a in process_runnable_tasks () at src/task.c:675
#8  0x0056c194 in run_poll_loop () at src/haproxy.c:2923
#9  0x0056c4d3 in run_thread_poll_loop (data=data@entry=0x0) at 
src/haproxy.c:3088
#10 0x004275a9 in main (argc=, argv=0x7ffcc02fa6e8) at 
src/haproxy.c:3793

Seems like we're getting closer. Will continue digging now.

Many thanks Maciej for all your hard work, really!

Willy



Re: [2.0.17] crash with coredump

2020-11-10 Thread Maciej Zdeb
Hi,

I'm very sorry that my skills in gdb and knowledge of HAProxy and C are not
sufficient for this debugging process.

With the patch applied I tried again to use the spoa from
"contrib/spoa_example/". The example spoa agent does not understand my
spoe-message and silently ignores it, but that doesn't matter.

To trigger the segmentation fault I must reload HAProxy (when using the spoa from
an external vendor this additional reload wasn't necessary; I just had
to wait a couple of seconds to trigger the crash).

Usually HAProxy crashes in process_stream, but once it crashed at the
(long)h2s->subs & 1 check in testcorrupt during spoe_release_appctx:
#0  0x5597450c25f9 in testcorrupt (ptr=0x7f4fb8071990) at
src/mux_h2.c:6238
cs = 0x7f4fb8071990
h2s = 0x7f4fe85751f0
#1  0x559745196239 in spoe_release_appctx (appctx=0x7f4fe8324e00) at
src/flt_spoe.c:1294
si = 0x7f4fe82b31f8
spoe_appctx = 0x7f4fe88dd760
agent = 0x559746052580
ctx = 0x7f4fe8380b80
back = 0x559746355b38

Then I tried again to replicate the bug on my r server, this time reloading
HAProxy (multiple times) during the test, and it crashed.

HAProxy was compiled with git HEAD set to
77015abe0bcfde67bff519b1d48393a513015f77 with patch
0001-EXP-try-to-spot-where-h2s-subs-changes-V2.patch applied
and with modified h2s:

diff --git a/src/mux_h2.c b/src/mux_h2.c
index 9928b32c7..3d5187271 100644
--- a/src/mux_h2.c
+++ b/src/mux_h2.c
@@ -206,6 +206,8 @@ struct h2s {
  uint16_t status; /* HTTP response status */
  unsigned long long body_len; /* remaining body length according to
content-length if H2_SF_DATA_CLEN */
  struct buffer rxbuf; /* receive buffer, always valid (buf_empty or real
buffer) */
+ struct tasklet *dummy0;
+ struct wait_event *dummy1;
  struct wait_event *subs;  /* recv wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
  struct list list; /* To be used when adding in h2c->send_list or
h2c->fctl_lsit */
  struct tasklet *shut_tl;  /* deferred shutdown tasklet, to retry to send
an RST after we failed to,

Attached:
haproxy.cfg (/etc/haproxy/haproxy.cfg main config)
spoe-example.conf (/etc/haproxy/spoe-example.conf spoe config)

I used spoa from contrib/spoa_example run with command:
"./spoa -p 4545  -c fragmentation -c async -c pipelining"

I used vegeta to generate traffic: https://github.com/tsenart/vegeta with
command:
"cat input | ./vegeta attack -duration=360s -insecure   -keepalive=false
 -http2=true -rate=500/1s > /dev/null"
I used 2 virtual machines to generate traffic and additionally I've
launched vegeta on host with HAProxy

where input file is:
GET https://haproxy-crash.test.local/
zdebek:
sdofijdsoifjodisjfoisdjfoisdovisoivjdfoijvoisdjvopsdijg0934u49032ut09gir09j40g9u0492it093i2g09i0r9bi2490ib094i0b9i09i0924it09biitk42jh09tj4309sdfjdlsjfoadiwe9023i0r92094i4309gi0934ig9034ig093i4g90i3409gi3409gi0394ig0934i0g93jjoujgiurhjgiuerhgiurehgiuerhg89489u098u509u09wrut0923ej23fjjsufdsuf98dusf98u98u2398uf9834uf983u49f8h98huish9fsdu98fusd98uf982u398u3298ru2938uffhsdijhfisdjhiusdhfiu2iuhf2398289823189831893198931udashidsah

I reloaded the HAProxy configuration (multiple times, again and again until
the segmentation fault occurred) with:
haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf 10608

On Mon, 9 Nov 2020 at 16:01, Maciej Zdeb wrote:

> It crashed now on first test in process_stream:
>
> struct task *process_stream(struct task *t, void *context, unsigned short
> state)
> {
> struct server *srv;
> struct stream *s = context;
> struct session *sess = s->sess;
> unsigned int rqf_last, rpf_last;
> unsigned int rq_prod_last, rq_cons_last;
> unsigned int rp_cons_last, rp_prod_last;
> unsigned int req_ana_back;
> struct channel *req, *res;
> struct stream_interface *si_f, *si_b;
> unsigned int rate;
>
> TEST_STRM(s);
> [...]
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x55f4cda7b5f9 in testcorrupt (ptr=0x7f75ac1ed990) at
> src/mux_h2.c:6238
> [Current thread is 1 (Thread 0x7f75a98b9700 (LWP 5860))]
> (gdb) bt full
> #0  0x55f4cda7b5f9 in testcorrupt (ptr=0x7f75ac1ed990) at
> src/mux_h2.c:6238
> cs = 0x7f75ac1ed990
> h2s = 0x7f7584244510
> #1  0x55f4cdad8993 in process_stream (t=0x7f75ac139d70,
> context=0x7f7588066540, state=260) at src/stream.c:1499
> srv = 0x7f75a9896390
> s = 0x7f7588066540
> sess = 0x7f759c071b80
> rqf_last = 4294967294
> rpf_last = 2217468112
> rq_prod_last = 32629
> rq_cons_last = 2217603024
> rp_cons_last = 32629
> rp_prod_last = 2217182865
> req_ana_back = 2217603025
> req = 0x7f75a9896350
> res = 0x55f4cdbed618 <__task_queue+92>
> si_f = 0x55f4ce03c680 
> si_b = 0x7f75842def80
> rate = 2217603024
> #2  0x55f4cdbeddb2 in run_tasks_from_list (list=0x55f4ce03c6c0
> , max=150) 

Re: [2.0.17] crash with coredump

2020-11-09 Thread Maciej Zdeb
It crashed now on first test in process_stream:

struct task *process_stream(struct task *t, void *context, unsigned short
state)
{
struct server *srv;
struct stream *s = context;
struct session *sess = s->sess;
unsigned int rqf_last, rpf_last;
unsigned int rq_prod_last, rq_cons_last;
unsigned int rp_cons_last, rp_prod_last;
unsigned int req_ana_back;
struct channel *req, *res;
struct stream_interface *si_f, *si_b;
unsigned int rate;

TEST_STRM(s);
[...]

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x55f4cda7b5f9 in testcorrupt (ptr=0x7f75ac1ed990) at
src/mux_h2.c:6238
[Current thread is 1 (Thread 0x7f75a98b9700 (LWP 5860))]
(gdb) bt full
#0  0x55f4cda7b5f9 in testcorrupt (ptr=0x7f75ac1ed990) at
src/mux_h2.c:6238
cs = 0x7f75ac1ed990
h2s = 0x7f7584244510
#1  0x55f4cdad8993 in process_stream (t=0x7f75ac139d70,
context=0x7f7588066540, state=260) at src/stream.c:1499
srv = 0x7f75a9896390
s = 0x7f7588066540
sess = 0x7f759c071b80
rqf_last = 4294967294
rpf_last = 2217468112
rq_prod_last = 32629
rq_cons_last = 2217603024
rp_cons_last = 32629
rp_prod_last = 2217182865
req_ana_back = 2217603025
req = 0x7f75a9896350
res = 0x55f4cdbed618 <__task_queue+92>
si_f = 0x55f4ce03c680 
si_b = 0x7f75842def80
rate = 2217603024
#2  0x55f4cdbeddb2 in run_tasks_from_list (list=0x55f4ce03c6c0
, max=150) at src/task.c:371
process = 0x55f4cdad892d 
t = 0x7f75ac139d70
state = 260
ctx = 0x7f7588066540
done = 3
[...]

subs is 0x like before BUT dummy1 is also changed to 0x

(gdb) p *(struct h2s*)(0x7f7584244510)
$1 = {cs = 0x7f75ac1ed990, sess = 0x55f4ce02be40 , h2c =
0x7f758417abd0, h1m = {state = H1_MSG_RPBEFORE, flags = 12, curr_len = 0,
body_len = 0, next = 0, err_pos = -1, err_state = 0}, by_id = {node = {
  branches = {b = {0x7f758428e430, 0x7f7584244550}}, node_p =
0x7f758428e431, leaf_p = 0x7f7584244551, bit = 1, pfx = 33828}, key = 23},
id = 23, flags = 16385, sws = 0, errcode = H2_ERR_NO_ERROR, st = H2_SS_HREM,
  status = 0, body_len = 0, rxbuf = {size = 16384, area = 0x7f75780a2210
"Ð?", data = 16384, head = 0}, dummy0 = 0x0, dummy1 = 0x, subs =
0x, list = {n = 0x7f75842445c8, p = 0x7f75842445c8},
  shut_tl = 0x7f75842df0d0}

On Mon, 9 Nov 2020 at 15:07, Christopher Faulet wrote:

> On 09/11/2020 at 13:10, Maciej Zdeb wrote:
> > I've played a little bit with the patch and it led me to the backend.c file
> > and the connect_server() function:
> >
> > int connect_server(struct stream *s)
> > {
> > [...]
> > if (!conn_xprt_ready(srv_conn) && !srv_conn->mux) {
> >  /* set the correct protocol on the output stream
> interface */
> >  if (srv)
> >  conn_prepare(srv_conn,
> > protocol_by_family(srv_conn->dst->ss_family), srv->xprt);
> >  else if (obj_type(s->target) == OBJ_TYPE_PROXY) {
> >  /* proxies exclusively run on raw_sock right
> now */
> >  conn_prepare(srv_conn,
> > protocol_by_family(srv_conn->dst->ss_family), xprt_get(XPRT_RAW));
> >  if (!(srv_conn->ctrl)) {
> >  conn_free(srv_conn);
> >  return SF_ERR_INTERNAL;
> >  }
> >  }
> >  else {
> >  conn_free(srv_conn);
> >  return SF_ERR_INTERNAL;  /* how did we get
> there ? */
> >  }
> > // THIS ONE IS OK
> > TEST_STRM(s);
> > //
> >  srv_cs = si_alloc_cs(&s->si[1], srv_conn);
> > // FAIL
> > TEST_STRM(s);
> > //
> >  if (!srv_cs) {
> >  conn_free(srv_conn);
> >  return SF_ERR_RESOURCE;
> >  }
>
> Hi,
>
> In fact, this crash occurs because of Willy's patch. It was not designed to
> handle non-h2 connections. Here the crash happens on a TCP connection, used
> by a SPOE applet for instance.
>
> I updated his patch. First, I added some calls to TEST_STRM() in the SPOE
> code, to be sure. I also explicitly set the stream task to NULL in
> stream_free() to catch late wakeups in the SPOE. Finally, I modified
> testcorrupt(). I hope this one is correct. But if I missed something, you may
> keep only the last ABORT_NOW() in testcorrupt() and replace the others with a
> return statement, just like in Willy's patch.
>
> --
> Christopher Faulet
>


Re: [2.0.17] crash with coredump

2020-11-09 Thread Christopher Faulet

On 09/11/2020 at 13:10, Maciej Zdeb wrote:
I've played a little bit with the patch and it led me to the backend.c file and
the connect_server() function:


int connect_server(struct stream *s)
{
[...]
if (!conn_xprt_ready(srv_conn) && !srv_conn->mux) {
                 /* set the correct protocol on the output stream interface */
                 if (srv)
                         conn_prepare(srv_conn, 
protocol_by_family(srv_conn->dst->ss_family), srv->xprt);

                 else if (obj_type(s->target) == OBJ_TYPE_PROXY) {
                         /* proxies exclusively run on raw_sock right now */
                         conn_prepare(srv_conn, 
protocol_by_family(srv_conn->dst->ss_family), xprt_get(XPRT_RAW));

                         if (!(srv_conn->ctrl)) {
                                 conn_free(srv_conn);
                                 return SF_ERR_INTERNAL;
                         }
                 }
                 else {
                         conn_free(srv_conn);
                         return SF_ERR_INTERNAL;  /* how did we get there ? */
                 }
// THIS ONE IS OK
TEST_STRM(s);
//
                 srv_cs = si_alloc_cs(&s->si[1], srv_conn);
// FAIL
TEST_STRM(s);
//
                 if (!srv_cs) {
                         conn_free(srv_conn);
                         return SF_ERR_RESOURCE;
                 }


Hi,

In fact, this crash occurs because of Willy's patch. It was not designed to 
handle non-h2 connections. Here the crash happens on a TCP connection, used by a 
SPOE applet for instance.


I updated his patch. First, I added some calls to TEST_STRM() in the SPOE code, 
to be sure. I also explicitly set the stream task to NULL in stream_free() to 
catch late wakeups in the SPOE. Finally, I modified testcorrupt(). I hope this 
one is correct. But if I missed something, you may keep only the last 
ABORT_NOW() in testcorrupt() and replace the others with a return statement, just 
like in Willy's patch.
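
In other words, cs->ctx is owned by whatever mux or applet drives the
connection, so it can only be interpreted as an h2s after confirming the
connection is really using the H2 mux; the bogus h2s = 0x30 seen earlier is
what happens otherwise. A rough sketch of such a guard, with made-up types
(the real check in the updated patch may differ):

/* Made-up, reduced types for illustration only. */
struct mux_ops_x     { int dummy; };
struct connection_x  { const struct mux_ops_x *mux; };
struct conn_stream_x { struct connection_x *conn; void *ctx; };

static const struct mux_ops_x h2_ops_x;   /* stand-in for the H2 mux's h2_ops */

/* Only once the connection is known to be driven by the H2 mux is it safe to
 * interpret cs->ctx as a struct h2s and dereference its fields. */
int ctx_is_h2_stream(const struct conn_stream_x *cs)
{
	return cs && cs->conn && cs->conn->mux == &h2_ops_x && cs->ctx != NULL;
}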


--
Christopher Faulet
>From ba99e0eedf1730970f1d0b5bb67f24ef79117207 Mon Sep 17 00:00:00 2001
From: Christopher Faulet 
Date: Mon, 9 Nov 2020 14:37:57 +0100
Subject: [PATCH] EXP: try to spot where h2s->subs changes

---
 include/haproxy/bug.h |  7 ++
 src/flt_spoe.c|  8 +++
 src/mux_h2.c  | 25 
 src/stream.c  | 55 +++
 4 files changed, 95 insertions(+)

diff --git a/include/haproxy/bug.h b/include/haproxy/bug.h
index a008126f5c..c650f60b8c 100644
--- a/include/haproxy/bug.h
+++ b/include/haproxy/bug.h
@@ -166,6 +166,13 @@ struct mem_stats {
 })
 #endif /* DEBUG_MEM_STATS*/
 
+
+#define TEST_CS(ptr) do { extern void testcorrupt(const void *); testcorrupt(ptr); } while (0)
+
+#define TEST_SI(si) do { if ((si)) TEST_CS((si)->end); } while (0)
+
+#define TEST_STRM(s) do { if ((s)) { TEST_SI(&(s)->si[0]); TEST_SI(&(s)->si[1]);} } while (0)
+
 #endif /* _HAPROXY_BUG_H */
 
 /*
diff --git a/src/flt_spoe.c b/src/flt_spoe.c
index cf5fc7a4c0..6899b16e66 100644
--- a/src/flt_spoe.c
+++ b/src/flt_spoe.c
@@ -1255,6 +1255,7 @@ spoe_release_appctx(struct appctx *appctx)
 		spoe_update_stat_time(&ctx->stats.tv_wait, &ctx->stats.t_waiting);
 		ctx->state = SPOE_CTX_ST_ERROR;
 		ctx->status_code = (spoe_appctx->status_code + 0x100);
+		TEST_STRM(ctx->strm);
 		task_wakeup(ctx->strm->task, TASK_WOKEN_MSG);
 	}
 
@@ -1265,6 +1266,7 @@ spoe_release_appctx(struct appctx *appctx)
 		ctx->spoe_appctx = NULL;
 		ctx->state = SPOE_CTX_ST_ERROR;
 		ctx->status_code = (spoe_appctx->status_code + 0x100);
+		TEST_STRM(ctx->strm);
 		task_wakeup(ctx->strm->task, TASK_WOKEN_MSG);
 	}
 
@@ -1279,6 +1281,7 @@ spoe_release_appctx(struct appctx *appctx)
 		spoe_update_stat_time(&ctx->stats.tv_queue, &ctx->stats.t_queue);
 		ctx->state = SPOE_CTX_ST_ERROR;
 		ctx->status_code = (spoe_appctx->status_code + 0x100);
+		TEST_STRM(ctx->strm);
 		task_wakeup(ctx->strm->task, TASK_WOKEN_MSG);
 	}
 	list_for_each_entry_safe(ctx, back, &agent->rt[tid].waiting_queue, list) {
@@ -1288,6 +1291,7 @@ spoe_release_appctx(struct appctx *appctx)
 		spoe_update_stat_time(&ctx->stats.tv_wait, &ctx->stats.t_waiting);
 		ctx->state = SPOE_CTX_ST_ERROR;
 		ctx->status_code = (spoe_appctx->status_code + 0x100);
+		TEST_STRM(ctx->strm);
 		task_wakeup(ctx->strm->task, TASK_WOKEN_MSG);
 	}
 
@@ -1491,6 +1495,7 @@ spoe_handle_sending_frame_appctx(struct appctx *appctx, int *skip)
 			ctx->spoe_appctx = NULL;
 			ctx->state = SPOE_CTX_ST_ERROR;
 			ctx->status_code = (SPOE_APPCTX(appctx)->status_code + 0x100);
+			TEST_STRM(ctx->strm);
 			task_wakeup(ctx->strm->task, TASK_WOKEN_MSG);
 			*skip = 1;
 			break;
@@ -1524,6 +1529,7 @@ spoe_handle_sending_frame_appctx(struct appctx *appctx, int *skip)
 	SPOE_APPCTX(appctx)->frag_ctx.cursid = ctx->stream_id;
 	SPOE_APPCTX(appctx)->frag_ctx.curfid = ctx->frame_id;
 	ctx->state = SPOE_CTX_ST_ENCODING_MSGS;
+	TEST_STRM(ctx->strm);
 	task_wakeup(ctx->strm->task, TASK_WOKEN_MSG);
 	goto 

Re: [2.0.17] crash with coredump

2020-11-09 Thread Maciej Zdeb
I've played a little bit with the patch and it led me to the backend.c file and
the connect_server() function:

int connect_server(struct stream *s)
{
[...]
if (!conn_xprt_ready(srv_conn) && !srv_conn->mux) {
/* set the correct protocol on the output stream interface
*/
if (srv)
conn_prepare(srv_conn,
protocol_by_family(srv_conn->dst->ss_family), srv->xprt);
else if (obj_type(s->target) == OBJ_TYPE_PROXY) {
/* proxies exclusively run on raw_sock right now */
conn_prepare(srv_conn,
protocol_by_family(srv_conn->dst->ss_family), xprt_get(XPRT_RAW));
if (!(srv_conn->ctrl)) {
conn_free(srv_conn);
return SF_ERR_INTERNAL;
}
}
else {
conn_free(srv_conn);
return SF_ERR_INTERNAL;  /* how did we get there ?
*/
}
// THIS ONE IS OK
TEST_STRM(s);
//
srv_cs = si_alloc_cs(&s->si[1], srv_conn);
// FAIL
TEST_STRM(s);
//
if (!srv_cs) {
conn_free(srv_conn);
return SF_ERR_RESOURCE;
}
[...]
}

On Mon, 9 Nov 2020 at 11:51, Maciej Zdeb wrote:

> Hi,
>
> This time h2s = 0x30 ;)
>
> it crashed here:
> void testcorrupt(void *ptr)
> {
> [...]
> if (h2s->cs != cs)
> return;
> [...]
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x556b617f0562 in testcorrupt (ptr=0x7f99741d85a0) at
> src/mux_h2.c:6228
> 6228 src/mux_h2.c: No such file or directory.
> [Current thread is 1 (Thread 0x7f99a484d700 (LWP 28658))]
> (gdb) bt full
> #0  0x556b617f0562 in testcorrupt (ptr=0x7f99741d85a0) at
> src/mux_h2.c:6228
> cs = 0x7f99741d85a0
> h2s = 0x30
> #1  0x556b61850b1a in process_stream (t=0x7f99741d8c60,
> context=0x7f99682cd7b0, state=1284) at src/stream.c:2147
> srv = 0x556b622770e0
> s = 0x7f99682cd7b0
> sess = 0x7f9998057170
> rqf_last = 9469954
> rpf_last = 2151677952
> rq_prod_last = 8
> rq_cons_last = 0
> rp_cons_last = 8
> rp_prod_last = 0
> req_ana_back = 0
> req = 0x7f99682cd7c0
> res = 0x7f99682cd820
> si_f = 0x7f99682cdae8
> si_b = 0x7f99682cdb40
> rate = 1
> #2  0x556b61962a5f in run_tasks_from_list (list=0x556b61db1600
> , max=150) at src/task.c:371
> process = 0x556b6184d8e6 
> t = 0x7f99741d8c60
> state = 1284
> ctx = 0x7f99682cd7b0
> done = 2
> [...]
>
>
On Fri, 6 Nov 2020 at 20:00, Willy Tarreau wrote:
>
>> Maciej,
>>
>> I wrote this ugly patch to try to crash as soon as possible when a corrupt
>> h2s->subs is detected. The patch was written for 2.2. I only instrumented
>> roughly 30 places in process_stream() which is a fairly likely candidate.
>> I just hope it happens within the context of the stream itself otherwise
>> it will become really painful.
>>
>> You can apply this patch on top of your existing changes. It will try to
>> detect the presence of a non-zero lowest bit in the subs pointer (which
>> should never happen). If we're lucky it will crash inside process_stream()
>> between two points and we'll be able to narrow it down. If we're unlucky
>> it will crash when entering it and that will not be fun.
>>
>> If you want to play with it, you can apply TEST_SI() on stream_interface
>> pointers (often called "si"), TEST_STRM() on stream pointers, and
>> TEST_CS()
>> on conn_stream pointers (often called "cs").
>>
>> Please just let me know how it goes. Note, I tested it, it passes all
>> regtests for me so I'm reasonably confident it should not crash by
>> accident. But I can't be sure, I'm just using heuristics, so please do
>> not put it in sensitive production!
>>
>> Thanks,
>> Willy
>>
>


Re: [2.0.17] crash with coredump

2020-11-09 Thread Maciej Zdeb
Hi,

This time h2s = 0x30 ;)

it crashed here:
void testcorrupt(void *ptr)
{
[...]
if (h2s->cs != cs)
return;
[...]

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x556b617f0562 in testcorrupt (ptr=0x7f99741d85a0) at
src/mux_h2.c:6228
6228 src/mux_h2.c: No such file or directory.
[Current thread is 1 (Thread 0x7f99a484d700 (LWP 28658))]
(gdb) bt full
#0  0x556b617f0562 in testcorrupt (ptr=0x7f99741d85a0) at
src/mux_h2.c:6228
cs = 0x7f99741d85a0
h2s = 0x30
#1  0x556b61850b1a in process_stream (t=0x7f99741d8c60,
context=0x7f99682cd7b0, state=1284) at src/stream.c:2147
srv = 0x556b622770e0
s = 0x7f99682cd7b0
sess = 0x7f9998057170
rqf_last = 9469954
rpf_last = 2151677952
rq_prod_last = 8
rq_cons_last = 0
rp_cons_last = 8
rp_prod_last = 0
req_ana_back = 0
req = 0x7f99682cd7c0
res = 0x7f99682cd820
si_f = 0x7f99682cdae8
si_b = 0x7f99682cdb40
rate = 1
#2  0x556b61962a5f in run_tasks_from_list (list=0x556b61db1600
, max=150) at src/task.c:371
process = 0x556b6184d8e6 
t = 0x7f99741d8c60
state = 1284
ctx = 0x7f99682cd7b0
done = 2
[...]


On Fri, 6 Nov 2020 at 20:00, Willy Tarreau wrote:

> Maciej,
>
> I wrote this ugly patch to try to crash as soon as possible when a corrupt
> h2s->subs is detected. The patch was written for 2.2. I only instrumented
> roughly 30 places in process_stream() which is a fairly likely candidate.
> I just hope it happens within the context of the stream itself otherwise
> it will become really painful.
>
> You can apply this patch on top of your existing changes. It will try to
> detect the presence of a non-zero lowest bit in the subs pointer (which
> should never happen). If we're lucky it will crash inside process_stream()
> between two points and we'll be able to narrow it down. If we're unlucky
> it will crash when entering it and that will not be fun.
>
> If you want to play with it, you can apply TEST_SI() on stream_interface
> pointers (often called "si"), TEST_STRM() on stream pointers, and TEST_CS()
> on conn_stream pointers (often called "cs").
>
> Please just let me know how it goes. Note, I tested it, it passes all
> regtests for me so I'm reasonably confident it should not crash by
> accident. But I can't be sure, I'm just using heuristics, so please do
> not put it in sensitive production!
>
> Thanks,
> Willy
>


Re: [2.0.17] crash with coredump

2020-11-06 Thread Willy Tarreau
Maciej,

I wrote this ugly patch to try to crash as soon as possible when a corrupt
h2s->subs is detected. The patch was written for 2.2. I only instrumented
roughly 30 places in process_stream() which is a fairly likely candidate.
I just hope it happens within the context of the stream itself otherwise
it will become really painful.

You can apply this patch on top of your existing changes. It will try to
detect the presence of a non-zero lowest bit in the subs pointer (which
should never happen). If we're lucky it will crash inside process_stream()
between two points and we'll be able to narrow it down. If we're unlucky
it will crash when entering it and that will not be fun.

If you want to play with it, you can apply TEST_SI() on stream_interface
pointers (often called "si"), TEST_STRM() on stream pointers, and TEST_CS()
on conn_stream pointers (often called "cs").

Please just let me know how it goes. Note, I tested it, it passes all
regtests for me so I'm reasonably confident it should not crash by
accident. But I can't be sure, I'm just using heuristics, so please do
not put it in sensitive production!

Thanks,
Willy
>From b7638769b3ee38a23bf319df5338c0ba46d9f57e Mon Sep 17 00:00:00 2001
From: Willy Tarreau 
Date: Fri, 6 Nov 2020 19:54:01 +0100
Subject: EXP: try to spot where h2s->subs changes

---
 include/haproxy/bug.h |  7 +++
 src/mux_h2.c  | 22 +
 src/stream.c  | 54 +++
 3 files changed, 83 insertions(+)

diff --git a/include/haproxy/bug.h b/include/haproxy/bug.h
index a008126..c650f60 100644
--- a/include/haproxy/bug.h
+++ b/include/haproxy/bug.h
@@ -166,6 +166,13 @@ struct mem_stats {
 })
 #endif /* DEBUG_MEM_STATS*/
 
+
+#define TEST_CS(ptr) do { extern void testcorrupt(const void *); testcorrupt(ptr); } while (0)
+
+#define TEST_SI(si) do { if ((si)) TEST_CS((si)->end); } while (0)
+
+#define TEST_STRM(s) do { if ((s)) { TEST_SI(&(s)->si[0]); TEST_SI(&(s)->si[1]);} } while (0)
+
 #endif /* _HAPROXY_BUG_H */
 
 /*
diff --git a/src/mux_h2.c b/src/mux_h2.c
index 5830fdb..6b5a649 100644
--- a/src/mux_h2.c
+++ b/src/mux_h2.c
@@ -6251,3 +6251,25 @@ static int init_h2()
 }
 
 REGISTER_POST_CHECK(init_h2);
+
+void testcorrupt(void *ptr)
+{
+   const struct conn_stream *cs = objt_cs(ptr);
+   const struct h2s *h2s;
+
+   if (!cs)
+   return;
+
+   h2s = cs->ctx;
+   if (!h2s)
+   return;
+
+   if (h2s->cs != cs)
+   return;
+
+   if (!h2s->h2c || !h2s->h2c->conn || h2s->h2c->conn->mux != &h2_ops)
+   return;
+
+   if ((long)h2s->subs & 1)
+   ABORT_NOW();
+}
diff --git a/src/stream.c b/src/stream.c
index 43f1432..6646d1a 100644
--- a/src/stream.c
+++ b/src/stream.c
@@ -531,6 +531,7 @@ struct stream *stream_new(struct session *sess, enum obj_type *origin)
 * the caller must handle the task_wakeup
 */
DBG_TRACE_LEAVE(STRM_EV_STRM_NEW, s);
+   TEST_STRM(s);
return s;
 
/* Error unrolling */
@@ -542,6 +543,7 @@ struct stream *stream_new(struct session *sess, enum obj_type *origin)
 out_fail_alloc_si1:
tasklet_free(s->si[0].wait_event.tasklet);
  out_fail_alloc:
+   TEST_STRM(s);
pool_free(pool_head_stream, s);
DBG_TRACE_DEVEL("leaving on error", STRM_EV_STRM_NEW|STRM_EV_STRM_ERR);
return NULL;
@@ -1497,6 +1499,8 @@ struct task *process_stream(struct task *t, void *context, unsigned short state)
struct stream_interface *si_f, *si_b;
unsigned int rate;
 
+   TEST_STRM(s);
+
DBG_TRACE_ENTER(STRM_EV_STRM_PROC, s);
 
activity[tid].stream_calls++;
@@ -1594,6 +1598,8 @@ struct task *process_stream(struct task *t, void *context, unsigned short state)
}
 
  resync_stream_interface:
+   TEST_STRM(s);
+
/* below we may emit error messages so we have to ensure that we have
 * our buffers properly allocated.
 */
@@ -1658,6 +1664,8 @@ struct task *process_stream(struct task *t, void *context, unsigned short state)
/* note: maybe we should process connection errors here ? */
}
 
+   TEST_STRM(s);
+
if (si_state_in(si_b->state, SI_SB_CON|SI_SB_RDY)) {
/* we were trying to establish a connection on the server side,
 * maybe it succeeded, maybe it failed, maybe we timed out, ...
@@ -1677,6 +1685,8 @@ struct task *process_stream(struct task *t, void *context, unsigned short state)
 * SI_ST_ASS/SI_ST_TAR/SI_ST_REQ for retryable errors.
 */
}
+   TEST_STRM(s);
+
 
rq_prod_last = si_f->state;
rq_cons_last = si_b->state;
@@ -1707,12 +1717,16 @@ struct task *process_stream(struct task *t, void *context, unsigned short state)
}
}
 
+   TEST_STRM(s);
+
/*
 * Note: of the transient states (REQ, CER, DIS), only REQ may remain

Re: [2.0.17] crash with coredump

2020-11-06 Thread Willy Tarreau
Hi Kirill,

On Fri, Nov 06, 2020 at 06:41:03PM +0100, Kirill A. Korinsky wrote:
> Hey,
> 
> I'm wondering, is it related to this code:
> 
> +   /* some tasks may have woken other ones up */
> +   if (max_processed && thread_has_tasks())
> +   goto not_done_yet;
> +
(...)
> as far as I understand it should be safe to remove (with not_done_yet label).
> 
> Can you try it?

It's indeed absolutely safe to remove, but it will not tell us anything
unfortunately. If the problem disappears or appears with/without it, it
will just further confirm that the problem has likely been there for
even longer and is sensitive to the sequencing.

I really wish I could have a way to reproduce it. I'd instrument the code to
crash as soon as we'd detect the corruption, and try to narrow down the area
where it happens till we find the offending code.

If someone else faces the same issue and figures a reliable way to reproduce
it, please suggest!

Cheers,
Willy



Re: [2.0.17] crash with coredump

2020-11-06 Thread Kirill A. Korinsky
Hey,

I'm wondering, is it related to this code:

+   /* some tasks may have woken other ones up */
+   if (max_processed && thread_has_tasks())
+   goto not_done_yet;
+

from 
http://git.haproxy.org/?p=haproxy-2.2.git;a=blobdiff;f=src/task.c;h=500223f185bf324c0adb34a42ec0244e638ce63e;hp=1a7f44d9169e0a01d42ba13d8d335102aa43577b;hb=5c8be272c732e4f42ccd6b3d65f25aa7425a2aba;hpb=77015abe0bcfde67bff519b1d48393a513015f77
 


as far as I understand it should be safe to remove (with not_done_yet label).

Can you try it?

--
wbr, Kirill

> On 3. Nov 2020, at 15:15, Maciej Zdeb  wrote:
> 
> I modified h2s struct in 2.2 branch with HEAD set to 
> f96508aae6b49277dcf142caa35042678cf8e2ca "MEDIUM: mux-h2: merge recv_wait and 
> send_wait event notifications" like below (subs is in exact place of removed 
> wait_event):
> 
> struct h2s {
> [...]
> struct tasklet *dummy0;
> struct wait_event *dummy1;
> struct wait_event *subs;  /* recv wait_event the conn_stream 
> associated is waiting on (via h2_subscribe) */
> struct list list; /* To be used when adding in h2c->send_list or 
> h2c->fctl_lsit */
> struct tasklet *shut_tl;  /* deferred shutdown tasklet, to retry to 
> send an RST after we failed to,
>* in case there's no other subscription to 
> do it */
> }
> 
> it crashed like before with subs = 0x:
> 
> (gdb) p *(struct h2s*)(0x7fde7459e9b0)
> $1 = {cs = 0x7fde5c02d260, sess = 0x5628283bc740 , h2c = 
> 0x5628295cbb80, h1m = {state = H1_MSG_RPBEFORE, flags = 12, curr_len = 0,
> body_len = 0, next = 0, err_pos = -1, err_state = 0}, by_id = {node = 
> {branches = {b = {0x0, 0x7fde3c2c6c60}}, node_p = 0x0,
>   leaf_p = 0x5628295cc018, bit = -5624, pfx = 29785}, key = 11}, id = 11, 
> flags = 28673, sws = -4060, errcode = H2_ERR_NO_ERROR, st = H2_SS_HREM,
>   status = 200, body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0, head = 
> 0}, dummy0 = 0x0, dummy1 = 0x0, subs = 0x, list = {
> n = 0x7fde7459ea68, p = 0x7fde7459ea68}, shut_tl = 0x5628297eeaf0}
> 
> it crashes like above until commit: 
> http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=5c8be272c732e4f42ccd6b3d65f25aa7425a2aba
> which alters tasks processing.
> 
> 
> On Mon, 2 Nov 2020 at 15:46, Maciej Zdeb wrote:
> I'm wondering, the corrupted address was always at "wait_event" in h2s 
> struct, after its removal in: 
> http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
> crashes went away.
> 
> But with the above patch and after altering h2s struct into:
> struct h2s {
> [...]
> struct tasklet *shut_tl;
> struct wait_event *recv_wait; /* recv wait_event the conn_stream 
> associated is waiting on (via h2_subscribe) */
> struct wait_event *send_wait; /* send wait_event the conn_stream 
> associated is waiting on (via h2_subscribe) */
> struct list list; /* To be used when adding in h2c->send_list or 
> h2c->fctl_lsit */
> };
> 
> the crash returned.
> 
> However after recv_wait and send_wait were merged in: 
> http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=f96508aae6b49277dcf142caa35042678cf8e2ca
> crashes went away again.
> 
> In my opinion shut_tl should be corrupted again, but it is not. Maybe the 
> last patch fixed it?
> 
> On Mon, 2 Nov 2020 at 15:37, Kirill A. Korinsky wrote:
> Maciej,
> 
> Looks like memory corruption is still here but it corrupt just some another 
> place.
> 
> Willy do you agree?
> 
> --
> wbr, Kirill
> 
>> On 2. Nov 2020, at 15:34, Maciej Zdeb wrote:
>> 
>> So after Kirill suggestion to modify h2s struct in a way that tasklet 
>> "shut_tl" is before recv_wait I verified if in 2.2.4 the same crash will 
>> occur, and it did not!
>> 
>> After the patch that merges recv_wait and send_wait: 
>> http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=f96508aae6b49277dcf142caa35042678cf8e2ca
>> and with such an h2s (tasklet shut_tl before wait_event subs) the crashes are 
>> gone:
>> 
>> struct h2s {
>> [...]
>> struct buffer rxbuf; /* receive buffer, always valid (buf_empty or 

Re: [2.0.17] crash with coredump

2020-11-03 Thread Maciej Zdeb
I modified h2s struct in 2.2 branch with HEAD set to
f96508aae6b49277dcf142caa35042678cf8e2ca "MEDIUM: mux-h2: merge recv_wait
and send_wait event notifications" like below (subs is in exact place of
removed wait_event):

struct h2s {
[...]
struct tasklet *dummy0;
struct wait_event *dummy1;
struct wait_event *subs;  /* recv wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
struct list list; /* To be used when adding in h2c->send_list or
h2c->fctl_lsit */
struct tasklet *shut_tl;  /* deferred shutdown tasklet, to retry to
send an RST after we failed to,
   * in case there's no other subscription
to do it */
}

it crashed like before with subs = 0x:

(gdb) p *(struct h2s*)(0x7fde7459e9b0)
$1 = {cs = 0x7fde5c02d260, sess = 0x5628283bc740 , h2c =
0x5628295cbb80, h1m = {state = H1_MSG_RPBEFORE, flags = 12, curr_len = 0,
body_len = 0, next = 0, err_pos = -1, err_state = 0}, by_id = {node =
{branches = {b = {0x0, 0x7fde3c2c6c60}}, node_p = 0x0,
  leaf_p = 0x5628295cc018, bit = -5624, pfx = 29785}, key = 11}, id =
11, flags = 28673, sws = -4060, errcode = H2_ERR_NO_ERROR, st = H2_SS_HREM,
  status = 200, body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0, head
= 0}, dummy0 = 0x0, dummy1 = 0x0, subs = 0x, list = {
n = 0x7fde7459ea68, p = 0x7fde7459ea68}, shut_tl = 0x5628297eeaf0}

it crashes like above until commit:
http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=5c8be272c732e4f42ccd6b3d65f25aa7425a2aba
which alters tasks processing.


On Mon, 2 Nov 2020 at 15:46, Maciej Zdeb wrote:

> I'm wondering, the corrupted address was always at "wait_event" in h2s
> struct, after its removal in:
> http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
> crashes went away.
>
> But with the above patch and after altering h2s struct into:
> struct h2s {
> [...]
> struct tasklet *shut_tl;
> struct wait_event *recv_wait; /* recv wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> struct wait_event *send_wait; /* send wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> struct list list; /* To be used when adding in h2c->send_list or
> h2c->fctl_lsit */
> };
>
> the crash returned.
>
> However after recv_wait and send_wait were merged in:
> http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=f96508aae6b49277dcf142caa35042678cf8e2ca
> crashes went away again.
>
> In my opinion shut_tl should be corrupted again, but it is not. Maybe the
> last patch fixed it?
>
> On Mon, 2 Nov 2020 at 15:37, Kirill A. Korinsky wrote:
>
>> Maciej,
>>
>> Looks like memory corruption is still here but it corrupt just some
>> another place.
>>
>> Willy do you agree?
>>
>> --
>> wbr, Kirill
>>
>> On 2. Nov 2020, at 15:34, Maciej Zdeb  wrote:
>>
>> So after Kirill suggestion to modify h2s struct in a way that tasklet
>> "shut_tl" is before recv_wait I verified if in 2.2.4 the same crash will
>> occur, and it did not!
>>
>> After the patch that merges recv_wait and send_wait:
>> http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=f96508aae6b49277dcf142caa35042678cf8e2ca
>> and with such an h2s (tasklet shut_tl before wait_event subs) the crashes
>> are gone:
>>
>> struct h2s {
>> [...]
>> struct buffer rxbuf; /* receive buffer, always valid (buf_empty
>> or real buffer) */
>> struct tasklet *shut_tl;  /* deferred shutdown tasklet, to retry
>> to send an RST after we failed to,
>>* in case there's no other
>> subscription to do it */
>> struct wait_event *subs;  /* recv wait_event the conn_stream
>> associated is waiting on (via h2_subscribe) */
>> struct list list; /* To be used when adding in h2c->send_list or
>> h2c->fctl_lsit */
>> };
>>
>>
>>
>> On Mon, 2 Nov 2020 at 12:42, Maciej Zdeb wrote:
>>
>>> Great idea Kirill,
>>>
>>> With such modification:
>>>
>>> struct h2s {
>>> [...]
>>> struct tasklet *shut_tl;
>>> struct wait_event *recv_wait; /* recv wait_event the conn_stream
>>> associated is waiting on (via h2_subscribe) */
>>> struct wait_event *send_wait; /* send wait_event the conn_stream
>>> associated is waiting on (via h2_subscribe) */
>>> struct list list; /* To be used when adding in h2c->send_list or
>>> h2c->fctl_lsit */
>>> };
>>>
>>> it crashed just like before.
>>>
>>> On Mon, 2 Nov 2020 at 11:12, Kirill A. Korinsky wrote:
>>>
 Hi,

 Thanks for update.

 After reading Willy's recommendation and the commit that fixed the
 issue, I'm curious: can you "edit" this commit a bit and move `shut_tl`
 before `recv_wait` instead of the removed `wait_event`?

 It is quite a dumb way to confirm that the memory corruption is gone, and
 not just moved somewhere else.

Re: [2.0.17] crash with coredump

2020-11-02 Thread Maciej Zdeb
I'm wondering: the corrupted address was always at "wait_event" in the h2s
struct, and after its removal in
http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
the crashes went away.

But with the above patch and after altering h2s struct into:
struct h2s {
[...]
struct tasklet *shut_tl;
struct wait_event *recv_wait; /* recv wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
struct wait_event *send_wait; /* send wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
struct list list; /* To be used when adding in h2c->send_list or
h2c->fctl_lsit */
};

the crash returned.

However after recv_wait and send_wait were merged in:
http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=f96508aae6b49277dcf142caa35042678cf8e2ca
crashes went away again.

In my opinion shut_tl should be corrupted again, but it is not. Maybe the
last patch fixed it?
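
To illustrate the reasoning behind these field-shuffling experiments with toy
layouts (hypothetical, nothing like the real h2s): if the stray write lands at
a fixed byte offset inside the allocation, whichever member happens to sit at
that offset is the one that shows up as corrupted, so moving or padding fields
moves (or hides) the symptom without fixing the cause.

#include <stdio.h>
#include <stddef.h>

/* Toy layouts only: the member sitting at the offset hit by the stray
 * write is the one that appears corrupted in the core dump. */
struct layout_old {                /* wait_event embedded, as in 2.0/2.1 */
	char  earlier_fields[64];
	void *wait_event_tasklet;  /* offset 64: the field that shows up corrupted */
	void *recv_wait;
	void *send_wait;
};

struct layout_new {                /* field moved, as in the 2.2 experiments */
	char  earlier_fields[64];
	void *shut_tl;             /* offset 64: this one would take the hit instead */
	void *recv_wait;
	void *send_wait;
};

int main(void)
{
	printf("old: wait_event at offset %zu\n", offsetof(struct layout_old, wait_event_tasklet));
	printf("new: shut_tl    at offset %zu\n", offsetof(struct layout_new, shut_tl));
	return 0;
}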

On Mon, 2 Nov 2020 at 15:37, Kirill A. Korinsky wrote:

> Maciej,
>
> Looks like memory corruption is still here but it corrupt just some
> another place.
>
> Willy do you agree?
>
> --
> wbr, Kirill
>
> On 2. Nov 2020, at 15:34, Maciej Zdeb  wrote:
>
> So after Kirill suggestion to modify h2s struct in a way that tasklet
> "shut_tl" is before recv_wait I verified if in 2.2.4 the same crash will
> occur, and it did not!
>
> After the patch that merges recv_wait and send_wait:
> http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=f96508aae6b49277dcf142caa35042678cf8e2ca
> and with such an h2s (tasklet shut_tl before wait_event subs) the crashes
> are gone:
>
> struct h2s {
> [...]
> struct buffer rxbuf; /* receive buffer, always valid (buf_empty or
> real buffer) */
> struct tasklet *shut_tl;  /* deferred shutdown tasklet, to retry
> to send an RST after we failed to,
>* in case there's no other subscription
> to do it */
> struct wait_event *subs;  /* recv wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> struct list list; /* To be used when adding in h2c->send_list or
> h2c->fctl_lsit */
> };
>
>
>
> On Mon, 2 Nov 2020 at 12:42, Maciej Zdeb wrote:
>
>> Great idea Kirill,
>>
>> With such modification:
>>
>> struct h2s {
>> [...]
>> struct tasklet *shut_tl;
>> struct wait_event *recv_wait; /* recv wait_event the conn_stream
>> associated is waiting on (via h2_subscribe) */
>> struct wait_event *send_wait; /* send wait_event the conn_stream
>> associated is waiting on (via h2_subscribe) */
>> struct list list; /* To be used when adding in h2c->send_list or
>> h2c->fctl_lsit */
>> };
>>
>> it crashed just like before.
>>
>> On Mon, 2 Nov 2020 at 11:12, Kirill A. Korinsky wrote:
>>
>>> Hi,
>>>
>>> Thanks for update.
>>>
>>> After reading Willy's recommendation and the commit that fixed the
>>> issue, I'm curious: can you "edit" this commit a bit and move `shut_tl`
>>> before `recv_wait` instead of the removed `wait_event`?
>>>
>>> It is quite a dumb way to confirm that the memory corruption is gone, and
>>> not just moved somewhere else.
>>>
>>> --
>>> wbr, Kirill
>>>
>>> On 2. Nov 2020, at 10:58, Maciej Zdeb  wrote:
>>>
>>> Hi,
>>>
>>> Update for people on the list that might be interested in the issue,
>>> because part of discussion was private.
>>>
>>> I wanted to check Willy suggestion and modified h2s struct (added dummy
>>> fields):
>>>
>>> struct h2s {
>>> [...]
>>> uint16_t status; /* HTTP response status */
>>> unsigned long long body_len; /* remaining body length according
>>> to content-length if H2_SF_DATA_CLEN */
>>> struct buffer rxbuf; /* receive buffer, always valid (buf_empty
>>> or real buffer) */
>>> int dummy0;
>>> struct wait_event wait_event; /* Wait list, when we're
>>> attempting to send a RST but we can't send */
>>> int dummy1;
>>> struct wait_event *recv_wait; /* recv wait_event the conn_stream
>>> associated is waiting on (via h2_subscribe) */
>>> int dummy2;
>>> struct wait_event *send_wait; /* send wait_event the conn_stream
>>> associated is waiting on (via h2_subscribe) */
>>> int dummy3;
>>> struct list list; /* To be used when adding in h2c->send_list or
>>> h2c->fctl_lsit */
>>> struct list sending_list; /* To be used when adding in
>>> h2c->sending_list */
>>> };
>>>
>>> With such modified h2s struct, the crash did not occur.
>>>
>>> I've checked HAProxy 2.1, it crashes like 2.0.
>>>
>>> I've also checked 2.2, bisection showed that this commit:
>>> http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
>>> fixed the crashes we experienced. I'm not sure how/if it fixed the memory
>>> corruption, it is possible that memory is still corrupted 

Re: [2.0.17] crash with coredump

2020-11-02 Thread Kirill A. Korinsky
Maciej,

Looks like the memory corruption is still here, but it just corrupts some
other place.

Willy do you agree?

--
wbr, Kirill

> On 2. Nov 2020, at 15:34, Maciej Zdeb  wrote:
> 
> So after Kirill suggestion to modify h2s struct in a way that tasklet 
> "shut_tl" is before recv_wait I verified if in 2.2.4 the same crash will 
> occur, and it did not!
> 
> After the patch that merges recv_wait and send_wait: 
> http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=f96508aae6b49277dcf142caa35042678cf8e2ca
> and with such an h2s (tasklet shut_tl before wait_event subs) the crashes are 
> gone:
> 
> struct h2s {
> [...]
> struct buffer rxbuf; /* receive buffer, always valid (buf_empty or 
> real buffer) */
> struct tasklet *shut_tl;  /* deferred shutdown tasklet, to retry to 
> send an RST after we failed to,
>* in case there's no other subscription to 
> do it */
> struct wait_event *subs;  /* recv wait_event the conn_stream 
> associated is waiting on (via h2_subscribe) */
> struct list list; /* To be used when adding in h2c->send_list or 
> h2c->fctl_lsit */
> };
> 
> 
> 
> On Mon, 2 Nov 2020 at 12:42, Maciej Zdeb wrote:
> Great idea Kirill,
> 
> With such modification:
> 
> struct h2s {
> [...]
> struct tasklet *shut_tl;
> struct wait_event *recv_wait; /* recv wait_event the conn_stream 
> associated is waiting on (via h2_subscribe) */
> struct wait_event *send_wait; /* send wait_event the conn_stream 
> associated is waiting on (via h2_subscribe) */
> struct list list; /* To be used when adding in h2c->send_list or 
> h2c->fctl_lsit */
> };
> 
> it crashed just like before.
> 
> On Mon, 2 Nov 2020 at 11:12, Kirill A. Korinsky wrote:
> Hi,
> 
> Thanks for update.
> 
> After reading Willy's recommendation and the commit that fixed the issue, I'm 
> curious: can you "edit" this commit a bit and move `shut_tl` before 
> `recv_wait` instead of the removed `wait_event`?
> 
> It is quite a dumb way to confirm that the memory corruption is gone, and not 
> just moved somewhere else.
> 
> --
> wbr, Kirill
> 
>> On 2. Nov 2020, at 10:58, Maciej Zdeb wrote:
>> 
>> Hi,
>> 
>> Update for people on the list that might be interested in the issue, because 
>> part of discussion was private.
>> 
>> I wanted to check Willy suggestion and modified h2s struct (added dummy 
>> fields):
>> 
>> struct h2s {
>> [...]
>> uint16_t status; /* HTTP response status */
>> unsigned long long body_len; /* remaining body length according to 
>> content-length if H2_SF_DATA_CLEN */
>> struct buffer rxbuf; /* receive buffer, always valid (buf_empty or 
>> real buffer) */
>> int dummy0;
>> struct wait_event wait_event; /* Wait list, when we're attempting to 
>> send a RST but we can't send */
>> int dummy1;
>> struct wait_event *recv_wait; /* recv wait_event the conn_stream 
>> associated is waiting on (via h2_subscribe) */
>> int dummy2;
>> struct wait_event *send_wait; /* send wait_event the conn_stream 
>> associated is waiting on (via h2_subscribe) */
>> int dummy3;
>> struct list list; /* To be used when adding in h2c->send_list or 
>> h2c->fctl_lsit */
>> struct list sending_list; /* To be used when adding in 
>> h2c->sending_list */
>> };
>> 
>> With such modified h2s struct, the crash did not occur.
>> 
>> I've checked HAProxy 2.1, it crashes like 2.0.
>> 
>> I've also checked 2.2, bisection showed that this commit: 
>> http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
>>  
>> 
>>  fixed the crashes we experienced. I'm not sure how/if it fixed the memory 
>> corruption, it is possible that memory is still corrupted but not causing 
>> the crash.
>> 
>> 
>> 
>> Fri, 25 Sep 2020 at 16:25 Kirill A. Korinsky wrote:
>> Very interesting.
>> 
>> Anyway, I can see that this piece of code was refactored some time ago: 
>> https://github.com/haproxy/haproxy/commit/f96508aae6b49277dcf142caa35042678cf8e2ca
>>  
>> 
>> 
>> Maybe it is worth to try 2.2 or 2.3 branch?
>> 
>> Yes, it is a blind shot and just a guess.
>> 
>> --
>> wbr, Kirill
>> 
>>> On 25. Sep 2020, at 16:01, Maciej Zdeb >> > wrote:
>>> 
>>> Yes at the same place with same value:
>>> 
>>> (gdb) bt full
>>> #0  0x559ce98b964b in h2s_notify_recv (h2s=0x559cef94e7e0) at 
>>> src/mux_h2.c:783
>>> sw = 

Re: [2.0.17] crash with coredump

2020-11-02 Thread Maciej Zdeb
So after Kirill's suggestion to modify the h2s struct so that the tasklet
"shut_tl" is before recv_wait, I verified whether the same crash would occur
in 2.2.4, and it did not!

After the patch that merges recv_wait and send_wait:
http://git.haproxy.org/?p=haproxy-2.2.git;a=commit;h=f96508aae6b49277dcf142caa35042678cf8e2ca
and with such an h2s (tasklet shut_tl before wait_event subs) the crashes are
gone:

struct h2s {
[...]
struct buffer rxbuf; /* receive buffer, always valid (buf_empty or
real buffer) */
struct tasklet *shut_tl;  /* deferred shutdown tasklet, to retry to
send an RST after we failed to,
   * in case there's no other subscription
to do it */
struct wait_event *subs;  /* recv wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
struct list list; /* To be used when adding in h2c->send_list or
h2c->fctl_lsit */
};



Mon, 2 Nov 2020 at 12:42 Maciej Zdeb wrote:

> Great idea Kirill,
>
> With such modification:
>
> struct h2s {
> [...]
> struct tasklet *shut_tl;
> struct wait_event *recv_wait; /* recv wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> struct wait_event *send_wait; /* send wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> struct list list; /* To be used when adding in h2c->send_list or
> h2c->fctl_lsit */
> };
>
> it crashed just like before.
>
> Mon, 2 Nov 2020 at 11:12 Kirill A. Korinsky wrote:
>
>> Hi,
>>
>> Thanks for update.
>>
>> After reading Willy's recommendation and the commit that fixed the issue,
>> I'm curious: can you "edit" this commit a bit and move `shut_tl` before
>> `recv_wait` instead of the removed `wait_event`?
>>
>> It is quite a dumb way to confirm that the memory corruption is gone, and
>> not just moved somewhere else.
>>
>> --
>> wbr, Kirill
>>
>> On 2. Nov 2020, at 10:58, Maciej Zdeb  wrote:
>>
>> Hi,
>>
>> Update for people on the list that might be interested in the issue,
>> because part of discussion was private.
>>
>> I wanted to check Willy suggestion and modified h2s struct (added dummy
>> fields):
>>
>> struct h2s {
>> [...]
>> uint16_t status; /* HTTP response status */
>> unsigned long long body_len; /* remaining body length according
>> to content-length if H2_SF_DATA_CLEN */
>> struct buffer rxbuf; /* receive buffer, always valid (buf_empty
>> or real buffer) */
>> int dummy0;
>> struct wait_event wait_event; /* Wait list, when we're attempting
>> to send a RST but we can't send */
>> int dummy1;
>> struct wait_event *recv_wait; /* recv wait_event the conn_stream
>> associated is waiting on (via h2_subscribe) */
>> int dummy2;
>> struct wait_event *send_wait; /* send wait_event the conn_stream
>> associated is waiting on (via h2_subscribe) */
>> int dummy3;
>> struct list list; /* To be used when adding in h2c->send_list or
>> h2c->fctl_lsit */
>> struct list sending_list; /* To be used when adding in
>> h2c->sending_list */
>> };
>>
>> With such modified h2s struct, the crash did not occur.
>>
>> I've checked HAProxy 2.1, it crashes like 2.0.
>>
>> I've also checked 2.2, bisection showed that this commit:
>> http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
>> fixed the crashes we experienced. I'm not sure how/if it fixed the memory
>> corruption, it is possible that memory is still corrupted but not causing
>> the crash.
>>
>>
>>
>> Fri, 25 Sep 2020 at 16:25 Kirill A. Korinsky wrote:
>>
>>> Very interesting.
>>>
>>> Anyway, I can see that this piece of code was refactored some time ago:
>>> https://github.com/haproxy/haproxy/commit/f96508aae6b49277dcf142caa35042678cf8e2ca
>>>
>>> Maybe it is worth to try 2.2 or 2.3 branch?
>>>
>>> Yes, it is a blind shot and just a guess.
>>>
>>> --
>>> wbr, Kirill
>>>
>>> On 25. Sep 2020, at 16:01, Maciej Zdeb  wrote:
>>>
>>> Yes at the same place with same value:
>>>
>>> (gdb) bt full
>>> #0  0x559ce98b964b in h2s_notify_recv (h2s=0x559cef94e7e0) at
>>> src/mux_h2.c:783
>>> sw = 0x
>>>
>>>
>>>
>>> Fri, 25 Sep 2020 at 15:42 Kirill A. Korinsky wrote:
>>>
 > On 25. Sep 2020, at 15:26, Maciej Zdeb  wrote:
 >
 > I was mailing outside the list with Willy and Christopher but it's
 worth sharing that the problem occurs even with nbthread = 1. I've managed
 to confirm it today.


 I'm curious is it crashed at the same place with the same value?

 --
 wbr, Kirill



>>>
>>


Re: [2.0.17] crash with coredump

2020-11-02 Thread Maciej Zdeb
Great idea Kirill,

With such modification:

struct h2s {
[...]
struct tasklet *shut_tl;
struct wait_event *recv_wait; /* recv wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
struct wait_event *send_wait; /* send wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
struct list list; /* To be used when adding in h2c->send_list or
h2c->fctl_lsit */
};

it crashed just like before.

Mon, 2 Nov 2020 at 11:12 Kirill A. Korinsky wrote:

> Hi,
>
> Thanks for update.
>
> After reading Willy's recommendation and the commit that fixed the issue,
> I'm curious: can you "edit" this commit a bit and move `shut_tl` before
> `recv_wait` instead of the removed `wait_event`?
>
> It is quite a dumb way to confirm that the memory corruption is gone, and
> not just moved somewhere else.
>
> --
> wbr, Kirill
>
> On 2. Nov 2020, at 10:58, Maciej Zdeb  wrote:
>
> Hi,
>
> Update for people on the list that might be interested in the issue,
> because part of discussion was private.
>
> I wanted to check Willy suggestion and modified h2s struct (added dummy
> fields):
>
> struct h2s {
> [...]
> uint16_t status; /* HTTP response status */
> unsigned long long body_len; /* remaining body length according to
> content-length if H2_SF_DATA_CLEN */
> struct buffer rxbuf; /* receive buffer, always valid (buf_empty or
> real buffer) */
> int dummy0;
> struct wait_event wait_event; /* Wait list, when we're attempting
> to send a RST but we can't send */
> int dummy1;
> struct wait_event *recv_wait; /* recv wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> int dummy2;
> struct wait_event *send_wait; /* send wait_event the conn_stream
> associated is waiting on (via h2_subscribe) */
> int dummy3;
> struct list list; /* To be used when adding in h2c->send_list or
> h2c->fctl_lsit */
> struct list sending_list; /* To be used when adding in
> h2c->sending_list */
> };
>
> With such modified h2s struct, the crash did not occur.
>
> I've checked HAProxy 2.1, it crashes like 2.0.
>
> I've also checked 2.2, bisection showed that this commit:
> http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
> fixed the crashes we experienced. I'm not sure how/if it fixed the memory
> corruption, it is possible that memory is still corrupted but not causing
> the crash.
>
>
>
> Fri, 25 Sep 2020 at 16:25 Kirill A. Korinsky wrote:
>
>> Very interesting.
>>
>> Anyway, I can see that this piece of code was refactored some time ago:
>> https://github.com/haproxy/haproxy/commit/f96508aae6b49277dcf142caa35042678cf8e2ca
>>
>> Maybe it is worth to try 2.2 or 2.3 branch?
>>
>> Yes, it is a blind shot and just a guess.
>>
>> --
>> wbr, Kirill
>>
>> On 25. Sep 2020, at 16:01, Maciej Zdeb  wrote:
>>
>> Yes at the same place with same value:
>>
>> (gdb) bt full
>> #0  0x559ce98b964b in h2s_notify_recv (h2s=0x559cef94e7e0) at
>> src/mux_h2.c:783
>> sw = 0x
>>
>>
>>
>> Fri, 25 Sep 2020 at 15:42 Kirill A. Korinsky wrote:
>>
>>> > On 25. Sep 2020, at 15:26, Maciej Zdeb  wrote:
>>> >
>>> > I was mailing outside the list with Willy and Christopher but it's
>>> worth sharing that the problem occurs even with nbthread = 1. I've managed
>>> to confirm it today.
>>>
>>>
>>> I'm curious is it crashed at the same place with the same value?
>>>
>>> --
>>> wbr, Kirill
>>>
>>>
>>>
>>
>


Re: [2.0.17] crash with coredump

2020-11-02 Thread Kirill A. Korinsky
Hi,

Thanks for update.

After reading Willy's recommendation and the commit that fixed the issue, I'm 
curious: can you "edit" this commit a bit and move `shut_tl` before `recv_wait` 
instead of the removed `wait_event`?

It is quite a dumb way to confirm that the memory corruption is gone, and not 
just moved somewhere else.

--
wbr, Kirill

> On 2. Nov 2020, at 10:58, Maciej Zdeb  wrote:
> 
> Hi,
> 
> Update for people on the list that might be interested in the issue, because 
> part of discussion was private.
> 
> I wanted to check Willy suggestion and modified h2s struct (added dummy 
> fields):
> 
> struct h2s {
> [...]
> uint16_t status; /* HTTP response status */
> unsigned long long body_len; /* remaining body length according to 
> content-length if H2_SF_DATA_CLEN */
> struct buffer rxbuf; /* receive buffer, always valid (buf_empty or 
> real buffer) */
> int dummy0;
> struct wait_event wait_event; /* Wait list, when we're attempting to 
> send a RST but we can't send */
> int dummy1;
> struct wait_event *recv_wait; /* recv wait_event the conn_stream 
> associated is waiting on (via h2_subscribe) */
> int dummy2;
> struct wait_event *send_wait; /* send wait_event the conn_stream 
> associated is waiting on (via h2_subscribe) */
> int dummy3;
> struct list list; /* To be used when adding in h2c->send_list or 
> h2c->fctl_lsit */
> struct list sending_list; /* To be used when adding in 
> h2c->sending_list */
> };
> 
> With such modified h2s struct, the crash did not occur.
> 
> I've checked HAProxy 2.1, it crashes like 2.0.
> 
> I've also checked 2.2, bisection showed that this commit: 
> http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
>  
> 
>  fixed the crashes we experienced. I'm not sure how/if it fixed the memory 
> corruption, it is possible that memory is still corrupted but not causing the 
> crash.
> 
> 
> 
> Fri, 25 Sep 2020 at 16:25 Kirill A. Korinsky wrote:
> Very interesting.
> 
> Anyway, I can see that this piece of code was refactored some time ago: 
> https://github.com/haproxy/haproxy/commit/f96508aae6b49277dcf142caa35042678cf8e2ca
>  
> 
> 
> Maybe it is worth to try 2.2 or 2.3 branch?
> 
> Yes, it is a blind shot and just a guess.
> 
> --
> wbr, Kirill
> 
>> On 25. Sep 2020, at 16:01, Maciej Zdeb > > wrote:
>> 
>> Yes at the same place with same value:
>> 
>> (gdb) bt full
>> #0  0x559ce98b964b in h2s_notify_recv (h2s=0x559cef94e7e0) at 
>> src/mux_h2.c:783
>> sw = 0x
>> 
>> 
>> 
>> Fri, 25 Sep 2020 at 15:42 Kirill A. Korinsky wrote:
>> > On 25. Sep 2020, at 15:26, Maciej Zdeb > > > wrote:
>> >
>> > I was mailing outside the list with Willy and Christopher but it's worth 
>> > sharing that the problem occurs even with nbthread = 1. I've managed to 
>> > confirm it today.
>> 
>> 
>> I'm curious is it crashed at the same place with the same value?
>> 
>> --
>> wbr, Kirill
>> 
>> 
> 





Re: [2.0.17] crash with coredump

2020-11-02 Thread Maciej Zdeb
Hi,

Update for people on the list that might be interested in the issue,
because part of the discussion was private.

I wanted to check Willy's suggestion and modified the h2s struct (added dummy
fields):

struct h2s {
[...]
uint16_t status; /* HTTP response status */
unsigned long long body_len; /* remaining body length according to
content-length if H2_SF_DATA_CLEN */
struct buffer rxbuf; /* receive buffer, always valid (buf_empty or
real buffer) */
int dummy0;
struct wait_event wait_event; /* Wait list, when we're attempting
to send a RST but we can't send */
int dummy1;
struct wait_event *recv_wait; /* recv wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
int dummy2;
struct wait_event *send_wait; /* send wait_event the conn_stream
associated is waiting on (via h2_subscribe) */
int dummy3;
struct list list; /* To be used when adding in h2c->send_list or
h2c->fctl_lsit */
struct list sending_list; /* To be used when adding in
h2c->sending_list */
};

With such modified h2s struct, the crash did not occur.
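
A minimal, self-contained sketch of the padding trick above (stand-in structs
with hypothetical field sets, not HAProxy's real h2s): the dummy words only
shift recv_wait to a different offset, so a stray write that always lands at a
fixed offset stops hitting it.

#include <stddef.h>
#include <stdio.h>

/* Stand-in structs for illustration only; the void pointers stand in for
 * the struct wait_event pointers of the real h2s. */
struct h2s_plain {
    char rxbuf[40];
    void *recv_wait;
    void *send_wait;
};

struct h2s_padded {
    char rxbuf[40];
    int dummy0;
    void *recv_wait;
    int dummy1;
    void *send_wait;
};

int main(void)
{
    printf("plain : recv_wait at offset %zu\n", offsetof(struct h2s_plain, recv_wait));
    printf("padded: recv_wait at offset %zu\n", offsetof(struct h2s_padded, recv_wait));
    return 0;
}

If the corruption follows the offset rather than the field, that points to an
out-of-bounds write from some unrelated object rather than to the h2 code
itself.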

I've checked HAProxy 2.1, it crashes like 2.0.

I've also checked 2.2, bisection showed that this commit:
http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37
fixed the crashes we experienced. I'm not sure how/if it fixed the memory
corruption, it is possible that memory is still corrupted but not causing
the crash.



Fri, 25 Sep 2020 at 16:25 Kirill A. Korinsky wrote:

> Very interesting.
>
> Anyway, I can see that this piece of code was refactored some time ago:
> https://github.com/haproxy/haproxy/commit/f96508aae6b49277dcf142caa35042678cf8e2ca
>
> Maybe it is worth to try 2.2 or 2.3 branch?
>
> Yes, it is a blind shot and just a guess.
>
> --
> wbr, Kirill
>
> On 25. Sep 2020, at 16:01, Maciej Zdeb  wrote:
>
> Yes at the same place with same value:
>
> (gdb) bt full
> #0  0x559ce98b964b in h2s_notify_recv (h2s=0x559cef94e7e0) at
> src/mux_h2.c:783
> sw = 0x
>
>
>
> Fri, 25 Sep 2020 at 15:42 Kirill A. Korinsky wrote:
>
>> > On 25. Sep 2020, at 15:26, Maciej Zdeb  wrote:
>> >
>> > I was mailing outside the list with Willy and Christopher but it's
>> worth sharing that the problem occurs even with nbthread = 1. I've managed
>> to confirm it today.
>>
>>
>> I'm curious is it crashed at the same place with the same value?
>>
>> --
>> wbr, Kirill
>>
>>
>>
>


Re: [2.0.17] crash with coredump

2020-09-25 Thread Kirill A. Korinsky
Very interesting.

Anyway, I can see that this piece of code was refactored some time ago: 
https://github.com/haproxy/haproxy/commit/f96508aae6b49277dcf142caa35042678cf8e2ca
 


Maybe it is worth trying the 2.2 or 2.3 branch?

Yes, it is a blind shot and just a guess.

--
wbr, Kirill

> On 25. Sep 2020, at 16:01, Maciej Zdeb  wrote:
> 
> Yes at the same place with same value:
> 
> (gdb) bt full
> #0  0x559ce98b964b in h2s_notify_recv (h2s=0x559cef94e7e0) at 
> src/mux_h2.c:783
> sw = 0x
> 
> 
> 
> Fri, 25 Sep 2020 at 15:42 Kirill A. Korinsky wrote:
> > On 25. Sep 2020, at 15:26, Maciej Zdeb  > > wrote:
> >
> > I was mailing outside the list with Willy and Christopher but it's worth 
> > sharing that the problem occurs even with nbthread = 1. I've managed to 
> > confirm it today.
> 
> 
> I'm curious is it crashed at the same place with the same value?
> 
> --
> wbr, Kirill
> 
> 





Re: [2.0.17] crash with coredump

2020-09-25 Thread Willy Tarreau
On Fri, Sep 25, 2020 at 03:26:47PM +0200, Kirill A. Korinsky wrote:
> > On 25. Sep 2020, at 15:06, Willy Tarreau  wrote:
> > 
> > Till here your analysis is right but:
> >  - the overflow would only be at most the number of extra threads running
> >init_genrand() concurrently, or more precisely the distance between
> >the most upfront to the latest thread, so in the worst case nbthread-1
> >hence there's no way to write a single location into a totally unrelated
> >structure without writing the nbthread-2 words that are between the end
> >of the MT array and the overwritten location ;
> 
> I don't think so.
> 
> We have two threads and steps
> 
> 1. T1: mti < N
> 2. T2: mti < N
> 3. T1:  mti++
> 4. T2:  mti++
> 5. T1: mt[mti] = (1812433253UL * (mt[mti-1] ^ (mt[mti-1] >> 30)) + mti);
> 6. T2: mt[mti] = (1812433253UL * (mt[mti-1] ^ (mt[mti-1] >> 30)) + mti);
> 
> Let's assume that mti = N - 1 at steps 1 and 2. In this case mti (aka MT->i)
> becomes N + 1, which overflows the mt array (MT->v).
> 
> So, when T1 writes something at step 5 it puts a random value into mti (aka
> MT->i). Here I assume that the state word is a dword the same size as a long
> and MT has no gaps between v and i.
> 
> If so, at step 6 we have mti with a random value and a write to an
> unpredictable place in memory.

Oh indeed you're totally right, I didn't notice that i was stored right after
the array! So yes, the index can be overwritten with a random value. However,
the probability that a random 32-bit value used as an offset relative to
the previous array remains valid without crashing the process is extremely
low, and the probability that each time it lands in the exact same field of
another totally unrelated structure is even lower.

> >  - the lua code is not threaded (there's a "big lock" around lua since
> >stacks cannot easily be protected, though Thierry has some ideas for
> >this).
> 
> What does not threaded mean here?

My understanding is that this code is loaded from within Lua. And you
have only one thread running Lua at a time, since Lua itself is not
thread safe (it can be built to use external locks but apparently no one
builds it like this by default so it's not usable this way), this is
protected by hlua_global_lock. So you won't have two competing calls
to lrandom from Lua.

However I agree that the bug you found (which I looked for and didn't
spot last time I checked) is real for threaded applications.

Willy



Re: [2.0.17] crash with coredump

2020-09-25 Thread Willy Tarreau
On Fri, Sep 25, 2020 at 03:26:05PM +0200, Maciej Zdeb wrote:
> > Here I can suggest to implement Yarrow PRGN (that is very simple to
> > implement) with some lua-pure cryptographic hash function.
> 
> We're using lrandom because of the algorithm Mersenne Twister and its well
> known weaknesses and strengths.
> 
> > In fact I know it's possible to call haproxy's internal sample fetch
> > functions from Lua (I never can figure how to do that, I always need
> > to lookup an example for this unfortunately). But once you figure out how
> > to do it, you can call the "rand()" sample fetch that will call the
> > internal thread-safe random number generator.
> 
> Rand() sample fetch cannot be seeded (at the time we checked) so on HAProxy
> servers with nbproc > 1 we got multiple sequences of the same random
> numbers - it was one of the reasons we couldn't use it.

That was fixed long ago in 2.2-dev4 exactly for this reason:

  commit 52bf839394e683eec2fa8aafff5a0dd51d2dd365
  Author: Willy Tarreau 
  Date:   Sun Mar 8 00:42:37 2020 +0100

BUG/MEDIUM: random: implement a thread-safe and process-safe PRNG

This is the replacement of failed attempt to add thread safety and
per-process sequences of random numbers initally tried with commit
1c306aa84d ("BUG/MEDIUM: random: implement per-thread and per-process
random sequences").
(...)
It supports fast jumps allowing to cut the period into smaller
non-overlapping sequences, which we use here to support up to 2^32
processes each having their own, non-overlapping sequence of 2^96
numbers (~7*10^28). This is enough to provide 1 billion randoms per
second and per process for 2200 billion years.

This was backported into 2.0.14 as well. So if you know how to use it
you definitely can right now. But as I mentioned, the thread-unsafety
of lrandom isn't related to your issue at the moment anyway.

> I was mailing outside the list with Willy and Christopher but it's worth
> sharing that the problem occurs even with nbthread = 1. I've managed to
> confirm it today.

Yes, thanks for the info by the way, however being blocked on something
else at the moment I didn't have the opportunity to have a look at it yet.

Regards,
Willy



Re: [2.0.17] crash with coredump

2020-09-25 Thread Maciej Zdeb
Yes, at the same place with the same value:

(gdb) bt full
#0  0x559ce98b964b in h2s_notify_recv (h2s=0x559cef94e7e0) at
src/mux_h2.c:783
sw = 0x



Fri, 25 Sep 2020 at 15:42 Kirill A. Korinsky wrote:

> > On 25. Sep 2020, at 15:26, Maciej Zdeb  wrote:
> >
> > I was mailing outside the list with Willy and Christopher but it's worth
> sharing that the problem occurs even with nbthread = 1. I've managed to
> confirm it today.
>
>
> I'm curious is it crashed at the same place with the same value?
>
> --
> wbr, Kirill
>
>
>


Re: [2.0.17] crash with coredump

2020-09-25 Thread Kirill A. Korinsky
> On 25. Sep 2020, at 15:26, Maciej Zdeb  wrote:
> 
> I was mailing outside the list with Willy and Christopher but it's worth 
> sharing that the problem occurs even with nbthread = 1. I've managed to 
> confirm it today.


I'm curious: did it crash at the same place with the same value?

--
wbr, Kirill






Re: [2.0.17] crash with coredump

2020-09-25 Thread Kirill A. Korinsky
> On 25. Sep 2020, at 15:06, Willy Tarreau  wrote:
> 
> Till here your analysis is right but:
>  - the overflow would only be at most the number of extra threads running
>init_genrand() concurrently, or more precisely the distance between
>the most upfront to the latest thread, so in the worst case nbthread-1
>hence there's no way to write a single location into a totally unrelated
>structure without writing the nbthread-2 words that are between the end
>of the MT array and the overwritten location ;

I don't think so.

We have two threads and steps

1. T1: mti < N
2. T2: mti < N
3. T1:  mti++
4. T2:  mti++
5. T1: mt[mti] = (1812433253UL * (mt[mti-1] ^ (mt[mti-1] >> 30)) + mti);
6. T2: mt[mti] = (1812433253UL * (mt[mti-1] ^ (mt[mti-1] >> 30)) + mti);

Let's assume that mti = N - 1 at steps 1 and 2. In this case mti (aka MT->i)
becomes N + 1, which overflows the mt array (MT->v).

So, when T1 writes something at step 5 it puts a random value into mti (aka
MT->i). Here I assume that the state word is a dword the same size as a long
and MT has no gaps between v and i.

If so, at step 6 we have mti with a random value and a write to an
unpredictable place in memory.
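
For what it's worth, here is a compile-clean sketch of that layout hazard (the
field sizes and names are assumptions, not lrandom's exact source): because i
is stored immediately after v, the out-of-bounds slot v[N] and the index i
occupy the same memory.

#include <assert.h>
#include <stddef.h>

#define N 624

/* Assumed layout: state words followed directly by the index, no gap. */
struct gen {
    unsigned long v[N];   /* Mersenne Twister state words */
    unsigned long i;      /* next index, right after the array */
};

int main(void)
{
    /* i starts at the first byte past the end of v, so a store into v[N]
     * (one past the end) overwrites the index itself -- which is exactly
     * what happens when two racing threads both push mti up to N before
     * the refill in steps 5 and 6 above. */
    assert(offsetof(struct gen, i) == sizeof(unsigned long) * N);
    return 0;
}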

>  - the lua code is not threaded (there's a "big lock" around lua since
>stacks cannot easily be protected, though Thierry has some ideas for
>this).

What does not threaded mean here?

I can understand it either as (1) all Lua functions are executed from only one 
thread, or (2) one call is executed on a single thread but different threads can 
execute the same Lua code for different TCP/HTTP streams at the same time.


--
wbr, Kirill





Re: [2.0.17] crash with coredump

2020-09-25 Thread Maciej Zdeb
> Here I can suggest implementing the Yarrow PRNG (which is very simple to
> implement) with some pure-Lua cryptographic hash function.

We're using lrandom because of the Mersenne Twister algorithm and its well-known
weaknesses and strengths.

> In fact I know it's possible to call haproxy's internal sample fetch
> functions from Lua (I never can figure how to do that, I always need
> to lookup an example for this unfortunately). But once you figure out how
> to do it, you can call the "rand()" sample fetch that will call the
> internal thread-safe random number generator.

The rand() sample fetch could not be seeded (at the time we checked), so on
HAProxy servers with nbproc > 1 every process produced the same sequence of
random numbers - it was one of the reasons we couldn't use it.

I was mailing outside the list with Willy and Christopher but it's worth
sharing that the problem occurs even with nbthread = 1. I've managed to
confirm it today.

Fri, 25 Sep 2020 at 15:06 Willy Tarreau wrote:

> Hi Kirill,
>
> On Fri, Sep 25, 2020 at 12:34:16PM +0200, Kirill A. Korinsky wrote:
> > I've extracted a piece of code from lrandom and put it here:
> > https://gist.github.com/catap/bf862cc0d289083fc1ccd38c905e2416
> > 
> >
> > You can see that object generator contains N words (and here it is 624),
> and
> > I use an assumption that Maciej's code doesn't create a new generator for
> > each request and share lrandom.
> >
> > Idea of this RNG is initialize each N words via init_genrand and it
> checking
> > that all of them are used, and after one generated a new ones.
> >
> > Let assume that we called genrand_int32 at the same moment from two
> threads.
> > If condition at lines 39 and 43 are true we start to initialize the next
> > words at both threads.
> >
> > You can see that we can easy move outside of v array at line 21 because
> two
> > threads are increasing i field, and put some random number to i field.
>
> Till here your analysis is right but:
>   - the overflow would only be at most the number of extra threads running
> init_genrand() concurrently, or more precisely the distance between
> the most upfront to the latest thread, so in the worst case nbthread-1
> hence there's no way to write a single location into a totally
> unrelated
> structure without writing the nbthread-2 words that are between the end
> of the MT array and the overwritten location ;
>
>   - the lua code is not threaded (there's a "big lock" around lua since
> stacks cannot easily be protected, though Thierry has some ideas for
> this).
>
> > And when the second thread is going to line 27 and nobody knows where it
> put 0x
>
> Actually init_genrand() doesn't actually *write* 0x but only writes
> *something* of 64 bits size and applies a 32-bit mask over it, so it would
> have written 32 bits pseudo-random bits followed by 4 zero bytes.
>
> > How can it be proved / solved?
> >
> > I see a few possible options:
> > 1. Switch off threads inside haproxy
> > 2. Use dedicated lrandom per thread
> > 3. Move away from lrandom
> >
> > As I understand lrandom is using here because it is very fast and
> secure, and
> > reading from /dev/urandom isn't an option.
> >
> > Here I can suggest to implement Yarrow PRGN (that is very simple to
> > implement) with some lua-pure cryptographic hash function.
>
> In fact I know it's possible to call haproxy's internal sample fetch
> functions from Lua (I never can figure how to do that, I always need
> to lookup an example for this unfortunately). But once you figure how
> to do it, you can call the "rand()" sample fetch that will call the
> internal thread-safe random number generator.
>
> Regards,
> Willy
>


Re: [2.0.17] crash with coredump

2020-09-25 Thread Willy Tarreau
Hi Kirill,

On Fri, Sep 25, 2020 at 12:34:16PM +0200, Kirill A. Korinsky wrote:
> I've extracted a piece of code from lrandom and put it here:
> https://gist.github.com/catap/bf862cc0d289083fc1ccd38c905e2416
> 
> 
> You can see that object generator contains N words (and here it is 624), and
> I use an assumption that Maciej's code doesn't create a new generator for
> each request and share lrandom.
> 
> Idea of this RNG is initialize each N words via init_genrand and it checking
> that all of them are used, and after one generated a new ones.
> 
> Let assume that we called genrand_int32 at the same moment from two threads.
> If condition at lines 39 and 43 are true we start to initialize the next
> words at both threads.
> 
> You can see that we can easy move outside of v array at line 21 because two
> threads are increasing i field, and put some random number to i field.

Till here your analysis is right but:
  - the overflow would only be at most the number of extra threads running
init_genrand() concurrently, or more precisely the distance between
the most upfront to the latest thread, so in the worst case nbthread-1
hence there's no way to write a single location into a totally unrelated
structure without writing the nbthread-2 words that are between the end
of the MT array and the overwritten location ;

  - the lua code is not threaded (there's a "big lock" around lua since
stacks cannot easily be protected, though Thierry has some ideas for
this).

> And when the second thread is going to line 27 and nobody knows where it put 
> 0x

Actually init_genrand() doesn't actually *write* 0x but only writes
*something* 64 bits in size and applies a 32-bit mask over it, so it would
have written 32 pseudo-random bits followed by 4 zero bytes.
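
A tiny stand-alone illustration of that point (assuming a 64-bit long; the
constant is the one from the init_genrand() recurrence quoted above): the
32-bit mask on a 64-bit slot leaves the upper four bytes zero.

#include <stdio.h>

int main(void)
{
    unsigned long slot = 0xdeadbeefcafef00dUL;  /* arbitrary previous content */

    /* Same shape as the recurrence in steps 5/6, then the 32-bit mask. */
    slot = 1812433253UL * (slot ^ (slot >> 30)) + 1;
    slot &= 0xffffffffUL;

    printf("stored value: 0x%016lx\n", slot);   /* upper 32 bits are zero */
    return 0;
}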

> How can it be proved / solved?
> 
> I see a few possible options:
> 1. Switch off threads inside haproxy
> 2. Use dedicated lrandom per thread
> 3. Move away from lrandom
> 
> As I understand lrandom is using here because it is very fast and secure, and
> reading from /dev/urandom isn't an option.
> 
> Here I can suggest to implement Yarrow PRGN (that is very simple to
> implement) with some lua-pure cryptographic hash function.

In fact I know it's possible to call haproxy's internal sample fetch
functions from Lua (I can never figure out how to do that, I always need
to look up an example for this unfortunately). But once you figure out how
to do it, you can call the "rand()" sample fetch that will call the
internal thread-safe random number generator.

Regards,
Willy



Re: [2.0.17] crash with coredump

2020-09-25 Thread Maciej Zdeb
Hi Kirill,

Thanks for your hints and time! Unfortunately, I think lrandom is not the
cause of the crash. We've been using lrandom with threads for a couple of
months on our other servers without any crash. I think Lua in HAProxy is
executed in a single thread, so your analysis is correct but its assumption -
"Let's assume that we called genrand_int32 at the same moment from two
threads." - is never true in the HAProxy environment.

I suspect something is going on in SPOE or the Lua scripts from the external
vendor. I'll share more details as soon as I confirm it is in SPOE or Lua.


Fri, 25 Sep 2020 at 12:34 Kirill A. Korinsky wrote:

> Good day,
>
> I'd like to share with your my two cents regarding this topic:
>
> lrandom (PRNG for lua, we're using it for 2 or 3 years without any
> problems, and soon we will drop it from our build)
>
>
> Never heard of this last one, not that it would make it suspicious at
> all, just that it might indicate you're having a slightly different
> workload than most common ones and can help spotting directions where
> to look for the problem.
>
>
>
> As far as I know Haproxy is using threads by default for some time and I
> assume that Maciej's setup doesn't change anything and it had enabled
> threads.
>
> If so I believe that lrandom is the root cause of this issue.
>
> I've extracted a piece of code from lrandom and put it here:
> https://gist.github.com/catap/bf862cc0d289083fc1ccd38c905e2416
>
> You can see that object generator contains N words (and here it is 624),
> and I use an assumption that Maciej's code doesn't create a new generator
> for each request and share lrandom.
>
> Idea of this RNG is initialize each N words via init_genrand and it
> checking that all of them are used, and after one generated a new ones.
>
> Let assume that we called genrand_int32 at the same moment from two
> threads. If condition at lines 39 and 43 are true we start to initialize
> the next words at both threads.
>
> You can see that we can easy move outside of v array at line 21 because
> two threads are increasing i field, and put some random number to i field.
>
> And when the second thread is going to line 27 and nobody knows where it
> put 0x
>
> Let me quote Willy Tarreau:
>
> In the trace it's said that sw = 0x. Looking at all places where
> h2s->recv_wait() is modified, it's either NULL or a valid pointer to some
> structure. We could have imagined that for whatever reason h2s is wrong
> here, but this call only happens when its state is still valid, and it
> experiences double dereferences before landing here, which tends to
> indicate that the h2s pointer is OK. Thus the only hypothesis I can have
> for now is memory corruption :-/ That field would get overwritten with
> (int)-1 for whatever reason, maybe a wrong cast somewhere, but it's not
> as if we had many of these.
>
>
> and base on this I believe that it is the case.
>
> How can it be proved / solved?
>
> I see a few possible options:
> 1. Switch off threads inside haproxy
> 2. Use dedicated lrandom per thread
> 3. Move away from lrandom
>
> As I understand lrandom is using here because it is very fast and secure,
> and reading from /dev/urandom isn't an option.
>
> Here I can suggest to implement Yarrow PRGN (that is very simple to
> implement) with some lua-pure cryptographic hash function.
>
> --
> wbr, Kirill
>
>


Re: [2.0.17] crash with coredump

2020-09-25 Thread Kirill A. Korinsky
Good day,

I'd like to share with your my two cents regarding this topic:

>> lrandom (PRNG for lua, we're using it for 2 or 3 years without any
>> problems, and soon we will drop it from our build)
> 
> Never heard of this last one, not that it would make it suspicious at
> all, just that it might indicate you're having a slightly different
> workload than most common ones and can help spotting directions where
> to look for the problem.


As far as I know HAProxy has been using threads by default for some time, and I
assume that Maciej's setup doesn't change that and has threads enabled.

If so I believe that lrandom is the root cause of this issue.

I've extracted a piece of code from lrandom and put it here: 
https://gist.github.com/catap/bf862cc0d289083fc1ccd38c905e2416 


You can see that the generator object contains N words (here it is 624), and I
assume that Maciej's code doesn't create a new generator for each request but
shares a single lrandom instance.

The idea of this RNG is to initialize all N words via init_genrand; it checks
that all of them have been used, and afterwards generates new ones.

Let's assume that we called genrand_int32 at the same moment from two threads.
If the conditions at lines 39 and 43 are true, we start to initialize the next
words in both threads.

You can see that we can easily move outside of the v array at line 21 because
two threads are increasing the i field, and put some random number into the i
field.

And when the second thread is going to line 27 and nobody knows where it put 
0x

Let me quote Willy Tarreau:

> In the trace it's said that sw = 0x. Looking at all places where
> h2s->recv_wait() is modified, it's either NULL or a valid pointer to some
> structure. We could have imagined that for whatever reason h2s is wrong
> here, but this call only happens when its state is still valid, and it
> experiences double dereferences before landing here, which tends to
> indicate that the h2s pointer is OK. Thus the only hypothesis I can have
> for now is memory corruption :-/ That field would get overwritten with
> (int)-1 for whatever reason, maybe a wrong cast somewhere, but it's not
> as if we had many of these.

and based on this I believe that this is the case.

How can it be proved / solved?

I see a few possible options:
1. Switch off threads inside haproxy
2. Use dedicated lrandom per thread
3. Move away from lrandom

As I understand, lrandom is used here because it is very fast and secure, and
reading from /dev/urandom isn't an option.

Here I can suggest implementing the Yarrow PRNG (which is very simple to
implement) with some pure-Lua cryptographic hash function.

--
wbr, Kirill





Re: [2.0.17] crash with coredump

2020-09-17 Thread Willy Tarreau
On Thu, Sep 17, 2020 at 10:56:39AM +0200, Maciej Zdeb wrote:
> Hi,
> 
> Our config is quite complex and I'm trying to narrow it down. It is
> occurring only on one production haproxy cluster (which consists of 6
> servers in each of two data centers) with significant load - crashes occurs
> on random servers so I would exclude memory corruption.

When I'm saying "memory corruption" I don't necessarily mean hardware
error, most likely a software error. For example a write after free to
a memory location, or any such thing, which typically happens outside
of the visible code path.

> I'm suspecting SPOE or/and LUA script both are used to send metadata about
> each request to an external endpoint. Yesterday I disabled this feature in
> one datacenter to verify.

Great!

> Our build is done in docker (Ubuntu bionic) with kernel 4.9.184-linuxkit,
> crash is on Ubuntu bionic 4.15.0-55-generic, using:
> haproxy 2.0.17
> openssl 1.1.1f
> pcre 8.44
> lua 5.3.5
> lrandom (PRNG for lua, we're using it for 2 or 3 years without any
> problems, and soon we will drop it from our build)

Never heard of this last one, not that it would make it suspicious at
all, just that it might indicate you're having a slightly different
workload than most common ones and can help spotting directions where
to look for the problem.

> compiled in following way:
(...)

OK, nothing unusual here, thanks for the details.

Let's wait for your new tests to narrow down the issue a little bit more,
then.

Thanks,
Willy



Re: [2.0.17] crash with coredump

2020-09-17 Thread Maciej Zdeb
Hi,

Our config is quite complex and I'm trying to narrow it down. It is
occurring only on one production haproxy cluster (which consists of 6
servers in each of two data centers) with significant load - crashes occur
on random servers so I would exclude memory corruption.

I'm suspecting SPOE and/or the Lua script; both are used to send metadata about
each request to an external endpoint. Yesterday I disabled this feature in
one datacenter to verify.

Our build is done in docker (Ubuntu bionic) with kernel 4.9.184-linuxkit,
crash is on Ubuntu bionic 4.15.0-55-generic, using:
haproxy 2.0.17
openssl 1.1.1f
pcre 8.44
lua 5.3.5
lrandom (PRNG for lua, we're using it for 2 or 3 years without any
problems, and soon we will drop it from our build)

compiled in following way:

# LUA
wget http://www.lua.org/ftp/lua-$LUA_VERSION.tar.gz \
&& tar -zxf lua-$LUA_VERSION.tar.gz \
&& cd lua-$LUA_VERSION \
&& make linux test \
&& make install

# LUA LRANDOM
wget http://webserver2.tecgraf.puc-rio.br/~lhf/ftp/lua/ar/lrandom-100.tar.gz \
&& tar -zxf lrandom-100.tar.gz \
&& make -C lrandom-100 \
&& make -C lrandom-100 install

# PCRE
wget https://ftp.pcre.org/pub/pcre/pcre-$PCRE_VERSION.tar.gz \
&& tar -zxf pcre-$PCRE_VERSION.tar.gz \
&& cd pcre-$PCRE_VERSION \
&& ./configure --prefix=/usr/lib/haproxy/pcre_$PCRE_VERSION --enable-jit --enable-utf --enable-unicode-properties --disable-silent-rules \
&& make \
&& make install

# OPENSSL
wget https://www.openssl.org/source/openssl-$SSL_VERSION.tar.gz \
&& tar -zxf openssl-$SSL_VERSION.tar.gz \
&& cd openssl-$SSL_VERSION \
&& ./Configure --openssldir=/usr/lib/haproxy/openssl_$SSL_VERSION --prefix=/usr/lib/haproxy/openssl_$SSL_VERSION -Wl,-rpath=/usr/lib/haproxy/openssl_$SSL_VERSION/lib shared no-idea linux-x86_64 \
&& make depend \
&& make \
&& make install_sw

and finally haproxy is compiled using deb builder:

override_dh_auto_build:
make TARGET=$(HAP_TARGET) DEFINE="-DIP_BIND_ADDRESS_NO_PORT=24 -DMAX_SESS_STKCTR=12" USE_PCRE=1 USE_PCRE_JIT=1 PCRE_INC=/usr/lib/haproxy/pcre_$(PCRE_VERSION)/include PCRE_LIB="/usr/lib/haproxy/pcre_$(PCRE_VERSION)/lib -Wl,-rpath,/usr/lib/haproxy/pcre_$(PCRE_VERSION)/lib" USE_GETADDRINFO=1 USE_OPENSSL=1 SSL_INC=/usr/lib/haproxy/openssl_$(SSL_VERSION)/include SSL_LIB="/usr/lib/haproxy/openssl_$(SSL_VERSION)/lib -Wl,-rpath,/usr/lib/haproxy/openssl_$(SSL_VERSION)/lib" ADDLIB=-ldl USE_ZLIB=1 USE_DL=1 USE_LUA=1 USE_REGPARM=1

DIP_BIND_ADDRESS_NO_PORT is now obsolete and we'll drop it
MAX_SESS_STKCTR=12: we need more stick tables

Kind regards,


Thu, 17 Sep 2020 at 08:18 Willy Tarreau wrote:

> Hi guys,
>
> On Thu, Sep 17, 2020 at 11:05:31AM +1000, Igor Cicimov wrote:
> (...)
> > > Coredump fragment from thread1:
> > > (gdb) bt
> > > #0  0x55cbbf6ed64b in h2s_notify_recv (h2s=0x7f65b8b55130) at
> > > src/mux_h2.c:783
>
> So the code is this one:
>
>777  static void __maybe_unused h2s_notify_recv(struct h2s *h2s)
>778  {
>779  struct wait_event *sw;
>780
>781  if (h2s->recv_wait) {
>782  sw = h2s->recv_wait;
>783  sw->events &= ~SUB_RETRY_RECV;
>784  tasklet_wakeup(sw->tasklet);
>785  h2s->recv_wait = NULL;
>786  }
>787  }
>
> In the trace it's said that sw = 0x. Looking at all places where
> h2s->recv_wait() is modified, it's either NULL or a valid pointer to some
> structure. We could have imagined that for whatever reason h2s is wrong
> here, but this call only happens when its state is still valid, and it
> experiences double dereferences before landing here, which tends to
> indicate that the h2s pointer is OK. Thus the only hypothesis I can have
> for now is memory corruption :-/ That field would get overwritten with
> (int)-1 for whatever reason, maybe a wrong cast somewhere, but it's not
> as if we had many of these.
>
> > I'm not one of the devs but obviously many of us using v2.0 will be
> > interested in the answer. Assuming you do not install from packages can
> you
> > please provide some more background on how you produce the binary, like
> if
> > you compile then what OS and kernel is this compiled on and what OS and
> > kernel this crashes on? Again if compiled any other custom compiled
> > packages in use, like OpenSSL, lua etc, you might be using or have
> compiled
> > haproxy against etc.?
> >
> > Also if this is a bug and you have hit some corner case with your config
> > (many are using 2.0 but we have not seen crashes) you should provide a
> > stripped down version (not too stripped though just the sensitive data)
> of
> > your config too.
>
> I agree with Igor here, any info to try to narrow down a reproducer, both
> in terms of config and operations, would be tremendously helpful!
>
> Thanks,
> Willy
>


Re: [2.0.17] crash with coredump

2020-09-17 Thread Willy Tarreau
Hi guys,

On Thu, Sep 17, 2020 at 11:05:31AM +1000, Igor Cicimov wrote:
(...)
> > Coredump fragment from thread1:
> > (gdb) bt
> > #0  0x55cbbf6ed64b in h2s_notify_recv (h2s=0x7f65b8b55130) at
> > src/mux_h2.c:783

So the code is this one:

   777  static void __maybe_unused h2s_notify_recv(struct h2s *h2s)
   778  {
   779  struct wait_event *sw;
   780  
   781  if (h2s->recv_wait) {
   782  sw = h2s->recv_wait;
   783  sw->events &= ~SUB_RETRY_RECV;
   784  tasklet_wakeup(sw->tasklet);
   785  h2s->recv_wait = NULL;
   786  }
   787  }

In the trace it's said that sw = 0x. Looking at all places where
h2s->recv_wait() is modified, it's either NULL or a valid pointer to some
structure. We could have imagined that for whatever reason h2s is wrong
here, but this call only happens when its state is still valid, and it
experiences double dereferences before landing here, which tends to
indicate that the h2s pointer is OK. Thus the only hypothesis I can have
for now is memory corruption :-/ That field would get overwritten with
(int)-1 for whatever reason, maybe a wrong cast somewhere, but it's not
as if we had many of these.
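
For what it's worth, a tiny illustration of the "wrong cast" theory (hypothetical
code, not anything found in HAProxy, and assuming an LP64 platform): an (int)-1
stored through a mis-typed pointer is sign-extended to the all-ones pattern seen
for recv_wait in the trace.

#include <stdio.h>

int main(void)
{
    void *recv_wait = NULL;

    /* Deliberately bogus store through the wrong type: the (int)-1 is
     * sign-extended to 64 bits before being written. */
    *(long *)&recv_wait = (int)-1;

    printf("recv_wait = %p\n", recv_wait);  /* 0xffffffffffffffff */
    return 0;
}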

> I'm not one of the devs but obviously many of us using v2.0 will be
> interested in the answer. Assuming you do not install from packages can you
> please provide some more background on how you produce the binary, like if
> you compile then what OS and kernel is this compiled on and what OS and
> kernel this crashes on? Again if compiled any other custom compiled
> packages in use, like OpenSSL, lua etc, you might be using or have compiled
> haproxy against etc.?
> 
> Also if this is a bug and you have hit some corner case with your config
> (many are using 2.0 but we have not seen crashes) you should provide a
> stripped down version (not too stripped though just the sensitive data) of
> your config too.

I agree with Igor here, any info to try to narrow down a reproducer, both
in terms of config and operations, would be tremendously helpful!

Thanks,
Willy



Re: [2.0.17] crash with coredump

2020-09-16 Thread Igor Cicimov
Hi Maciej,

On Wed, Sep 16, 2020 at 9:00 PM Maciej Zdeb  wrote:

> Hi,
>
> Our HAProxy (2.0.14) started to crash, so first we upgraded to 2.0.17 but
> it didn't help. Below you'll find traces from coredump
>
> Version:
> HA-Proxy version 2.0.17 2020/07/31 - https://haproxy.org/
> Build options :
>   TARGET  = linux-glibc
>   CPU = generic
>   CC  = gcc
>   CFLAGS  = -O0 -g -fno-strict-aliasing -Wdeclaration-after-statement
> -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter
> -Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered
> -Wno-missing-field-initializers -Wno-implicit-fallthrough
> -Wno-stringop-overflow -Wtype-limits -Wshift-negative-value
> -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
> -DIP_BIND_ADDRESS_NO_PORT=24 -DMAX_SESS_STKCTR=12
>   OPTIONS = USE_PCRE=1 USE_PCRE_JIT=1 USE_REGPARM=1 USE_GETADDRINFO=1
> USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_DL=1
>
> Feature list : +EPOLL -KQUEUE -MY_EPOLL -MY_SPLICE +NETFILTER +PCRE
> +PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED
> +REGPARM -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE
> +LIBCRYPT +CRYPT_H -VSYSCALL +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4
> -MY_ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS
> -51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS
>
> Default settings :
>   bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
>
> Built with multi-threading support (MAX_THREADS=64, default=4).
> Built with OpenSSL version : OpenSSL 1.1.1f  31 Mar 2020
> Running on OpenSSL version : OpenSSL 1.1.1f  31 Mar 2020
> OpenSSL library supports TLS extensions : yes
> OpenSSL library supports SNI : yes
> OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
> Built with Lua version : Lua 5.3.5
> Built with network namespace support.
> Built with transparent proxy support using: IP_TRANSPARENT
> IPV6_TRANSPARENT IP_FREEBIND
> Built with zlib version : 1.2.11
> Running on zlib version : 1.2.11
> Compression algorithms supported : identity("identity"),
> deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
> Built with PCRE version : 8.44 2020-02-12
> Running on PCRE version : 8.44 2020-02-12
> PCRE library supports JIT : yes
> Encrypted password support via crypt(3): yes
>
> Available polling systems :
>   epoll : pref=300,  test result OK
>poll : pref=200,  test result OK
>  select : pref=150,  test result OK
> Total: 3 (3 usable), will use epoll.
>
> Available multiplexer protocols :
> (protocols marked as  cannot be specified using 'proto' keyword)
>   h2 : mode=HTTP   side=FEmux=H2
>   h2 : mode=HTXside=FE|BE mux=H2
> : mode=HTXside=FE|BE mux=H1
> : mode=TCP|HTTP   side=FE|BE mux=PASS
>
> Available services : none
>
> Available filters :
> [SPOE] spoe
> [COMP] compression
> [CACHE] cache
> [TRACE] trace
>
>
> Coredump fragment from thread1:
> (gdb) bt
> #0  0x55cbbf6ed64b in h2s_notify_recv (h2s=0x7f65b8b55130) at
> src/mux_h2.c:783
> #1  0x55cbbf6edbc7 in h2s_close (h2s=0x7f65b8b55130) at
> src/mux_h2.c:921
> #2  0x55cbbf6f9745 in h2s_htx_make_trailers (h2s=0x7f65b8b55130,
> htx=0x7f65a9c34f20) at src/mux_h2.c:5385
> #3  0x55cbbf6fa48e in h2_snd_buf (cs=0x7f65b8c48a40,
> buf=0x7f65d05291b8, count=2, flags=1) at src/mux_h2.c:5694
> #4  0x55cbbf7cdde3 in si_cs_send (cs=0x7f65b8c48a40) at
> src/stream_interface.c:762
> #5  0x55cbbf7ce839 in stream_int_chk_snd_conn (si=0x7f65d0529478) at
> src/stream_interface.c:1145
> #6  0x55cbbf7cc9d6 in si_chk_snd (si=0x7f65d0529478) at
> include/proto/stream_interface.h:496
> #7  0x55cbbf7cd559 in stream_int_notify (si=0x7f65d05294d0) at
> src/stream_interface.c:510
> #8  0x55cbbf7cda33 in si_cs_process (cs=0x55cbca178f90) at
> src/stream_interface.c:644
> #9  0x55cbbf7cdfb1 in si_cs_io_cb (t=0x0, ctx=0x7f65d05294d0, state=1)
> at src/stream_interface.c:817
> #10 0x55cbbf81af32 in process_runnable_tasks () at src/task.c:415
> #11 0x55cbbf740cc0 in run_poll_loop () at src/haproxy.c:2701
> #12 0x55cbbf741188 in run_thread_poll_loop (data=0x1) at
> src/haproxy.c:2840
> #13 0x7f667fbdb6db in start_thread (arg=0x7f65d5556700) at
> pthread_create.c:463
> #14 0x7f667e764a3f in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>
> (gdb) bt full
> #0  0x55cbbf6ed64b in h2s_notify_recv (h2s=0x7f65b8b55130) at
> src/mux_h2.c:783
> sw = 0x
> #1  0x55cbbf6edbc7 in h2s_close (h2s=0x7f65b8b55130) at
> src/mux_h2.c:921
> No locals.
> #2  0x55cbbf6f9745 in h2s_htx_make_trailers (h2s=0x7f65b8b55130,
> htx=0x7f65a9c34f20) at src/mux_h2.c:5385
> list = {{n = {ptr = 0x55cbbf88d18c "", len = 0}, v = {ptr =
> 0x7f65a9c34f20 "�?", len = 94333580136844}}, {n = {ptr = 0x7f65d5532410 "
> Oée\177", len = 94333579742112}, v = {ptr = 0x7f65a9c38e78 "�\001", len =
> 140074616573728}}, {n = {
>   

[2.0.17] crash with coredump

2020-09-16 Thread Maciej Zdeb
Hi,

Our HAProxy (2.0.14) started to crash, so first we upgraded to 2.0.17 but
it didn't help. Below you'll find traces from the coredump.

Version:
HA-Proxy version 2.0.17 2020/07/31 - https://haproxy.org/
Build options :
  TARGET  = linux-glibc
  CPU = generic
  CC  = gcc
  CFLAGS  = -O0 -g -fno-strict-aliasing -Wdeclaration-after-statement
-fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter
-Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered
-Wno-missing-field-initializers -Wno-implicit-fallthrough
-Wno-stringop-overflow -Wtype-limits -Wshift-negative-value
-Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference
-DIP_BIND_ADDRESS_NO_PORT=24 -DMAX_SESS_STKCTR=12
  OPTIONS = USE_PCRE=1 USE_PCRE_JIT=1 USE_REGPARM=1 USE_GETADDRINFO=1
USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_DL=1

Feature list : +EPOLL -KQUEUE -MY_EPOLL -MY_SPLICE +NETFILTER +PCRE
+PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED
+REGPARM -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE
+LIBCRYPT +CRYPT_H -VSYSCALL +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4
-MY_ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT -DEVICEATLAS
-51DEGREES -WURFL -SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_THREADS=64, default=4).
Built with OpenSSL version : OpenSSL 1.1.1f  31 Mar 2020
Running on OpenSSL version : OpenSSL 1.1.1f  31 Mar 2020
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.5
Built with network namespace support.
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with PCRE version : 8.44 2020-02-12
Running on PCRE version : 8.44 2020-02-12
PCRE library supports JIT : yes
Encrypted password support via crypt(3): yes

Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as  cannot be specified using 'proto' keyword)
  h2 : mode=HTTP   side=FEmux=H2
  h2 : mode=HTXside=FE|BE mux=H2
: mode=HTXside=FE|BE mux=H1
: mode=TCP|HTTP   side=FE|BE mux=PASS

Available services : none

Available filters :
[SPOE] spoe
[COMP] compression
[CACHE] cache
[TRACE] trace


Coredump fragment from thread1:
(gdb) bt
#0  0x55cbbf6ed64b in h2s_notify_recv (h2s=0x7f65b8b55130) at
src/mux_h2.c:783
#1  0x55cbbf6edbc7 in h2s_close (h2s=0x7f65b8b55130) at src/mux_h2.c:921
#2  0x55cbbf6f9745 in h2s_htx_make_trailers (h2s=0x7f65b8b55130,
htx=0x7f65a9c34f20) at src/mux_h2.c:5385
#3  0x55cbbf6fa48e in h2_snd_buf (cs=0x7f65b8c48a40,
buf=0x7f65d05291b8, count=2, flags=1) at src/mux_h2.c:5694
#4  0x55cbbf7cdde3 in si_cs_send (cs=0x7f65b8c48a40) at
src/stream_interface.c:762
#5  0x55cbbf7ce839 in stream_int_chk_snd_conn (si=0x7f65d0529478) at
src/stream_interface.c:1145
#6  0x55cbbf7cc9d6 in si_chk_snd (si=0x7f65d0529478) at
include/proto/stream_interface.h:496
#7  0x55cbbf7cd559 in stream_int_notify (si=0x7f65d05294d0) at
src/stream_interface.c:510
#8  0x55cbbf7cda33 in si_cs_process (cs=0x55cbca178f90) at
src/stream_interface.c:644
#9  0x55cbbf7cdfb1 in si_cs_io_cb (t=0x0, ctx=0x7f65d05294d0, state=1)
at src/stream_interface.c:817
#10 0x55cbbf81af32 in process_runnable_tasks () at src/task.c:415
#11 0x55cbbf740cc0 in run_poll_loop () at src/haproxy.c:2701
#12 0x55cbbf741188 in run_thread_poll_loop (data=0x1) at
src/haproxy.c:2840
#13 0x7f667fbdb6db in start_thread (arg=0x7f65d5556700) at
pthread_create.c:463
#14 0x7f667e764a3f in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95

(gdb) bt full
#0  0x55cbbf6ed64b in h2s_notify_recv (h2s=0x7f65b8b55130) at
src/mux_h2.c:783
sw = 0x
#1  0x55cbbf6edbc7 in h2s_close (h2s=0x7f65b8b55130) at src/mux_h2.c:921
No locals.
#2  0x55cbbf6f9745 in h2s_htx_make_trailers (h2s=0x7f65b8b55130,
htx=0x7f65a9c34f20) at src/mux_h2.c:5385
list = {{n = {ptr = 0x55cbbf88d18c "", len = 0}, v = {ptr =
0x7f65a9c34f20 "�?", len = 94333580136844}}, {n = {ptr = 0x7f65d5532410 "
Oée\177", len = 94333579742112}, v = {ptr = 0x7f65a9c38e78 "�\001", len =
140074616573728}}, {n = {
  ptr = 0x55cbbf6ea203 
"\211E�\213E�\203�\002t\005\203�\005u\035H\213E�\213@\004\017��H\213E�\213@\004��\b%��\017",
len = 81604378627}, v = {ptr = 0x1fd0001 , len = 140074616589944}}, {n = {ptr =
0x7f65a9c34f20 "�?", len = 140075347420216}, v = {
  ptr = 0x55cbbf82bfc0