On Tue, 8 Dec 2020 at 14:55, Christopher Faulet <cfau...@haproxy.com> wrote:
>
> Le 04/12/2020 à 21:24, Peter Statham a écrit :
> > I might have spoken too soon.
> >
> > The latest release of 1.8 works flawlessly on my Debian desktop but
> > still crashes when I attempt the same configuration on a CentOS
> > virtual machine on our VMWare cluster.
> >
> > I'm not sure if this is down to differences in the way memory fencing
> > or thread scheduling work on these platforms or if it is a
> > library/compiler issue.  Backporting the LBPRM spinlocks from 1.9's
> > src/lb_fwlc.c seems to help but I will continue investigating and
> > hopefully rule out some of the other possibilities.
> >
>
> Hmm, not good. Peter, is it the same crash or not? I didn't check very
> deeply, but I guess you backported the commit 1b87748ff5 ("BUG/MEDIUM:
> lb/threads: always properly lock LB algorithms on maintenance
> operations"). A comment in the commit message says it may be required
> on 1.8 if some bugs surface in this area.
>
> However, I'm surprised, because the locked functions are called at the
> rendez-vous point. That means all threads are blocked at the same point,
> waiting for the updates on servers to be performed.
>
> --
> Christopher Faulet

My apologies for replying to the wrong address, Christopher.  I have pasted
the body of that email here.

> Sorry for the delay in getting back to you.  It is the same crash.
> We've been trying to narrow down the exact combination of compiler,
> libraries, kernel, hypervisor, etc. that causes the issue now that we
> know it isn't universal, but that's turning out to be trickier than
> identifying the issue.
>
> I only backported the changes to the src/lb_fwlc.c file, but
> backporting 1b87748ff5 seems to work just as well.  So far we haven't
> been able to provoke the issue with the changes in 1b87748ff5 applied
> to the 1.8 tree so that does look like a solution.
>
> We will keep testing and trying to narrow the issue down.

Since I wrote the above I have managed to replicate the issue on 1.8 with
1b87748ff5 applied, so it looks as if that was not the solution after all.

I include a binary built from 1.8.27 with 1b87748ff5 backported and a core
dump.

 haproxy-1.8.27+1b87748ff5
<https://drive.google.com/file/d/1KPs3rBpkeqE9GEOfjF8Ocycd1wa4RjqW/view?usp=drive_web>
 haproxy-1.8.27+1b87748ff5.core
<https://drive.google.com/file/d/1chBPoogHBuGlnV1o5sO9YP6BldpRH4d3/view?usp=drive_web>

-- 

Peter Statham
Loadbalancer.org Ltd.
