On Tue, 8 Dec 2020 at 14:55, Christopher Faulet <cfau...@haproxy.com> wrote:
>
> On 04/12/2020 at 21:24, Peter Statham wrote:
> > I might have spoken too soon.
> >
> > The latest release of 1.8 works flawlessly on my Debian desktop but
> > still crashes when I attempt the same configuration on a CentOS
> > virtual machine on our VMware cluster.
> >
> > I'm not sure if this is down to differences in the way memory fencing
> > or thread scheduling work on these platforms or if it is a
> > library/compiler issue. Backporting the LBPRM spinlocks from 1.9's
> > src/lb_fwlc.c seems to help but I will continue investigating and
> > hopefully rule out some of the other possibilities.
> >
>
> Hum, not good. Peter, is it the same crash or not? I didn't check very
> deeply, but I guess you backported the commit 1b87748ff5 ("BUG/MEDIUM:
> lb/threads: always properly lock LB algorithms on maintenance operations"). A
> comment in the commit message says it may be required on 1.8 if some bugs
> surface in this area.
>
> However, I'm surprised because locked functions are called for the rendez-vous
> point. It means all threads are blocked at the same point, waiting for the
> updates on servers to be performed.
>
> --
> Christopher Faulet

My apologies for replying to the wrong address, Christopher. I have
pasted the body of that email here.

> Sorry for the delay in getting back to you. It is the same crash.
> We've been trying to narrow down the exact combination of compiler,
> libraries, kernel, hypervisor, etc. that causes the issue, now that we
> know it isn't universal, but that's turning out to be trickier than
> identifying the issue.
>
> I only backported the changes to the src/lb_fwlc.c file, but
> backporting 1b87748ff5 seems to work just as well. So far we haven't
> been able to provoke the issue with the changes in 1b87748ff5 applied
> to the 1.8 tree so that does look like a solution.
>
> We will keep testing and trying to narrow the issue down.

Since I wrote the above I have managed to replicate the issue on 1.8
with that commit applied, so it looks as if that was not the solution
after all.

I include a binary built from 1.8.27 with 1b87748ff5 backported and a
core dump.

haproxy-1.8.27+1b87748ff5
<https://drive.google.com/file/d/1KPs3rBpkeqE9GEOfjF8Ocycd1wa4RjqW/view?usp=drive_web>
haproxy-1.8.27+1b87748ff5.core
<https://drive.google.com/file/d/1chBPoogHBuGlnV1o5sO9YP6BldpRH4d3/view?usp=drive_web>

--
Peter Statham
Loadbalancer.org Ltd.