I'm pretty sure that we are seeing "the service is down", though only
briefly.  We started looking at the logs because we were seeing testing
failures and failures with our code deploys, which check the haproxy status
as part of rolling the code update to the machines.  Just to clarify, we
aren't having to restart the service manually (via "service haproxy
restart", for example).

I'm not sure I'm answering your first question.  If the above doesn't answer
it, how do I tell whether it's the "old one" or the "new one"?  I think you
mean that an haproxy process is restarting at some point during the run,
and you're asking which one.  Or do you mean the 2.0.13 process before the
update ("old one") and the 2.1.3 process after the update ("new one")?  To
clarify: we install the 2.1.3 package, and then, with no further
interaction that I know of, we get a few handfuls of segfaults in the logs.

It happens quite infrequently: 1-10 times a day.  I wondered if it might be
related to a lot of hits from a single IP address, but manually doing a
bunch of reloads in my web browser didn't trigger it.  We really have no
idea what triggers it and can't reproduce it at will.

I'm pretty sure we don't have anything that updates the ACL.  At some point
in the past we had a cron job that would query the ratelists to store off
stats, but nothing that updated them, and I don't see that job on the
system after the upgrade (we spun up a new system when we did the haproxy
update).

Thanks,
Sean

On Sat, Mar 21, 2020 at 3:33 AM Willy Tarreau <w...@1wt.eu> wrote:

> On Sat, Mar 21, 2020 at 10:08:15AM +0100, Willy Tarreau wrote:
> > On Fri, Mar 20, 2020 at 08:10:25AM -0600, Sean Reifschneider wrote:
> > > I grabbed the source from the PPA and rebuilt it, installed the dbg
> > > package, and here's one of the "bt full"s:
> >
> > Thanks!
> >
> > > (gdb) bt full
> > > #0  pattern_exec_match (head=head@entry=0x55e4dd275478,
> > > smp=smp@entry=0x7fbf9ef650c0,
> > > fill=fill@entry=0) at src/pattern.c:2541
> > >         __pl_l = <optimized out>
> > >         __pl_r = <optimized out>
> > >         list = 0x0
> > >         pat = <optimized out>
> >
> > This is very strange. The "list" field is null for the expression. That
> > doesn't make much sense in a linked list. This makes me suspect that the
> > previous element was added then freed without being unlinked and was then
> > reused and zeroed.
> >
> > I wanted to issue dev5 right now but I'll first try to figure if this is
> > reproducible and if so, how.
>
> I obviously can't reproduce it and the only line in your config making
> use of L4 rules is perfectly fine and straightforward.
>
> Thus I'm having two questions:
>   - is it the new or the old process that occasionally crashes on reload ?
>     If it's the new one, the service is down. If it's the old one, the
>     service continues and you only know about it from your logs.
>
>   - do you have anything that tries to update the "rate_whitelist" ACL
>     over the stats socket ? We could for example imagine that you're
>     maintaining a whitelist in a separate file that you're uploading
>     upon reloads.
>
> Thanks,
> Willy
>
