Hello list

I have a curious problem that I'm trying to figure out. So far my own
searches have failed me, so I'm hoping that someone can point me in the
right direction.

I have a 2-node router-firewall cluster using carp and pfsync. Both nodes
run pf and, in addition, some ipsec, relayd and bgp.
Multiple times a day, the active firewall seems to pause for a period of
3-10 seconds.
I have seen the following symptoms during the problem:
* no carp advertisements are sent out from the primary node;
* carp interfaces do not change state on the primary node;
* the router does not reply to ping requests on the interconnect interface;
* I have a one-liner running every 5 seconds to gather some metrics (and
save them locally). This data also has a gap during the pause.
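
To make the last symptom concrete, here is a minimal sketch of how such
gaps can be located in the saved data. It assumes each sample carries a
Unix timestamp; the function name find_gaps and the tolerance value are
hypothetical and not part of my actual one-liner.

```python
from typing import List, Tuple


def find_gaps(timestamps: List[float], interval: float = 5.0,
              tolerance: float = 2.0) -> List[Tuple[float, float, float]]:
    """Return (start, end, missing_seconds) for every pair of consecutive
    samples that are further apart than interval + tolerance."""
    gaps = []
    ordered = sorted(timestamps)
    for prev, cur in zip(ordered, ordered[1:]):
        delta = cur - prev
        if delta > interval + tolerance:
            # Time beyond the expected sampling interval is unaccounted for.
            gaps.append((prev, cur, delta - interval))
    return gaps


if __name__ == "__main__":
    # Hypothetical sample times: one sample arrives 9 s late.
    samples = [0, 5, 10, 24, 29]
    for start, end, missing in find_gaps(samples):
        print(f"gap between t={start} and t={end}: ~{missing:.0f}s missing")
```

With a 5-second sampling interval this can only bound the length of a
pause; delays shorter than the tolerance are not detected at all.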

The secondary firewall takes over for that time period. Depending on the
length of the pause, not all carp interfaces are taken over. After the
primary recovers, the carp interfaces fail back to the primary node.

We are using OpenBSD 7.0 amd64 (MP kernel) with all available patches on
Dell FC430 servers (14-core Xeon E5-2660, 128G memory, Intel X520 (82599)
2-port 10G NIC).
Full dmesg: https://pastebin.com/1cDq6wpk
There are multiple vlan and carp interfaces on top of the trunk interface;
the networking has this layout:
ix0/ix1 > trunk0(lacp) > vlanX > carpX

So far I have treated this as a performance issue, but I have run out of
ideas about which limit I'm hitting.

Things I have checked:
* Memory limits: all fail counters from vmstat -m are zero.
* Switching problems: during the problem, the carp advertisements are not
sent out from the primary firewall (as opposed to going missing in transit).
* Hardware fault: I have seen this on 3 different physical servers (all are
the same type and in the same FX2 enclosure, though).
* CPU and memory: both seem to have enough headroom.
* Packet rate: the firewalls forward 100k pps on average, with occasional
peaks around 160k pps. The pps peaks do not seem to correlate with the
outages.
* Network interrupts: interrupts for the ix interface queues are equally
distributed and total 60k-70k, with peaks around 130k. The peaks do not
seem to correlate with the outages.
* pf states: the number of states is in the range of 200k to 400k, well
below the configured limit of 1M.
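
One way to test the "peaks do not correlate" observations more rigorously
is to compare the load just before each outage with the overall
distribution. A minimal sketch, assuming a list of (timestamp, value)
samples such as pps or interrupt counts and a list of outage start times;
the function name, the window length and the sample numbers are all
hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def load_before_outages(samples: List[Tuple[float, float]],
                        outages: List[float],
                        window: float = 30.0) -> List[Tuple[float, float, float]]:
    """For each outage start time, return (outage, peak_value, percentile).

    peak_value is the highest sample in the `window` seconds before the
    outage; percentile is its rank among all samples (0-100)."""
    samples = sorted(samples)
    times = [t for t, _ in samples]
    values = sorted(v for _, v in samples)
    result = []
    for outage in outages:
        # Select samples with outage - window < t <= outage.
        lo = bisect_right(times, outage - window)
        hi = bisect_right(times, outage)
        recent = [v for _, v in samples[lo:hi]]
        if not recent:
            continue  # no data in the window; skip rather than guess
        peak = max(recent)
        percentile = 100.0 * bisect_right(values, peak) / len(values)
        result.append((outage, peak, percentile))
    return result


if __name__ == "__main__":
    # Hypothetical kpps samples every 5 s and one outage at t=20.
    kpps = [(0, 100), (5, 110), (10, 160), (15, 105), (20, 100)]
    print(load_before_outages(kpps, [20], window=10))
```

If the outages were load-driven, the percentiles would cluster near 100;
values spread across the whole range would support the conclusion that the
peaks are unrelated.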

If anybody has suggestions on where to look for the next clues, I would be
grateful.

Kind Regards
Joosep
