Hello list,

I have a curious problem that I am trying to figure out. So far my own searches have failed me, and I am hoping that someone can point me in the right direction.

I have a 2-node router-firewall cluster using carp and pfsync. Both nodes run pf and, in addition, some ipsec, relayd and bgp. Multiple times a day, the active firewall seems to pause for 3-10 seconds. I have seen the following symptoms during the problem:

* no carp advertisements are sent out from the primary node;
* carp interfaces do not change state on the primary node;
* the router does not reply to ping requests on the interconnect interface;
* I have a one-liner running every 5 seconds to gather some metrics (and save them locally). This data also has a gap during the pause.

The secondary firewall takes over for that period. Depending on the length of the pause, not all carp interfaces are always taken over. After the primary recovers, the carp interfaces fail back to the primary node.

We are using OpenBSD 7.0 amd64 (MP kernel) with all available patches on Dell FC430 servers (14-core Xeon E5-2660, 128G memory, Intel X520 (82599) 2-port 10G NIC). Full dmesg: https://pastebin.com/1cDq6wpk

There are multiple vlan and carp interfaces on top of the trunk interface. The networking has this layout:

    ix0/ix1 > trunk0(lacp) > vlanX > carpX

So far I have considered this a performance issue, but I have run out of ideas about which limit I am hitting. Things I have checked:

* Memory limits - all fail counters from vmstat -m are zero;
* Switching problems - during the problem, the carp advertisements are not sent out from the primary firewall (as opposed to going missing in transit);
* Hardware fault - I have seen this on 3 different physical servers (all are the same type and in the same FX2 enclosure, though);
* CPU and memory seem to have enough headroom;
* the firewalls forward 100k pps on average, with occasional peaks around 160k pps - the pps peaks do not seem to correlate with these outages;
* network interrupts for the ix interface queues are equally distributed and total 60k-70k, with peaks around 130k - the peaks do not seem to correlate with the outages;
* the number of pf states is in the range of 200k to 400k, well below the configured limit of 1M.

If anybody has any suggestions on where to look for the next clues, I would be grateful.

Kind regards,
Joosep
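
P.S. The 5-second metrics sampling mentioned above is too coarse to tell how long each pause really lasts. Below is a minimal Python sketch of how the gaps could be measured more precisely. It is only an illustration, not what currently runs on the firewalls. The names find_gaps and watch, the 0.1 s sleep step and the 1 s threshold are all made up for this example, and it assumes that userland is stalled for the whole pause, which the gap in the metrics suggests but does not prove.

```python
import time


def find_gaps(timestamps, expected_interval, tolerance=0.5):
    """Return (start, end, duration) for every pair of consecutive samples
    that are further apart than expected_interval + tolerance seconds."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta > expected_interval + tolerance:
            gaps.append((prev, cur, delta))
    return gaps


def watch(interval=0.1, threshold=1.0):
    """Sleep in short steps and print a line whenever one step took longer
    than `threshold` seconds, i.e. the process was not scheduled in time."""
    last = time.monotonic()
    while True:
        time.sleep(interval)
        now = time.monotonic()
        if now - last > threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} stall of {now - last:.2f}s", flush=True)
        last = now


if __name__ == "__main__":
    # Hypothetical usage: leave this running on the primary node and compare
    # the reported stalls with the times of the carp failovers.
    watch()
```

find_gaps can be run over the timestamps already saved by the existing one-liner (with expected_interval=5), while watch would give sub-second resolution for future occurrences.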