Hello,

We have a somewhat curious issue and have run out of ideas ;)
We do not have a trigger to reproduce the issue, but we see, for example, IRC disconnects from users behind our firewall.

What we have:
- two HP Proliant DL360 G5 with Broadcom BCM5708 NICs, 2GB RAM, Intel Xeon E5335@2.0GHz
- OpenBSD 5.5
- trunk between the two NICs
- 13 VLAN interfaces with carp failover
- one VLAN for pfsync
- ospfd and ospf6d
- approx. 200 Mbit/s of traffic
- the initial pfsync takes quite long (~1h)

The setup looks like this (not sure if relevant):
- both servers have a failover trunk with two interfaces
- all traffic including pfsync is sent over this trunk
- the problem also occurs if we disable one box

What happens/what we tried:

The main issue is that we occasionally see broken SSH connections and quite a lot of broken IRC connections during the day. The problem seems to happen more often in the evening; however, we do not see a correlation with the amount of traffic or the number of connections.

As a first reaction we updated to the latest stable OpenBSD release, which didn't solve the issue.

Afterwards we replaced the onboard Broadcom NIC with a PCIe Intel 82576 (em driver) card. However, this card seems to cause some new issues, i.e. we see a considerable number of input (rx) errors using "netstat -i". Because we don't see such errors with the Broadcom NICs, we decided not to investigate this any further and switched back to the Broadcom setup.

Besides those steps we also disabled one of the boxes by stopping ospf and removing the carp interfaces; however, the disconnects didn't go away.

Furthermore, we checked whether any state tables are overflowing, and we didn't find any suspicious kernel messages either.

We have quite a similar setup which doesn't show those issues; however, we don't have the same amount of traffic over those systems.

I uploaded some information about the system here:
* sysctl -a http://dpaste.com/08VBA93
* pfctl (w/o rules and states) http://dpaste.com/2BBJG5P

Feel free to ask for more if needed.
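For the state-table and correlation checks described above, a minimal sketch of how two snapshots of the pf counters could be compared over time. It assumes the counter lines printed by "pfctl -si" have the shape "name  value  rate/s"; the helper names (parse_counters, counter_deltas, snapshot) are made up for illustration, and the sample numbers in the usage note are invented, not taken from our boxes.

```python
import re
import subprocess

# Assumed line shape: "  state-mismatch      42      0.0/s"
COUNTER_RE = re.compile(r"^\s*([a-z][a-z0-9 -]*?)\s+(\d+)\s+[\d.]+/s\s*$")


def parse_counters(text):
    """Return {counter name: value} for every 'name  value  rate/s' line."""
    counters = {}
    for line in text.splitlines():
        m = COUNTER_RE.match(line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters


def counter_deltas(before, after):
    """Return only the counters that grew between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


def snapshot():
    """Run 'pfctl -si' (needs root on the firewall) and parse its output."""
    out = subprocess.run(
        ["pfctl", "-si"], capture_output=True, text=True, check=True
    ).stdout
    return parse_counters(out)
```

Taking a snapshot() every few minutes and logging counter_deltas(previous, current) with a timestamp would show whether counters such as state-mismatch, congestion or memory increase around the time of the disconnects.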
Long story short: do you have any hints or ideas where we could look next? Have you ever seen such a problem in another setup? At least to me, it looks like long-lived sessions (like IRC) are somehow affected. Does this ring any bells?

I appreciate any hints and hope that I didn't miss any important information; otherwise feel free to bug me.

Thanks in advance and have a nice day!

Kind regards,
Nicolas