Hi

I posted this a week ago to "misc" with no success, so now I am posting to "pf" instead, with some additional info...
#Setup:# A redundant firewall pair (two HP DL380 G4, ciss mirror) with three dual-port gigabit em NICs (plus 2 unused bge), 6 vlans, pfsync and a 1500-line pf.conf. OpenBSD 3.8-stable (updated three weeks ago). We run the GENERIC kernel plus a backported SACK patch so that "synproxy" works correctly.

#Problem:# This redundant firewall pair just died after a couple of weeks of good work. All interfaces use carp. During the last 24 hours before the problem they had a load a constant 25-30% above average: 100-110 Mbit/s outgoing and 80-90 Mbit/s incoming. A pfstat graph shows a packet rate of no more than 15000 packets/s in either direction.

Apr 11 09:32:16 XXXXXX /bsd: WARNING: mclpool limit reached; increase kern.maxclusters

On the list we have seen people raise kern.maxclusters to over 65000 without success (the firewall just lasts longer), who later found out that they had a driver bug (xl, for example). Unfortunately I don't have "netstat -m" or "vmstat -m | grep mcl" output from before the crash, but I assume I would not have been happy with what it showed. We have now raised the value to 65000 and have not seen the problem again. But we don't know whether it will come back when the load goes up. We have also added some vmstat and netstat output below my signature, collected *after* this value was raised to 65000. It seems like the peak value reported by "netstat -m" just keeps growing. Could this be a bug?

#Question:# This problem is *hopefully* caused by high network load and therefore only needs tuning, rather than being an OS problem. "sysctl -a | grep kern.maxclusters" showed the default, kern.maxclusters=6144 (this was before we raised it to 65000). What is a reasonable value for kern.maxclusters in a situation like this? (We ask because we don't want to raise it too high, as we are also afraid of possible side effects.)

Additional info: when the servers died, the load peak described above was already over (see the link below).
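On the sizing question, a back-of-the-envelope sketch may help: each mbuf cluster is 2048 bytes (this matches the mclpl "Size" column in the vmstat output further down), so the worst-case memory a given kern.maxclusters setting can pin is simple arithmetic. The exact per-cluster kernel overhead is an assumption left out here.

```shell
# Worst-case kernel memory for a given kern.maxclusters, assuming
# 2048-byte clusters (the mclpl pool size) and ignoring pool overhead.
for maxclusters in 6144 65536; do
    echo "$maxclusters clusters -> $(( maxclusters * 2048 / 1024 / 1024 )) MB"
done
# 6144 (the default) works out to 12 MB; 65536 to 128 MB.
```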
Any good reason why they died when the load was back at the standard level? See http://www.flowsystems.se/~sjoholmp/pfstat.jpg

Thanks in advance
Per-Olov Sjöholm
--
GPG keyID: 4DB283CE
GPG fingerprint: 45E8 3D0E DE05 B714 D549 45BC CFB4 BBE9 4DB2 83CE

netstat -m:
"
<snipp>
Thu Apr 13 16:35:01 CEST 2006
1839 mbufs in use:
        1771 mbufs allocated to data
        62 mbufs allocated to packet headers
        6 mbufs allocated to socket names and addresses
1769/1800/65536 mbuf clusters in use (current/peak/max)
4072 Kbytes allocated to network (98% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

Thu Apr 13 16:40:01 CEST 2006
2495 mbufs in use:
        2360 mbufs allocated to data
        132 mbufs allocated to packet headers
        3 mbufs allocated to socket names and addresses
2359/10866/65536 mbuf clusters in use (current/peak/max)
23540 Kbytes allocated to network (22% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
<snipp>
Sun Apr 16 19:45:01 CEST 2006
1349 mbufs in use:
        1289 mbufs allocated to data
        57 mbufs allocated to packet headers
        3 mbufs allocated to socket names and addresses
1288/10866/65536 mbuf clusters in use (current/peak/max)
22104 Kbytes allocated to network (13% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

Sun Apr 16 19:50:01 CEST 2006
24199 mbufs in use:
        24072 mbufs allocated to data
        123 mbufs allocated to packet headers
        4 mbufs allocated to socket names and addresses
24071/26176/65536 mbuf clusters in use (current/peak/max)
58688 Kbytes allocated to network (20% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines

Sun Apr 16 19:55:01 CEST 2006
1397 mbufs in use:
        1300 mbufs allocated to data
        93 mbufs allocated to packet headers
        4 mbufs allocated to socket names and addresses
1302/28222/65536 mbuf clusters in use (current/peak/max)
56936 Kbytes allocated to network (5% in use)
0 requests for memory denied
0 requests for memory delayed
0 calls to protocol drain routines
<snipp>
"

vmstat:
"
Thu Apr 13 13:05:01 CEST 2006
Name   Size  Requests   Fail  Releases   Pgreq  Pgrel  Npage  Hiwat  Minpg  Maxpg  Idle
mclpl  2048  169393794  0     169392503  807    0      807    807    4      32768  159
<snipp>
Tue Apr 18 09:55:01 CEST 2006
Name   Size  Requests    Fail  Releases    Pgreq  Pgrel  Npage  Hiwat  Minpg  Maxpg  Idle
mclpl  2048  2757353733  0     2757352434  14111  0     14111  14111  4      32768  13459
"
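P.S. The samples above were apparently collected every five minutes with "netstat -m" and "vmstat -m". To line the cluster numbers up against the pfstat graphs, a small awk helper (just a sketch, not part of any standard tool) can pull the current/peak/max columns out of such a log:

```shell
# Sketch: extract current/peak/max mbuf clusters from "netstat -m"
# output (or a log of it). The first field looks like "1302/28222/65536".
parse_clusters() {
    awk '/mbuf clusters in use/ { split($1, f, "/"); print f[1], f[2], f[3] }'
}

# Feeding it one line from the Sun Apr 16 19:55 sample above:
echo '1302/28222/65536 mbuf clusters in use (current/peak/max)' | parse_clusters
# prints: 1302 28222 65536
```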