Hi,

We've deployed ipfilter 4.1.13 with pfil 2.1.9 across a set of
multi-homed servers running Solaris 9 (kernel patch 118558-28), with
the goal of resolving various management/service issues by means of
policy routing, courtesy of "fastroute".
Static routes ensure that locally generated traffic to certain hosts
goes over the management interface, and ipfilter fastroute
policy-routing ensures that all other connections are responded to via
the interface they arrived over.
We have this working beautifully (even with partial ipmp support - see
my other recent message) using the following configuration (pfil 2.1.9):

# uname -a
SunOS cisdb3 5.9 Generic_118558-28 sun4u sparc SUNW,Netra-T12

# ndd -get /dev/pfil qif_status
ifname ill q OTHERQ ipmp num sap hl nr nw bad copy copyfail drop notip nodata notdata
mp0 0x0 0x0 0x0 0x0 0 800 0 0 0 0 0 0 0 0 0 0
QIF5 0x0 0x300017e6cf0 0x300017e6de0 0x0 5 806 0 50006 2338 0 0 0 0 0 0 0
bge2 0x3000169d938 0x300016aef68 0x300016af058 0x0 4 800 14 606050 87682 0 0 0 0 0 0 0
QIF3 0x0 0x300017e7c50 0x300017e7d40 0x0 3 806 0 1160757 1613 0 0 0 0 0 0 0
bge1 0x30000074cb0 0x30001c12530 0x30001c12620 0x3000174b230 2 800 14 672135 672135 0 0 0 0 0 0 0
QIF1 0x0 0x30001743498 0x30001743588 0x0 1 806 0 1160754 6836 0 0 0 0 0 0 0
bge0 0x30000074a30 0x30001c12f70 0x30001c13060 0x3000174b230 0 800 14 921640688 446885113 0 0 0 0 0 0 0


# ndd -get /dev/pfil qif_ipmp_status
ifname members
mp0 bge0,bge1


# cat ipf.conf
pass in  all head 1
pass out all head 2
  pass out log quick on mp0 to bge2:A.B.218.31 from A.B.218.0/24 to any group 2
  pass out log quick on bge2 to mp0:10.0.30.250 from 10.0.30.0/24 to any group 2


# ipfstat -h -o
259343167 pass out all head 200
32314 pass out log quick on mp0 to bge2:161.73.218.31 from 161.73.218.0/24 to any group 200
0 pass out log quick on bge2 to mp0:10.0.30.250 from 10.0.30.0/24 to any group 200


However, we are running into real problems with ARP/route cache timeouts
on the non-default interface, which lead to excessive numbers of
"Fastroute failures", causing both active and idle ssh sessions to die.

# ipfstat
bad packets:            in 0    out 0
 IPv6 packets:          in 0 out 0
 input packets:         blocked 0 passed 250226616 nomatch 0 counted 0 short 0
 output packets:        blocked 0 passed 341997427 nomatch 0 counted 0 short 0
 input packets logged:  blocked 0 passed 82892
 output packets logged: blocked 0 passed 238008
 packets logged:        input 0 output 0
 log failures:          input 43277 output 47770
fragment state(in):     kept 0  lost 0  not fragmented 0
fragment state(out):    kept 0  lost 0  not fragmented 0
packet state(in):       kept 0  lost 0
packet state(out):      kept 7  lost 6918
ICMP replies:   0       TCP RSTs sent:  0
Invalid source(in):     0
Result cache hits(in):  171982217       (out):  143118437
IN Pullups succeeded:   326     failed: 0
OUT Pullups succeeded:  532609  failed: 0
Fastroute successes:    136053  failures:       36751
TCP cksum fails(in):    0       (out):  0
IPF Ticks:      4667479
Packet log flags set: (0x20000000)
        packets blocked by filter


Everything appears to work perfectly so long as the gateway, A.B.218.31,
resides in the route cache (`ndd -get /dev/ip ipv4_ire_status`) with a
type of "CACHE". However, that entry times out (with the period
determined by `ndd -get /dev/ip ip_ire_arp_interval`?), at which point
the ipfstat(8) "Fastroute failures" counter starts incrementing rapidly:
both active and idle sessions are dropped, and connectivity via the
non-default route is lost. It's then necessary either to wait a
(seemingly) random amount of time or to ping the gateway manually from
the afflicted host in order to reinstate the route cache entry.
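For now the only stopgap we have is that manual ping. Scripted, it might
look something like the sketch below - GW and INTERVAL are placeholder
values, the helper just greps the `ipv4_ire_status` dump for a line
holding both the gateway address and the literal type "CACHE", and the
loop itself is only a sketch, not something from the production boxes:

```shell
#!/bin/sh
# Hedged workaround sketch, not a fix: poll the IRE cache and re-prime
# it with a ping when the CACHE entry for the gateway disappears.
GW=A.B.218.31      # placeholder, as in the rules above
INTERVAL=30        # seconds between checks (tune to taste)

# Succeed if $1 (an ipv4_ire_status dump) still holds a CACHE entry for $GW.
ire_cached() {
    echo "$1" | grep "$GW" | grep CACHE >/dev/null
}

# Guarded so the helper above can be exercised without starting the loop.
if [ "$1" = "run" ]; then
    while :; do
        ire_cached "`ndd -get /dev/ip ipv4_ire_status`" ||
            ping "$GW" >/dev/null 2>&1   # re-create the route cache entry
        sleep "$INTERVAL"
    done
fi
```

Obviously this only papers over the timeout rather than explaining it.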
Searching the MARC archives, I came across a suggestion from Darren Reed
indicating that the problem was related to ARP cache timeouts and should
be resolved by adding a static ARP entry (`arp -s A.B.218.31
00:01:02:03:04:05`) for the (non-default) gateway - but that doesn't
seem to help in this case. Likewise, adding a static route (`route add
-host A.B.218.31 A.B.218.13 -interface`) has no effect, either on its
own or combined with the static ARP entry - the timeouts still occur in
exactly the same fashion.
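For what it's worth, we did sanity-check that the static ARP entry is
really installed. A helper along these lines (hypothetical names; it
assumes the Solaris net-to-media table layout from `netstat -pn`, where,
if I'm reading the man page right, static entries carry an "S" in the
Flags column) confirms the entry survives:

```shell
# Hypothetical sanity-check helper (names and column layout are
# assumptions): verify the static ARP entry for the gateway is really
# installed, by looking for the "S" flag in the net-to-media table.
GW=A.B.218.31      # placeholder, as in the rules above

# $1: output of `netstat -pn`; succeed if $GW is listed with an S flag.
# Columns assumed: Device, IP Address, Mask, Flags, Phys Addr.
arp_is_static() {
    echo "$1" | awk -v gw="$GW" \
        '$2 == gw && $4 ~ /S/ { found = 1 } END { exit !found }'
}

# Intended use (not run here):
#   arp_is_static "`netstat -pn`" || echo "static ARP entry for $GW missing"
```

The entry does show as static for us, so the timeout presumably lives in
the IRE cache rather than in ARP itself.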

Further Googling suggested that major work was under way in Solaris 10
to rework the networking stack and simplify the pfil/ipf integration.
Looking at the changes made in pfil 2.1.11, it appeared that a fix might
be included; however, upon upgrading to the new version and doing a
`reboot -- -r`, we've run into serious stability problems.

Variously, the following is logged to /var/adm/messages:
  gld: [ID 589725 kern.warning] WARNING: gld_start: rejected outbound
packet, size 10, max 1514
  last message repeated 1384190 times

...and the following message floods the console:

Message overflow on /dev/log minor #6 -- is syslogd(1M) running?

Each burst repeats a few million times, during which network
connectivity is lost; eventually the messages stop and connectivity is
restored. The cycle repeats indefinitely at seemingly random intervals -
the first burst can come anywhere from one second after boot to a few
hours in. However, while the machine is "stable", the ARP cache timeout
issue does *seem* to be gone.

Any ideas what change in pfil 2.1.11 could have led to this instability?
Suggestions for a fix, a backport of the timeout fix to 2.1.9, or a new
2.1.12 release would all be welcome.

Additional details available on request.

NB: whether pfil ipmp sets are used or not makes no difference.

Regards,
Robin
-- 
Robin Breathe, Computer Services, Oxford Brookes University, Oxford, UK
[EMAIL PROTECTED]       Tel: +44 1865 483685  Fax: +44 1865 483073



