Re: Kernel Panic

2018-02-26 Thread Kristof Provost

On 26 Feb 2018, at 17:06, Joe Jones wrote:

Hi Kristof,

we are not updating rules during the test, although in production we 
will reload the rule set from time to time. We are constantly adding 
and removing from tables though, using the DIOCRADDADDRS and 
DIOCRDELADDRS ioctls; DIOCKILLSTATES is also being called a lot. These 
are all in response to RADIUS events. We tried using the pfctl shell 
command rather than calling the ioctls directly, to check that it wasn't 
a problem with how we are calling them.



That’s interesting.

The panic leads me to suspect something’s wrong with 
kt->pfrkt_ip4->rh, which would explain why we get the unexpected NULL 
result.
My first guess at the cause would be a race condition, where it’s 
being modified (through one of the ioctls you do) while the 
pfr_pool_get() is walking it.


I don’t immediately see where that’d happen though, because both 
DIOCRADDADDRS and DIOCRDELADDRS take the rules lock (and pfr_pool_get() 
takes it too).



It might be interesting to run this with these extra asserts (and be 
sure to enable INVARIANTS).


diff --git a/sys/netpfil/pf/pf_table.c b/sys/netpfil/pf/pf_table.c
index 18342a94073..cad9b4ea89f 100644
--- a/sys/netpfil/pf/pf_table.c
+++ b/sys/netpfil/pf/pf_table.c
@@ -962,6 +962,8 @@ pfr_unroute_kentry(struct pfr_ktable *kt, struct pfr_kentry *ke)
 	struct radix_node	*rn;
 	struct radix_head	*head = NULL;
 
+	PF_RULES_WASSERT();
+
 	if (ke->pfrke_af == AF_INET)
 		head = &kt->pfrkt_ip4->rh;
 	else if (ke->pfrke_af == AF_INET6)
@@ -1855,6 +1859,8 @@ pfr_destroy_ktable(struct pfr_ktable *kt, int flushaddr)
 {
 	struct pfr_kentryworkq	 addrq;
 
+	PF_RULES_WASSERT();
+
 	if (flushaddr) {
 		pfr_enqueue_addrs(kt, &addrq, NULL, 0);
 		pfr_clean_node_mask(kt, &addrq);

Regards,
Kristof
___
freebsd-pf@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-pf
To unsubscribe, send any mail to "freebsd-pf-unsubscr...@freebsd.org"


Re: Kernel Panic

2018-02-26 Thread Joe Jones

Hi Kristof,

we are not updating rules during the test, although in production we will 
reload the rule set from time to time. We are constantly adding and 
removing from tables though, using the DIOCRADDADDRS and DIOCRDELADDRS 
ioctls; DIOCKILLSTATES is also being called a lot. These are all in 
response to RADIUS events. We tried using the pfctl shell command rather 
than calling the ioctls directly, to check that it wasn't a problem with 
how we are calling them.


A little background. Our production system is running on 8.4 and has 
been stable for years. We are in the process of moving to 11.1 and are 
having big problems with stability when we allow customer traffic into 
the machine. At the moment we are using mirror ports on the switch to 
play live traffic into it. We're trying to work out the simplest 
configuration that causes a problem with a view to producing a good bug 
report.


I have noticed that the pfil interface 
https://www.freebsd.org/cgi/man.cgi?query=pfil&sektion=9 has locking in 
it which didn't exist in 8. I think it was introduced in 9; the locking 
functions appear in the man page in 10. I don't know if that interface 
is used directly by pf, but I'm guessing packet processing needs to be 
thread safe in a way it didn't in 8.



Regards

Joe Jones

On 25/02/18 10:56, Kristof Provost wrote:

On 14 Feb 2018, at 19:57, Joe Jones wrote:

On 14/02/18 13:09, Kristof Provost wrote:

On 14 Feb 2018, at 23:47, Joe Jones wrote:
we are running test traffic through our system, after between 1 and 
12 hours we get a kernel panic, always in the pfr_pool_get function 
in /usr/src/sys/netpfil/pf/pf_table.c line 2140. After a bit of 
investigation I confirmed that ke2 is set to null on line 2122.


It’d probably be interesting to know what the contents of uaddr/addr 
are here.
From a very quick look at the code there’s supposed to be a route 
lookup there, and I’d expect there to always be a result. The code 
certainly expects it, because that looks to be what causes the panic.




(kgdb) p *uaddr
No symbol "uaddr" in current context.

(kgdb) p *addr
$1 = {
  pfa = {
    v4 = {
      s_addr = 2016475826
    },
    v6 = {
      __u6_addr = {
        __u6_addr8 = 0xfe310d0c "��0x0\r1",
        __u6_addr16 = 0xfe310d0c,
        __u6_addr32 = 0xfe310d0c
      }
    },
    addr8 = 0xfe310d0c "��0x0\r1",
    addr16 = 0xfe310d0c,
    addr32 = 0xfe310d0c
  }
}

Interesting… That looks okay, so I have no idea why that lookup 
returned NULL.

Are you modifying tables/rules at all during this test?


Am I right in thinking that's in network order?


I believe so, yes.
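
To make the byte-order point concrete, here is a small sketch that turns 
the s_addr value from the kgdb output into a dotted quad. It assumes the 
dump came from a little-endian machine (e.g. amd64), so the lowest byte 
of the printed integer is the first byte in memory and therefore the 
first octet of a network-order address. The helper names are made up for 
illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Split an s_addr value, as printed by kgdb on a little-endian host
 * (assumption), into the four bytes in the order they sit in memory.
 * For a network-order address those bytes are the octets a.b.c.d.
 */
static void
s_addr_octets(uint32_t s_addr, unsigned char out[4])
{
	assert(out != NULL);
	for (int i = 0; i < 4; i++)
		out[i] = (s_addr >> (8 * i)) & 0xff;
}

/* Format the address as a dotted quad; returns snprintf()'s result. */
static int
s_addr_format(uint32_t s_addr, char *buf, size_t len)
{
	unsigned char o[4];

	s_addr_octets(s_addr, o);
	return (snprintf(buf, len, "%u.%u.%u.%u",
	    (unsigned)o[0], (unsigned)o[1], (unsigned)o[2], (unsigned)o[3]));
}
```

With s_addr = 2016475826 (0x7830fab2) this gives 178.250.48.120 under 
the little-endian assumption.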

Regards,
Kristof

