Re: Panic with pf 'once' rule

2019-07-07 Thread Mike Belopuhov


Aaron Bieber writes:

> Hi,
>
> Adding a rule similar to the below causes a panic on -current (OpenBSD
> 6.5-current (GENERIC) #95: Thu Jul  4 21:22:25 MDT 2019). This also panics 6.3
> and 6.5 (I didn't test 6.4):
>
>   pass in quick on egress proto tcp from any to port  once rdr-to \
>   127.0.0.1 port 
>
> Once the rule is in place, fire up:
>
>   nc -l 127.0.0.1 
>
> Connect a few times from a remote machine:
>   
>   nc  
>
> Eventually it will panic with (sometimes it happens right away, other times I
> have to restart nc a few times):
>

This is because 'once' is meant to be used inside an anchor
(the rule is removed once it has matched).

The most sensible way to use it is to put the rule into a
sub-anchor of a recursively evaluated anchor (e.g. 'relayd/*').

It's possible that the check protecting the system from misuse
like you've described here got lost during refactoring, or it
never existed in the first place :-(
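For reference, that anchor-based usage might look like the sketch below. Only the 'relayd/*' pattern comes from the message; the sub-anchor name "demo" and the port numbers are placeholders, since the thread elides the real ports.

```
# pf.conf: evaluate every sub-anchor of "relayd" recursively
anchor "relayd/*"

# A 'once' rule can then be loaded into a sub-anchor at runtime;
# after its first match the rule is removed from the sub-anchor
# without disturbing the main ruleset.  For example:
#
#   echo "pass in quick proto tcp to port 8080 once rdr-to 127.0.0.1 port 8081" \
#       | pfctl -a relayd/demo -f -
```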

Cheers,
Mike

>   ddb> trace
>   pf_rm_rule(81d900a8,803bbfe8) at pf_rm_rule+0xa9
>   pf_purge_rule(803bbfe8) at pf_purge_rule+0x26
>   pf_purge(81dc1088) at pf_purge+0x55
>   taskq_thread(80022040) at taskq_thread+0x3d
>   end trace frame: 0x0, count: -4
>   ddb>
>   ddb> ps
>  PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
>69502   12189  1  0  30x100083  ttyin ksh
>53972  340673  1  0  30x100098  poll  cron
>81827  279222  1110  30x100090  poll  sndiod
>54852   68160  1 99  30x100090  poll  sndiod
>79474   94554   3215 95  30x100092  kqreadsmtpd
>90212  164878   3215103  30x100092  kqreadsmtpd
>43199  482512   3215 95  30x100092  kqreadsmtpd
>38765  100663   3215 95  30x100092  kqreadsmtpd
>33241  424770   3215 95  30x100092  kqreadsmtpd
> 5338  193750   3215 95  30x100092  kqreadsmtpd
> 3215  481909  1  0  30x100080  kqreadsmtpd
>57742  143403  1  0  30x80  selectsshd
>31904  460143  1  0  30x100080  poll  ntpd
>65592  182120  47006 83  30x100092  poll  ntpd
>47006  103509  1 83  30x100092  poll  ntpd
>60875  292765  99617 74  30x100092  bpf   pflogd
>99617  524148  1  0  30x80  netio pflogd
> 4242  324170  49064 73  30x100090  kqreadsyslogd
>49064  413359  1  0  30x100082  netio syslogd
>20955  102995  68995115  30x100092  kqreadslaacd
>99883  518930  68995115  30x100092  kqreadslaacd
>68995  175540  1  0  30x100080  kqreadslaacd
> 5278  238159  0  0  3 0x14200  pgzerozerothread
> 2253  479921  0  0  3 0x14200  aiodoned  aiodoned
>98149  310276  0  0  3 0x14200  syncerupdate
>78055  259911  0  0  3 0x14200  cleaner   cleaner
>68827  324781  0  0  3 0x14200  reaperreaper
>93269   98863  0  0  3 0x14200  pgdaemon  pagedaemon
>75284  447451  0  0  3 0x14200  bored crynlk
>34868  513191  0  0  3 0x14200  bored crypto
>   *18776  255193  0  0  7 0x14200softnet
>64918  469356  0  0  3 0x14200  bored systqmp
>  902   49537  0  0  3 0x14200  bored systq
>17250  200730  0  0  3  0x40014200  bored softclock
> 2990  510299  0  0  3  0x40014200idle0
>  947  215447  0  0  3 0x14200  bored smr
>1  180680  0  0  30x82  wait  init
>0   0 -1  0  3 0x10200  scheduler swapper
>   ddb> 
>
> dmesg (from a VM in vmm - I have also reproduced this on physical hw):
> OpenBSD 6.5-current (GENERIC) #95: Thu Jul  4 21:22:25 MDT 2019
> dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC
> real mem = 4278181888 (4079MB)
> avail mem = 4138524672 (3946MB)
> mpath0 at root
> scsibus0 at mpath0: 256 targets
> mainbus0 at root
> bios0 at mainbus0: SMBIOS rev. 2.4 @ 0xf3f10 (12 entries)
> bios0: vendor SeaBIOS version "1.11.0p2-OpenBSD-vmm" date 01/01/2011
> bios0: OpenBSD VMM
> acpi at bios0 not configured
> cpu0 at mainbus0: (uniprocessor)
> cpu0: AMD Ryzen 7 PRO 2700U w/ Radeon Vega Mobile Gfx, 37466.79 MHz, 17-11-00
> cpu0: 
> FPU,VME,DE,PSE,TSC,MSR,PAE,CX8,SEP,PGE,CMOV,PAT,PSE36,CFLUSH,MMX,FXSR,SSE,SSE2,SSE3,PCLMUL,SSSE3,FMA3,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AES,XSAVE,AVX,F16C,RDRAND,HV,NXE,MMXX,FFXSR,PAGE1GB,LONG,LAHF,CMPLEG,EAPICSP,AMCR8,ABM,SSE4A,MASSE,3DNOWP,OSVW,SKINIT,TCE,TOPEXT,CPCTR,DBKP,PCTRL3,ITSC,FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SH

Re: Panic with pf 'once' rule

2019-07-09 Thread Aaron Bieber
I get the panic when the rule is in an anchor as well.

PGP: 0x1F81112D62A9ADCE / 3586 3350 BFEA C101 DB1A 4AF0 1F81 112D 62A9 ADCE

>> On Jul 7, 2019, at 4:39 PM, Mike Belopuhov  wrote:
> 
> This is because 'once' is meant to be used inside an anchor
> (the rule is removed once it has matched).
> 
> The most sensible way to use it is to put the rule into a
> sub-anchor of a recursively evaluated anchor (e.g. 'relayd/*').
> 
> It's possible that the check protecting the system from misuse
> like you've described here got lost during refactoring, or it
> never existed in the first place :-(
> 
> Cheers,
> Mike

Re: Panic with pf 'once' rule

2019-07-12 Thread Lawrence Teo
On Tue, Jul 09, 2019 at 03:35:26PM -0600, Aaron Bieber wrote:
> I get the panic when the rule is in an anchor as well.
> 
> PGP: 0x1F81112D62A9ADCE / 3586 3350 BFEA C101 DB1A 4AF0 1F81 112D 62A9 ADCE
> 
> >> On Jul 7, 2019, at 4:39 PM, Mike Belopuhov  wrote:
> > 
> > This is because 'once' is meant to be used inside an anchor
> > (the rule is removed once it has matched).
> > 
> > The most sensible way to use it is to put the rule into a
> > sub-anchor of a recursively evaluated anchor (e.g. 'relayd/*').
> > 
> > It's possible that the check protecting the system from misuse
> > like you've described here got lost during refactoring, or it
> > never existed in the first place :-(
> > 
> > Cheers,
> > Mike

[cc'ing sashan@ because this relates to pf.c r1.983]

I can replicate the crash as well.  The crash only happens if more than
one connection that matches the once rule is made in a short period of
time.  abieber@ triggered this when he ran "nc  " a few times.
After going through the code, I think I found the cause and wrote a
possible fix.

As of pf.c r1.983, once rules are removed by the purge thread instead of
immediately when a packet matches the rule.  The latest code is in
pf_test_rule():

if (r->rule_flag & PFRULE_ONCE) {
r->rule_flag |= PFRULE_EXPIRED;
r->exptime = time_second;
SLIST_INSERT_HEAD(&pf_rule_gcl, r, gcle);
}

When a once rule is matched by a packet for the first time, the rule is
marked as expired and added to pf_rule_gcl.  Eventually the purge thread
is run, which calls pf_purge_expired_rules() to purge the expired rules
in pf_rule_gcl:

while ((r = SLIST_FIRST(&pf_rule_gcl)) != NULL) {
SLIST_REMOVE(&pf_rule_gcl, r, pf_rule, gcle);
KASSERT(r->rule_flag & PFRULE_EXPIRED);
pf_purge_rule(r);
}

Since the purge thread only runs periodically, there is a short time
window where a once rule is expired but not yet purged.  If a second new
connection that matches the once rule is made in that time window, the
pf_test_rule() code block is executed again, causing the same once rule
to be added to pf_rule_gcl again.

As a result, when the purge thread finally runs, it tries to remove the
once rule multiple times, resulting in a crash, as seen on the
backtrace:

pf_rm_rule(81d900a8,803bbfe8) at pf_rm_rule+0xa9
pf_purge_rule(803bbfe8) at pf_purge_rule+0x26
pf_purge(81dc1088) at pf_purge+0x55
taskq_thread(80022040) at taskq_thread+0x3d

There is also a related bug where pf_test_rule() still creates state for
new connections that match the expired once rule in that time window,
even though the once rule has already expired.  In other words, the once
rule is not truly a one shot rule in that expired-but-not-purged time
window.

To fix both bugs, I wrote this diff that adds a check in pf_test_rule()
to prevent expired once rules from being added to pf_rule_gcl.  The
check is added "early" in pf_test_rule() to prevent any further
connections from creating state if they match the expired once rule.

I have sent this diff to abieber@ who confirmed that it fixes the crash
for him.

Thoughts? ok?


Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1084
diff -u -p -r1.1084 pf.c
--- pf.c9 Jul 2019 11:30:19 -   1.1084
+++ pf.c12 Jul 2019 02:54:33 -
@@ -3862,6 +3862,13 @@ pf_test_rule(struct pf_pdesc *pd, struct
if (r->action == PF_DROP)
goto cleanup;
 
+   /*
+* If an expired "once" rule has not been purged, drop any new matching
+* packets to prevent the rule from being added to pf_rule_gcl again.
+*/
+   if (r->rule_flag & PFRULE_EXPIRED)
+   goto cleanup;
+
pf_tag_packet(pd->m, ctx.tag, ctx.act.rtableid);
if (ctx.act.rtableid >= 0 &&
rtable_l2(ctx.act.rtableid) != pd->rdomain)



Re: Panic with pf 'once' rule

2019-07-15 Thread Alexandr Nedvedicky
Hello,


> 
> To fix both bugs, I wrote this diff that adds a check in pf_test_rule()
> to prevent expired once rules from being added to pf_rule_gcl.  The
> check is added "early" in pf_test_rule() to prevent any further
> connections from creating state if they match the expired once rule.
> 
> I have sent this diff to abieber@ who confirmed that it fixes the crash
> for him.
> 
> Thoughts? ok?
> 

thanks for the detailed analysis; it is correct.
> 
> Index: pf.c
> ===
> RCS file: /cvs/src/sys/net/pf.c,v
> retrieving revision 1.1084
> diff -u -p -r1.1084 pf.c
> --- pf.c  9 Jul 2019 11:30:19 -   1.1084
> +++ pf.c  12 Jul 2019 02:54:33 -
> @@ -3862,6 +3862,13 @@ pf_test_rule(struct pf_pdesc *pd, struct
>   if (r->action == PF_DROP)
>   goto cleanup;
>  
> + /*
> +  * If an expired "once" rule has not been purged, drop any new matching
> +  * packets to prevent the rule from being added to pf_rule_gcl again.
> +  */
> + if (r->rule_flag & PFRULE_EXPIRED)
> + goto cleanup;
> +
>   pf_tag_packet(pd->m, ctx.tag, ctx.act.rtableid);
>   if (ctx.act.rtableid >= 0 &&
>   rtable_l2(ctx.act.rtableid) != pd->rdomain)

your change looks good and makes sense. However I would ask for one more
tweak to your fix:

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 9e454e5c941..df75d7e4acf 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -3860,6 +3860,13 @@ pf_test_rule(struct pf_pdesc *pd, struct pf_rule **rm, 
struct pf_state **sm,
if (r->action == PF_DROP)
goto cleanup;
 
+   /*
+* If an expired "once" rule has not been purged, drop any new matching
+* packets.
+*/
+   if (r->rule_flag & PFRULE_EXPIRED)
+   goto cleanup;
+
pf_tag_packet(pd->m, ctx.tag, ctx.act.rtableid);
if (ctx.act.rtableid >= 0 &&
rtable_l2(ctx.act.rtableid) != pd->rdomain)
@@ -3945,9 +3952,18 @@ pf_test_rule(struct pf_pdesc *pd, struct pf_rule **rm, 
struct pf_state **sm,
 #endif /* NPFSYNC > 0 */
 
if (r->rule_flag & PFRULE_ONCE) {
-   r->rule_flag |= PFRULE_EXPIRED;
-   r->exptime = time_second;
-   SLIST_INSERT_HEAD(&pf_rule_gcl, r, gcle);
+   u_int32_t   rule_flag;
+
+   /*
+* Use atomic_cas() to determine a clear winner, which will
+* insert an expired rule to gcl.
+*/
+   rule_flag = r->rule_flag;
+   if (atomic_cas_uint(&r->rule_flag, rule_flag,
+   rule_flag | PFRULE_EXPIRED) == rule_flag) {
+   r->exptime = time_second;
+   SLIST_INSERT_HEAD(&pf_rule_gcl, r, gcle);
+   }
}
 
return (action);
8<---8<---8<--8<

Once there are more PF tasks running in parallel, we would be able
to hit a crash similar to the one you are fixing now. The rules are
covered by a read lock, so with more PF tasks there might be two
packets racing to expire the same once rule at the same time. Using
atomic_cas() is a sufficient measure to serialize the competing packets.

you have my OK with/without the suggested tweak.

thanks and
regards
sashan



Re: Panic with pf 'once' rule

2019-07-17 Thread Lawrence Teo
On Mon, Jul 15, 2019 at 11:06:50AM +0200, Alexandr Nedvedicky wrote:
> your change looks good and makes sense. However I would ask for one more
> tweak to your fix:
> 
> Once there are more PF tasks running in parallel, we would be able
> to hit a crash similar to the one you are fixing now. The rules are
> covered by a read lock, so with more PF tasks there might be two
> packets racing to expire the same once rule at the same time. Using
> atomic_cas() is a sufficient measure to serialize the competing packets.
> 
> you have my OK with/without the suggested tweak.

Thank you!  I have committed the fix with your tweak.

Lawrence