Re: relayd panic

2022-06-05 Thread Alexandr Nedvedicky
Hello,

I'll commit the one-liner diff on Tuesday morning (Jun 6th) Prague time
without an explicit OK, unless there are objections.

regards
sashan


On Sun, Jun 05, 2022 at 09:44:45AM +0100, Stuart Henderson wrote:
> I don't know this code well enough to give a meaningful OK, but this
> should probably get committed.
> 
> 
> On 2022/06/01 09:16, Alexandr Nedvedicky wrote:
> > Hello,
> > 
> > 
> > > r420-1# rcctl -f start relayd
> > > relayd(ok)
> > > r420-1# uvm_fault(0xfd862f82f990, 0x0, 0, 1) -> e
> > > kernel: page fault trap, code=0
> > > Stopped at  pf_find_or_create_ruleset+0x1c: movb    0(%rdi),%al
> > > TID     PID     UID     PRFLAGS     PFLAGS  CPU  COMMAND
> > >  431388  19003  0 0x2  05  relayd
> > >  174608  32253 89   0x112  02  relayd
> > >  395415  12468  0 0x2  04  relayd
> > >  493579  11904  0 0x2  03  relayd
> > > *101082  14967 89   0x1100012  00K relayd
> > > pf_find_or_create_ruleset(0) at pf_find_or_create_ruleset+0x1c
> > > pfr_add_tables(832d7cca800,1,80eaf43c,1000) at pfr_add_tables+0x6ae
> > > pfioctl(4900,c450443d,80eaf000,3,80002272e7f0) at pfioctl+0x1d9f
> > > VOP_IOCTL(fd8551f82dd0,c450443d,80eaf000,3,fd862f7d60c0,80002272e7f0) at VOP_IOCTL+0x5c
> > > vn_ioctl(fd855ecec1e8,c450443d,80eaf000,80002272e7f0) at vn_ioctl+0x75
> > > sys_ioctl(80002272e7f0,8000227d9980,8000227d99d0) at sys_ioctl+0x2c4
> > > syscall(8000227d9a40) at syscall+0x374
> > > Xsyscall() at Xsyscall+0x128
> > > end of kernel
> > 
> > it looks like we are dying here at line 239 due to a NULL pointer dereference:
> > 
> > 232 struct pf_ruleset *
> > 233 pf_find_or_create_ruleset(const char *path)
> > 234 {
> > 235 char*p, *aname, *r;
> > 236 struct pf_ruleset   *ruleset;
> > 237 struct pf_anchor*anchor;
> > 238 
> > 239 if (path[0] == 0)
> > 240 return (&pf_main_ruleset);
> > 241 
> > 242 while (*path == '/')
> > 243 path++;
> > 244 
> > 
> > I've followed the same reproduction steps to check that the
> > diff below resolves the issue. The bug was introduced by
> > my recent change to pf_table.c [1] from May 10th:
> > 
> > Modified files:
> > sys/net: pf_ioctl.c pf_table.c 
> > 
> > Log message:
> > move memory allocations in pfr_add_tables() out of
> > NET_LOCK()/PF_LOCK() scope. bluhm@ helped a lot
> > to put this diff into shape.
> > 
> > besides using a regression test I've also done simple testing
> > using a 'load anchor':
> > 
> > netlock# cat /tmp/anchor.conf
> > load anchor "test" from "/tmp/pf.conf"
> > netlock#
> > netlock# cat /tmp/pf.conf
> > table <try> { 192.168.1.1 }
> > pass from <try>
> > netlock#
> > netlock# pfctl -sA
> >   test
> > netlock# pfctl -a test -sT
> > try
> > netlock# pfctl -a test -t try -T show
> >192.168.1.1
> > 
> > OK to commit fix below?
> > 
> > thanks and
> > regards
> > sashan
> > 
> > [1] https://marc.info/?l=openbsd-cvs&m=16522243003&w=2
> > 
> > 8<---8<---8<--8<
> > diff --git a/sys/net/pf_table.c b/sys/net/pf_table.c
> > index 8315ea5dd3a..dfc49de5efe 100644
> > --- a/sys/net/pf_table.c
> > +++ b/sys/net/pf_table.c
> > @@ -1628,8 +1628,7 @@ pfr_add_tables(struct pfr_table *tbl, int size, int *nadd, int flags)
> > if (r != NULL)
> > continue;
> >  
> > -   q->pfrkt_rs = pf_find_or_create_ruleset(
> > -   q->pfrkt_root->pfrkt_anchor);
> > +   q->pfrkt_rs = pf_find_or_create_ruleset(q->pfrkt_anchor);
> > /*
> >  * root tables are attached to main ruleset,
> >  * because ->pfrkt_anchor[0] == '\0'
> > 



pfctl reports back a table as being always created/added

2022-06-14 Thread Alexandr Nedvedicky
Hello,

Jason McIntyre (jmc@) reported a bug to me by email off-list earlier
today. Let me quote from Jason's email:

i already have a pf table, adding an address tells me i have created a
table. even though the table already existed:

# pfctl -tbrutes -Tshow | wc
89  89  501
# pfctl -tbrutes -Tadd 1.1.1.1
1 table created.
1/1 addresses added.

The bug has been introduced by my commit to pf_table.c:
$OpenBSD: pf_ioctl.c,v 1.381 2022/05/10 23:12:25 sashan Exp $

pfr_add_tables() currently always increments the 'xadd' counter, while it
should increment 'xadd' if and only if we are actually going to create a
table. The diff below fixes that.

OK to commit?


thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf_table.c b/sys/net/pf_table.c
index f261baef963..6c47e11f604 100644
--- a/sys/net/pf_table.c
+++ b/sys/net/pf_table.c
@@ -1562,13 +1562,13 @@ pfr_add_tables(struct pfr_table *tbl, int size, int *nadd, int flags)
if (p == NULL) {
SLIST_REMOVE(, n, pfr_ktable, pfrkt_workq);
SLIST_INSERT_HEAD(, n, pfrkt_workq);
+   xadd++;
} else if (!(flags & PFR_FLAG_DUMMY) &&
!(p->pfrkt_flags & PFR_TFLAG_ACTIVE)) {
p->pfrkt_nflags = (p->pfrkt_flags &
~PFR_TFLAG_USRMASK) | key.pfrkt_flags;
SLIST_INSERT_HEAD(, p, pfrkt_workq);
}
-   xadd++;
}
 
if (!(flags & PFR_FLAG_DUMMY)) {



Re: unlock pf_purge

2022-06-07 Thread Alexandr Nedvedicky
Hello,

I like this change and I think this should go in as-is.
This is OK sashan@

I also think we should revisit a pf.conf(5) manpage where
'interval' (PFTM_INTERVAL) is described. It currently
reads as follows:

 set timeout variable value
 frag   Seconds before an unassembled fragment is expired (60
by default).
 interval   Interval between purging expired states and fragments
(10 seconds by default).
 src.track  Length of time to retain a source tracking entry after
the last state expires (0 by default, which means
there is no global limit.  The value is defined by the
rule which creates the state.).

pf(4) runs the state expiration timer every second regardless of how
the interval value is set. pf(4) uses/abuses the interval value
(PFTM_INTERVAL) to determine the fraction of states to be processed
by each run of the 1-second timer.

I prefer to tune manpage in separate commit.

thanks and
OK sashan

On Tue, Jun 07, 2022 at 04:58:24PM +1000, David Gwynne wrote:
> the main change here is to move pf_purge out from under the kernel lock.
> 
> another part of the change is to limit the amount of work the state
> purging does to avoid hogging a cpu too much, and to also avoid holding
> NET_LOCK for too long.
> 
> the main part is achieved by using systqmp to run the purge tasks
> instead of systq. from what i can tell, all the data structures we're
> walking over are protected by pf locks and/or the net lock. the pf
> state list in particular was recently reworked so iteration over
> the list can be done without interfering with insertions. this means we
> can scan the state list to collect expired states without affecting
> packet processing.
> 
> however, if you have a lot of states (like me) you can still spend a
> lot of time scanning. i seem to sit at around 1 to 1.2 million states,
> and i run with the default scan interval of 10 seconds. that means i'm
> scanning about 100 to 120k states every second, which just takes time.
> 
> doing the scan while holding the kernel lock means most other stuff
> cannot run at the same time. worse, the timeout that schedules the pf
> purge work also seems to schedule something in the softclock thread. if
> the same timeout also wakes up a process in an unlocked syscall, a
> significant amount of time in the system is spent spinning for no
> reason. significant meaning dozens of milliseconds, which is actually
> noticeable if you're working interactively on the box.
> 
> before this change pf purges often took 10 to 50ms. the softclock
> thread that runs next to it often took a similar amount of time, presumably
> because they ended up spinning waiting for each other. after this
> change the pf_purges are more like 6 to 12ms, and don't block softclock.
> most of the variability in the runs now seems to come from contention on
> the net lock.
> 
> the state purge task now checks the SHOULDYIELD flag, so if the
> task is taking too long it will return early and requeue itself,
> allowing the taskq machinery to yield and let other threads run.
> 
> state purging is now also limited to removing 64 states per task run to
> avoid holding locks that would block packet processing for too long. if
> there's still states to scan in the interval after these 64 states are
> collected, the task reschedules so it can keep scanning from where it
> left off.
> 
> the other purge tasks for source nodes, rules, and fragments have been
> moved to their own timeout/task pair to simplify the time accounting.
> 
> the box feels a lot more responsive with this change. i'm still kicking
> it around, but i wanted to get it out early.
> 
> Index: pf.c
> ===
> RCS file: /cvs/src/sys/net/pf.c,v
> retrieving revision 1.1132
> diff -u -p -r1.1132 pf.c
> --- pf.c  23 May 2022 11:17:35 -  1.1132
> +++ pf.c  7 Jun 2022 06:49:51 -
> @@ -120,10 +120,6 @@ u_charpf_tcp_secret[16];
>  int   pf_tcp_secret_init;
>  int   pf_tcp_iss_off;
>  
> -int   pf_npurge;
> -struct task   pf_purge_task = TASK_INITIALIZER(pf_purge, &pf_npurge);
> -struct timeout pf_purge_to = TIMEOUT_INITIALIZER(pf_purge_timeout, NULL);
> -
>  enum pf_test_status {
>   PF_TEST_FAIL = -1,
>   PF_TEST_OK,
> @@ -1276,49 +1272,111 @@ pf_purge_expired_rules(void)
>   }
>  }
>  
> -void
> -pf_purge_timeout(void *unused)
> -{
> - /* XXX move to systqmp to avoid KERNEL_LOCK */
> - task_add(systq, &pf_purge_task);
> -}
> +void  pf_purge_states(void *);
> +struct task   pf_purge_states_task =
> +  TASK_INITIALIZER(pf_purge_states, NULL);
> +
> +void  pf_purge_states_tick(void *);
> +struct timeoutpf_purge_states_to =
> +  TIMEOUT_INITIALIZER(pf_purge_states_tick, NULL);
> +
> 

Re: pf: pool for pf_anchor

2022-07-20 Thread Alexandr Nedvedicky
Hello,

On Tue, Jul 19, 2022 at 10:58:08PM +0200, Alexander Bluhm wrote:
> On Tue, Jul 19, 2022 at 07:18:54PM +0200, Moritz Buhl wrote:
> > A solution would be to move the allocation of the pf_anchor struct
> > into a pool.  One final question would be regarding the size of the
> > hiwat or hardlimit.  Any suggestions?
> 
> 10 seems way too low.  We want to limit resource abuse.  But regular
> users should never hit this.  Especially existing configurations
> should still work after upgrade.
> 
> Perhaps 512 anchors.  No one should ever need more than 512 anchors :-)

512 seems reasonable. All other options which come to my mind
require a far more complex change. For example we might consider
tying those uncommitted transactions to the owning process. Then we
could release all uncommitted data in pfclose() for such a process.

> 
> > Any other comments?


> 
> Please do not use PR_WAITOK in pool_init() unless you know what you
> are doing.  I think it should work, but who knows.  The other pf
> pools have 0 flags, just use that.
> 
> Comment says rs_malloc is needed to compile pfctl.  Have you tested
> that?
> 
> Replacing M_WAITOK|M_CANFAIL|M_ZERO with PR_NOWAIT|PR_ZERO looks
> wrong.  PR_WAITOK|PR_LIMITFAIL|PR_ZERO should do something equivalent.
> 
> I am not sure if pf_find_anchor() should fail if the limit is hit.
> It uses only a temporary buffer.  All calls to pf_find_ruleset()
> should be checked whether permanent failure, after the limit has
> been reached, is what we want.  Or we keep the malloc there?
> 

this seems to be a good point. I think we should keep malloc() there.


thanks and
regards
sashan



fix NAT round-robin and random

2022-07-20 Thread Alexandr Nedvedicky
Hello,

below is a final version of patch for NAT issue discussed at bugs@ [1].
Patch below is updated according to feedback I got from Chris, claudio@
and hrvoje@.

The summary of changes is as follows:

- prevent infinite loop when packet hits NAT rule as follows:
pass out on em0 from 172.16.0.0/16 to any nat-to { 49/27 }
the issue has been introduced by my earlier commit [2]. The earlier
change made pf(4) interpret 49/27 as a single IP address (POOL_NONE).
This is wrong, because the pool 49/27 actually contains 32 addresses.

- while investigating the issue I've realized the 'random' pool should
  rather be using arc4random_uniform() with an upper limit derived from
  the mask. Also the random number should be converted to network byte
  order.

- also while I was debugging my change I've noticed we should be using
  pf_poolmask() to obtain the address as a combination of the pool
  address and the result of the generator (round-robin or random).

OK to commit?

thanks and
regards
sashan


[1] https://marc.info/?t=16581336821&r=1&w=2
https://marc.info/?t=16573254651&r=1&w=2
https://marc.info/?l=openbsd-bugs&m=165817500514813&w=2

[2] https://marc.info/?l=openbsd-cvs&m=164500117319660&w=2

8<---8<---8<--8<
diff --git a/sys/net/pf_lb.c b/sys/net/pf_lb.c
index d106073d372..dff2349cde7 100644
--- a/sys/net/pf_lb.c
+++ b/sys/net/pf_lb.c
@@ -344,6 +344,17 @@ pf_map_addr_sticky(sa_family_t af, struct pf_rule *r, struct pf_addr *saddr,
return (0);
 }
 
+uint32_t
+pf_rand_addr(uint32_t mask)
+{
+   uint32_t addr;
+
+   mask = ~ntohl(mask);
+   addr = arc4random_uniform(mask + 1);
+
+   return (htonl(addr));
+}
+
 int
 pf_map_addr(sa_family_t af, struct pf_rule *r, struct pf_addr *saddr,
 struct pf_addr *naddr, struct pf_addr *init_addr, struct pf_src_node **sns,
@@ -428,24 +439,29 @@ pf_map_addr(sa_family_t af, struct pf_rule *r, struct pf_addr *saddr,
} else if (init_addr != NULL && PF_AZERO(init_addr, af)) {
switch (af) {
case AF_INET:
-   rpool->counter.addr32[0] = arc4random();
+   rpool->counter.addr32[0] = pf_rand_addr(
+   rmask->addr32[0]);
break;
 #ifdef INET6
case AF_INET6:
if (rmask->addr32[3] != 0x)
-   rpool->counter.addr32[3] = arc4random();
+   rpool->counter.addr32[3] = pf_rand_addr(
+   rmask->addr32[3]);
else
break;
if (rmask->addr32[2] != 0x)
-   rpool->counter.addr32[2] = arc4random();
+   rpool->counter.addr32[2] = pf_rand_addr(
+   rmask->addr32[2]);
else
break;
if (rmask->addr32[1] != 0x)
-   rpool->counter.addr32[1] = arc4random();
+   rpool->counter.addr32[1] = pf_rand_addr(
+   rmask->addr32[1]);
else
break;
if (rmask->addr32[0] != 0x)
-   rpool->counter.addr32[0] = arc4random();
+   rpool->counter.addr32[0] = pf_rand_addr(
+   rmask->addr32[0]);
break;
 #endif /* INET6 */
default:
@@ -500,11 +516,16 @@ pf_map_addr(sa_family_t af, struct pf_rule *r, struct pf_addr *saddr,
}
} else if (PF_AZERO(&rpool->counter, af)) {
/*
-* fall back to POOL_NONE if there are no addresses in
-* pool
+* fall back to POOL_NONE if there is a single host
+* address in pool.
 */
-   pf_addrcpy(naddr, raddr, af);
-   break;
+   if ((af == AF_INET &&
+   rmask->addr32[0] == INADDR_BROADCAST) ||
+   (af == AF_INET6 &&
+   IN6_ARE_ADDR_EQUAL(>v6, ))) {
+   pf_addrcpy(naddr, raddr, af);
+   break;
+   }
} else if (pf_match_addr(0, raddr, rmask, &rpool->counter, af))
return (1);
 
@@ -532,9 +553,9 @@ pf_map_addr(sa_family_t af, struct pf_rule 

Re: pf: DIOCXCOMMIT and copyin

2022-07-21 Thread Alexandr Nedvedicky
Hello,

On Thu, Jul 21, 2022 at 11:13:28AM +0200, Moritz Buhl wrote:
> Hi tech,
> 
> for the other two DIOCX ioctls syzkaller showed that it is possible
> to grab netlock while doing copyin.
> 
> The same problem should exist for DIOCXCOMMIT but syzkaller didn't
> find it yet.
> 
> In case anybody can reproduce the witness lock order reversals the
> syzkaller can produce, the diff below might address the problem.
> 
> mbuhl
> 

change looks good to me.

OK sashan



Re: [External] : Re: pf igmp icmp6 multicast router alert

2022-04-27 Thread Alexandr Nedvedicky
Hello,

On Thu, Apr 28, 2022 at 12:00:09AM +0200, Alexander Bluhm wrote:
> On Fri, Apr 22, 2022 at 07:40:17PM +0200, Alexandr Nedvedicky wrote:
> > > + case IPPROTO_ICMPV6:
> > > + if (!pf_pull_hdr(pd->m, pd->off, &icmp6, sizeof(icmp6),
> > > + NULL, reason, AF_INET6)) {
> > > + DPFPRINTF(LOG_NOTICE, "IPv6 short icmp6hdr");
> > > + return (PF_DROP);
> > > + }
> > > + /* ICMP multicast packets have router alert options */
> > > + switch (icmp6.icmp6_type) {
> > > + case MLD_LISTENER_QUERY:
> > > + case MLD_LISTENER_REPORT:
> > > + case MLD_LISTENER_DONE:
> >
> > I wonder if we should have a similar check we have for IPv4 address,
> > where we require a multicast address. for example in case of
> > MLD_LISTENER_QUERY the packet destination address should be fe80::/10.
> > I need to look at RFCs more closely first. Just asking in case someone 
> > else
> > knows from top of the head.
> 
> Where do we check multicast address for IPv4?  At this point we
> are just comparing protocol and IP options.  I would not make it
> more complex, so I will not add multicast address checks here.
> 

yes, you are right there is no such check for IPv4.

I was just quickly reading RFC 3810 [1], section 5 Message Formats
reads as follows:

8<---8<-8<
   MLDv2 is a sub-protocol of ICMPv6, that is, MLDv2 message types are a
   subset of ICMPv6 messages, and MLDv2 messages are identified in IPv6
   packets by a preceding Next Header value of 58.  All MLDv2 messages
   described in this document MUST be sent with a link-local IPv6 Source
   Address, an IPv6 Hop Limit of 1, and an IPv6 Router Alert option
   [RFC2711] in a Hop-by-Hop Options header.  (The Router Alert option
   is necessary to cause routers to examine MLDv2 messages sent to IPv6
   multicast addresses in which the routers themselves have no
   interest.)  MLDv2 Reports can be sent with the source address set to
   the unspecified address [RFC3513], if a valid link-local IPv6 source
   address has not been acquired yet for the sending interface.  (See
   section 5.2.13. for details.)
8<---8<-8<

I'm not sure if we check the src/dst addresses and
hop limit when we process an MLD message.

I'm going to look at your diff now. We can add those checks
in a follow-up commit in case they are missing.

regards
sashan

[1] https://www.ietf.org/rfc/rfc3810.txt



Re: [External] : pfsync mutex mpfloor

2022-04-13 Thread Alexandr Nedvedicky
On Wed, Apr 13, 2022 at 09:33:52PM +0200, Alexander Bluhm wrote:
> Hi,
> 
> Hrvoje has hit a witness issue in pfsync.
> 
> panic: acquiring blockable sleep lock with spinlock or critical
> section held (kernel_lock) &kernel_lock
> 
> panic(81f45bb7) at panic+0xbf
> witness_checkorder(8246e970,9,0) at witness_checkorder+0xb61
> __mp_lock(8246e768) at __mp_lock+0x5f
> kpageflttrap(800020b26dc0,17) at kpageflttrap+0x173
> kerntrap(800020b26dc0) at kerntrap+0x91
> alltraps_kern_meltdown() at alltraps_kern_meltdown+0x7b
> pfsync_q_del(fd875f6336c0) at pfsync_q_del+0x70
> pfsync_delete_state(fd875f6336c0) at pfsync_delete_state+0x118
> 
> pf and pfsync are running without kernel lock, so the mutexes
> must have at least mpfloor spl protection.
> 
> ok?
> 

looks correct to me.

OK sashan



Re: [External] : Re: pf igmp icmp6 multicast router alert

2022-04-28 Thread Alexandr Nedvedicky
Hello,

On Thu, Apr 28, 2022 at 12:36:40AM +0200, Alexander Bluhm wrote:
> On Wed, Apr 27, 2022 at 11:47:45PM +0200, Alexander Bluhm wrote:
> > New diff:
> > - make off and end relative to opts array
> > - check length of IPv4 options
> > - fix call to pf_walk_option
> > - add case IP6OPT_PADN
> > - add case MLDV2_LISTENER_REPORT
> 
> - pf_pull_hdr() before pf_walk_option6() was missing
> 
> ok?

diff reads OK to me as far as I can tell.


OK sashan



Re: [External] : another copyout while holding net/pf lock in pfioctl

2022-04-28 Thread Alexandr Nedvedicky
Hello Moritz,

I like your change. There is just one nit see comment below.


> 
> 
> Index: sys/net/pf_if.c
> ===
> RCS file: /cvs/src/sys/net/pf_if.c,v
> retrieving revision 1.103
> diff -u -p -r1.103 pf_if.c
> --- sys/net/pf_if.c   26 Dec 2021 01:00:32 -  1.103
> +++ sys/net/pf_if.c   28 Apr 2022 16:36:49 -
> @@ -755,7 +755,7 @@ pfi_update_status(const char *name, stru
>   }
>  }
>  
> -int
> +void
>  pfi_get_ifaces(const char *name, struct pfi_kif *buf, int *size)
>  {
>   struct pfi_kif  *p, *nextp;
> @@ -769,16 +769,12 @@ pfi_get_ifaces(const char *name, struct 
>   if (!p->pfik_tzero)
>   p->pfik_tzero = gettime();
>   pfi_kif_ref(p, PFI_KIF_REF_RULE);
> - if (copyout(p, buf++, sizeof(*buf))) {
> - pfi_kif_unref(p, PFI_KIF_REF_RULE);
> - return (EFAULT);
> - }
> + memcpy(buf++, p, sizeof(*buf));
>   nextp = RB_NEXT(pfi_ifhead, _ifs, p);
>   pfi_kif_unref(p, PFI_KIF_REF_RULE);
>   }
>   }
I think pfi_kif_ref()/pfi_kif_unref() are no longer needed
and both can get removed.


apart from nit above your diff reads OK to me.


thanks and
regards
sashan



Re: [External] : Re: pf igmp icmp6 multicast router alert

2022-04-28 Thread Alexandr Nedvedicky
Hello,
On Thu, Apr 28, 2022 at 10:31:33PM +0200, Alexander Bluhm wrote:

> 
> Thanks.  regress/sys/netinet6/frag6 found a small issue.  If the
> icmp6 header is fragmented, we cannot pull the icmp6 header.  I had
> to copy the fragment check to the beginning of case IPPROTO_ICMPV6.
> 
> This chunk is new:
> + case IPPROTO_ICMPV6:
> + /* fragments may be short, ignore inner header then */
> + if (pd->fragoff != 0 && end < pd->off + sizeof(icmp6)) {
> + pd->off = pd->fragoff;
> + pd->proto = IPPROTO_FRAGMENT;
> + return (PF_PASS);
> + }
> 
> Although it is questionable if we should allow fragmented header
> chains, I don't want to change behavior here.  If I recall correctly
> newer RFCs forbid fragmented header chains.  But I had implemented
> this code before the IPv6 standards have discovered the security
> implications.
> 
> I am currently running a full regress.
> 

new diff still reads OK to me.

thanks and
regards
sashan



Re: [External] : Re: add sanity checks to IGMP/MLD

2022-05-02 Thread Alexandr Nedvedicky
Hello,

> 
> Checking that the TTL equals 1 is a good thing.  We should prevent
> that someone is forwarding such packets.
> 
> The router alert is a hint to routers on the way to look at these
> packets.  If they are missing, no harm is done.  Maybe some multicast
> does not work.  But there is no security argument to filter these.
> 
> I have seen IGMP packets without router alert.  Our stack has fixed
> that in OpenBSD 5.6.  Don't believe what is written in RFCs.
> 
> 
> revision 1.40
> date: 2014/05/12 09:15:00;  author: mpi;  state: Exp;  lines: +28 -5;
> Includes a router altert option (RAO) in IGMP packets.   Without this
> option, required by the RFC2236, some L3 switches do not examine the
> packets.
> 
> Based on FreeBSD's r14622 via Florian Riehm on tech@. ok bluhm@, jca@
> 

I see. The new diff keeps the check that ttl/hop-limit is 1.
It also keeps the check for a link-local source address in IPv6.


OK? Or should I also drop the check for a link-local source address
in IPv6?

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index f15e1ead8c0..2187d895749 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -6456,8 +6456,15 @@ pf_walk_header(struct pf_pdesc *pd, struct ip *h, u_short *reason)
pd->off += hlen;
pd->proto = h->ip_p;
/* IGMP packets have router alert options, allow them */
-   if (pd->proto == IPPROTO_IGMP)
-   CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
+   if (pd->proto == IPPROTO_IGMP) {
+   /* According to RFC 1112 ttl must be set to 1. */
+   if (h->ip_ttl != 1) {
+   DPFPRINTF(LOG_NOTICE, "TTL in IGMP must be 1");
+   REASON_SET(reason, PFRES_IPOPTIONS);
+   return (PF_DROP);
+   } else
+   CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
+   }
/* stop walking over non initial fragments */
if ((h->ip_off & htons(IP_OFFMASK)) != 0)
return (PF_PASS);
@@ -6698,7 +6705,21 @@ pf_walk_header6(struct pf_pdesc *pd, struct ip6_hdr *h, u_short *reason)
case MLD_LISTENER_REPORT:
case MLD_LISTENER_DONE:
case MLDV2_LISTENER_REPORT:
-   CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
+   /*
+* According to RFC 2710 all MLD messages are
+* sent with hop-limit (ttl) set to 1, and link
+* local source address.  If either one is
+* missing then MLD message is invalid and
+* should be discarded.
+*/
+   if ((h->ip6_hlim == 1) &&
+   IN6_IS_ADDR_LINKLOCAL(&h->ip6_src))
+   CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
+   else {
+   DPFPRINTF(LOG_NOTICE, "invalid MLD");
+   REASON_SET(reason, PFRES_IPOPTIONS);
+   return (PF_DROP);
+   }
break;
}
return (PF_PASS);



add sanity checks to IGMP/MLD

2022-05-02 Thread Alexandr Nedvedicky
hello,

bluhm@ has committed a fix [1] which makes pf accept IGMP/MLD messages.
If I remember correctly, pf(4) was dropping those messages because
of the Router Alert IP option being present. The IP option is mandatory
for IGMP/MLD according to the RFCs.

For both protocol versions (IPv4, IPv6) standards say:

TTL/hop-limit in IP header must be set to 1

Router Alert option/extension header must be present

in case of IPv6 the MLD messages must be sent from a link-local address.


diff below adds exactly those checks.

OK?

thanks and
regards
sashan


[1] https://marc.info/?l=openbsd-tech=165109904223362=2

[2] https://datatracker.ietf.org/doc/html/rfc2710
https://datatracker.ietf.org/doc/html/rfc2236

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index f15e1ead8c0..9b107751d95 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -6456,8 +6456,21 @@ pf_walk_header(struct pf_pdesc *pd, struct ip *h, u_short *reason)
pd->off += hlen;
pd->proto = h->ip_p;
/* IGMP packets have router alert options, allow them */
-   if (pd->proto == IPPROTO_IGMP)
-   CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
+   if (pd->proto == IPPROTO_IGMP) {
+   /*
+* If the router alert option is missing or ttl is not 1, then
+* we have an invalid IGMP packet. According to RFC 1112 ttl must be
+* set to 1. Also IP header must carry router alert option
+* as specified in RFC 2236.
+*/
+   if (ISSET(pd->badopts, PF_OPT_ROUTER_ALERT) && (h->ip_ttl == 1))
+   CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
+   else {
+   DPFPRINTF(LOG_NOTICE, "invalid IGMP");
+   REASON_SET(reason, PFRES_IPOPTIONS);
+   return (PF_DROP);
+   }
+   }
/* stop walking over non initial fragments */
if ((h->ip_off & htons(IP_OFFMASK)) != 0)
return (PF_PASS);
@@ -6698,7 +6711,22 @@ pf_walk_header6(struct pf_pdesc *pd, struct ip6_hdr *h, u_short *reason)
case MLD_LISTENER_REPORT:
case MLD_LISTENER_DONE:
case MLDV2_LISTENER_REPORT:
-   CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
+   /*
+* According to RFC 2710 all MLD messages are
+* sent with router alert header and hop
+* limit (ttl) set to 1, and link local source
+* address.  If either one is missing then MLD
+* message is invalid and should be discarded.
+*/
+   if (ISSET(pd->badopts, PF_OPT_ROUTER_ALERT) &&
+   (h->ip6_hlim == 1) &&
+   IN6_IS_ADDR_LINKLOCAL(&h->ip6_src))
+   CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
+   else {
+   DPFPRINTF(LOG_NOTICE, "invalid MLD");
+   REASON_SET(reason, PFRES_IPOPTIONS);
+   return (PF_DROP);
+   }
break;
}
return (PF_PASS);



Re: [External] : Re: add sanity checks to IGMP/MLD

2022-05-03 Thread Alexandr Nedvedicky
Hello

On Tue, May 03, 2022 at 10:44:48AM +0200, Claudio Jeker wrote:

> 
> The RFC does not use the usual MUST to enforce any of this.
> So yes, we should probably not be too strict because there is no way to
> force accept the packet when pf_walk_header() returns PF_DROP.
> 
> I agree that the TTL MUST be 1. Also the destination address must be a
> multicast address (IGMP to a unicast IP makes no sense).
>  

updated diff below requires destination address to be multicast for IGMP.

thanks and
regards
sashan

8<8<8<8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index f15e1ead8c0..feb7965c427 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -6456,8 +6456,15 @@ pf_walk_header(struct pf_pdesc *pd, struct ip *h, u_short *reason)
pd->off += hlen;
pd->proto = h->ip_p;
/* IGMP packets have router alert options, allow them */
-   if (pd->proto == IPPROTO_IGMP)
+   if (pd->proto == IPPROTO_IGMP) {
+   /* According to RFC 1112 ttl must be set to 1. */
+   if ((h->ip_ttl != 1) || !IN_MULTICAST(h->ip_dst.s_addr)) {
+   DPFPRINTF(LOG_NOTICE, "TTL in IGMP must be 1");
+   REASON_SET(reason, PFRES_IPOPTIONS);
+   return (PF_DROP);
+   }
CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
+   }
/* stop walking over non initial fragments */
if ((h->ip_off & htons(IP_OFFMASK)) != 0)
return (PF_PASS);
@@ -6698,6 +6705,19 @@ pf_walk_header6(struct pf_pdesc *pd, struct ip6_hdr *h, u_short *reason)
case MLD_LISTENER_REPORT:
case MLD_LISTENER_DONE:
case MLDV2_LISTENER_REPORT:
+   /*
+* According to RFC 2710 all MLD messages are
+* sent with hop-limit (ttl) set to 1, and link
+* local source address.  If either one is
+* missing then MLD message is invalid and
+* should be discarded.
+*/
+   if ((h->ip6_hlim != 1) ||
+   !IN6_IS_ADDR_LINKLOCAL(&h->ip6_src)) {
+   DPFPRINTF(LOG_NOTICE, "invalid MLD");
+   REASON_SET(reason, PFRES_IPOPTIONS);
+   return (PF_DROP);
+   }
CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
break;
}



Re: [External] : Re: add sanity checks to IGMP/MLD

2022-05-03 Thread Alexandr Nedvedicky
On Tue, May 03, 2022 at 02:12:36PM +0200, Claudio Jeker wrote:
> On Tue, May 03, 2022 at 02:08:33PM +0200, Alexandr Nedvedicky wrote:
> > Hello
> > 
> > On Tue, May 03, 2022 at 10:44:48AM +0200, Claudio Jeker wrote:
> > 
> > > 
> > > The RFC does not use the usual MUST to enforce any of this.
> > > So yes, we should probably not be too strict because there is no way to
> > > force accept the packet when pf_walk_header() returns PF_DROP.
> > > 
> > > I agree that the TTL MUST be 1. Also the destination address must be a
> > > multicast address (IGMP to a unicast IP makes no sense).
> > >  
> > 
> > updated diff below requires destination address to be multicast for 
> > IGMP.
> > 
> > thanks and
> > regards
> > sashan
> > 
> > 8<8<8<8<
> > diff --git a/sys/net/pf.c b/sys/net/pf.c
> > index f15e1ead8c0..feb7965c427 100644
> > --- a/sys/net/pf.c
> > +++ b/sys/net/pf.c
> > @@ -6456,8 +6456,15 @@ pf_walk_header(struct pf_pdesc *pd, struct ip *h, u_short *reason)
> > pd->off += hlen;
> > pd->proto = h->ip_p;
> > /* IGMP packets have router alert options, allow them */
> > -   if (pd->proto == IPPROTO_IGMP)
> > +   if (pd->proto == IPPROTO_IGMP) {
> > +   /* According to RFC 1112 ttl must be set to 1. */
> > +   if ((h->ip_ttl != 1) || !IN_MULTICAST(h->ip_dst.s_addr)) {
> > +   DPFPRINTF(LOG_NOTICE, "TTL in IGMP must be 1");
> 
> Maybe use "Invalid IGMP" here (which is similar to the IPv6 case).
> Apart from that OK claudio
> 

will fix it before commit.

thanks and
regards
sashan



Re: [External] : Re: add sanity checks to IGMP/MLD

2022-05-03 Thread Alexandr Nedvedicky
Hello,

On Tue, May 03, 2022 at 09:19:44AM +0200, Alexander Bluhm wrote:
> On Tue, May 03, 2022 at 12:26:52AM +0200, Alexandr Nedvedicky wrote:
> > OK ? or should I also drop a check for link-local source address
> > in IPv6?
> 
> The link-local check makes sense.
> 

> > +   CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
> 
> You return in the if block.  No need for else.

fixed
> > +   }
> 
> Can you turn around the logic?
> 
> if (something bad)
>   return
> clear badopts
> 

sure

updated diff is below.
thanks for taking a look at it.

regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index f15e1ead8c0..bf9593952ec 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -6456,8 +6456,15 @@ pf_walk_header(struct pf_pdesc *pd, struct ip *h, u_short *reason)
pd->off += hlen;
pd->proto = h->ip_p;
/* IGMP packets have router alert options, allow them */
-   if (pd->proto == IPPROTO_IGMP)
+   if (pd->proto == IPPROTO_IGMP) {
+   /* According to RFC 1112 ttl must be set to 1. */
+   if (h->ip_ttl != 1) {
+   DPFPRINTF(LOG_NOTICE, "TTL in IGMP must be 1");
+   REASON_SET(reason, PFRES_IPOPTIONS);
+   return (PF_DROP);
+   }
CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
+   }
/* stop walking over non initial fragments */
if ((h->ip_off & htons(IP_OFFMASK)) != 0)
return (PF_PASS);
@@ -6698,6 +6705,19 @@ pf_walk_header6(struct pf_pdesc *pd, struct ip6_hdr *h, u_short *reason)
case MLD_LISTENER_REPORT:
case MLD_LISTENER_DONE:
case MLDV2_LISTENER_REPORT:
+   /*
+* According to RFC 2710 all MLD messages are
+* sent with hop-limit (ttl) set to 1, and link
+* local source address.  If either one is
+* missing then MLD message is invalid and
+* should be discarded.
+*/
+   if ((h->ip6_hlim != 1) ||
+   !IN6_IS_ADDR_LINKLOCAL(&h->ip6_src)) {
+   DPFPRINTF(LOG_NOTICE, "invalid MLD");
+   REASON_SET(reason, PFRES_IPOPTIONS);
+   return (PF_DROP);
+   }
CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
break;
}



Re: [External] : Re: another syzkaller problem in pf

2022-05-04 Thread Alexandr Nedvedicky
On Tue, May 03, 2022 at 09:31:51PM +0200, Alexander Bluhm wrote:
> On Tue, May 03, 2022 at 07:42:34PM +0200, Moritz Buhl wrote:
> > commit 4b3977248902c22d96aaebdb5784840debc2631c
> > Author: mikeb 
> > Date:   Mon Nov 24 13:22:09 2008 +
> > 
> > Fix splasserts seen in pr 5987 by propagating a flag that discribes
> > whether we're called from the interrupt context to the functions
> > performing allocations.
> 
> These days pf was protected with kernel lock and spl.  Both are
> released when sleeping.  Now we have netlock and pflock.  These are
> rwlocks and not released during sleep.  So this old race should not
> exist anymore.
> 
> > And we are not in interrupt context.
> 
> Yes, it is ioctl(2).  I think we should always malloc with M_WAITOK
> when in syscall.  Otherwise userland would have to cope with randomly
> failing syscalls.
> 
> > If this is sound, then the only reason why pfr_destroy_ktable was called
> > is that pool_get is called with PR_NOWAIT.  And then the following diff
> > would help.
> 
> The code is too complex to be sure what the reason of the syzkaller
> panic is.  Sleep in malloc is correct anyway and may improve the
> situation.
> 
> Functions with argument values 0 or 1 are hard to read.  It would
> be much nicer to pass M_WAITOK or M_NOWAIT.  And the variable name
> "intr" does not make sense anymore.  pf does not run in interrupt
> context.  Call it "mflags" like in pfi_kif_alloc().  Or "wait" like
> in other functions.
> 
> Could you cleanup that also?

I have a diff which touches the same area. It's a work-in-progress change.
It moves all memory allocations outside of NET_LOCK/PF_LOCK. I'm just
sending it for your reference now.

I'm not sure flipping a flag is the right change. In general we don't want
to hold NET_LOCK()/PF_LOCK() while waiting for memory.

regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index 8315b115474..1f036e1368f 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -2217,12 +2217,8 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
error = ENODEV;
goto fail;
}
-   NET_LOCK();
-   PF_LOCK();
error = pfr_add_tables(io->pfrio_buffer, io->pfrio_size,
&io->pfrio_nadd, io->pfrio_flags | PFR_FLAG_USERIOCTL);
-   PF_UNLOCK();
-   NET_UNLOCK();
break;
}
 
diff --git a/sys/net/pf_table.c b/sys/net/pf_table.c
index fb23bcabe04..cbaa728b105 100644
--- a/sys/net/pf_table.c
+++ b/sys/net/pf_table.c
@@ -189,6 +189,7 @@ void pfr_clstats_ktable(struct 
pfr_ktable *, time_t, int);
 struct pfr_ktable  *pfr_create_ktable(struct pfr_table *, time_t, int,
int);
 voidpfr_destroy_ktables(struct pfr_ktableworkq *, int);
+voidpfr_destroy_ktables_aux(struct pfr_ktableworkq *);
 voidpfr_destroy_ktable(struct pfr_ktable *, int);
 int pfr_ktable_compare(struct pfr_ktable *,
struct pfr_ktable *);
@@ -1493,14 +1494,16 @@ pfr_clr_tables(struct pfr_table *filter, int *ndel, int 
flags)
 int
 pfr_add_tables(struct pfr_table *tbl, int size, int *nadd, int flags)
 {
-   struct pfr_ktableworkq   addq, changeq;
-   struct pfr_ktable   *p, *q, *r, key;
+   struct pfr_ktableworkq   addq, changeq, auxq;
+   struct pfr_ktable   *p, *q, *r, *n, *w, key;
int  i, rv, xadd = 0;
time_t   tzero = gettime();
 
ACCEPT_FLAGS(flags, PFR_FLAG_DUMMY);
SLIST_INIT(&addq);
SLIST_INIT(&changeq);
+   SLIST_INIT(&auxq);
+   /* pre-allocate all memory outside of locks */
for (i = 0; i < size; i++) {
YIELD(flags & PFR_FLAG_USERIOCTL);
if (COPYIN(tbl+i, &key.pfrkt_t, sizeof(key.pfrkt_t), flags))
@@ -1509,65 +1512,140 @@ pfr_add_tables(struct pfr_table *tbl, int size, int 
*nadd, int flags)
flags & PFR_FLAG_USERIOCTL))
senderr(EINVAL);
key.pfrkt_flags |= PFR_TFLAG_ACTIVE;
-   p = RB_FIND(pfr_ktablehead, &pfr_ktables, &key);
-   if (p == NULL) {
-   p = pfr_create_ktable(&key.pfrkt_t, tzero, 1,
-   !(flags & PFR_FLAG_USERIOCTL));
-   if (p == NULL)
-   senderr(ENOMEM);
-   SLIST_FOREACH(q, &addq, pfrkt_workq) {
-   if (!pfr_ktable_compare(p, q)) {
-   pfr_destroy_ktable(p, 0);
-   goto _skip;
-   }
-   }
-   SLIST_INSERT_HEAD(&addq, p, pfrkt_workq);
-   xadd++;
-  

pf.conf(5) clarify ICMP sloppy state handling

2022-05-08 Thread Alexandr Nedvedicky
Hello,

this tiny update to pf.conf(5) has been prompted here [1] on the
pf mailing list. By default only ICMP queries are allowed
to create state in pf(4). The sloppy option relaxes that
so ICMP replies can create a state as well. I think this
should also be mentioned in pf.conf(5).
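For illustration, the difference could look like this in a ruleset (hypothetical rules written in standard pf.conf syntax):

```
# default: only ICMP queries, e.g. echo requests, may create state
pass inet proto icmp icmp-type echoreq

# with sloppy, ICMP replies may create state as well
pass inet proto icmp keep state (sloppy)
```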

OK to my suggestion below?

thanks and
regards
sashan


[1] https://marc.info/?l=openbsd-pf=165160086423472=2

8<---8<---8<--8<
diff --git a/share/man/man5/pf.conf.5 b/share/man/man5/pf.conf.5
index fe4b117994a..7389d231fe2 100644
--- a/share/man/man5/pf.conf.5
+++ b/share/man/man5/pf.conf.5
@@ -2186,6 +2186,9 @@ It cannot be used with
 .Cm modulate state
 or
 .Cm synproxy state .
+The option also relaxes handling of ICMP such that also ICMP replies
+are allowed to create state.
+By default ICMP queries only are allowed to create state.
 .It Ar timeout seconds
 Changes the
 .Ar timeout



move memory allocation in pfr_add_tables() outside of NET_LOCK()/PF_LOCK()

2022-05-08 Thread Alexandr Nedvedicky
Hello,

diff below reshuffles pfr_add_tables() so all memory allocations happen
outside of NET_LOCK()/PF_LOCK(). pf(4) currently allocates a new table with
the PR_WAITOK flag when the function is invoked on behalf of ioctl (the
PFR_FLAG_USERIOCTL flag is set). PR_WAITOK for memory allocation is OK if we
don't hold NET_LOCK() or PF_LOCK() while waiting for memory. Unfortunately the
code in -current is not OK in this respect. The diff below fixes that.

The change pushes lock acquisition from pf_ioctl.c further down to
pfr_add_tables(). The proposed change splits the current giant loop
in pfr_add_tables() into three loops:

the first loop allocates an auxiliary list of the tables we copied in.
The loop also checks for duplicates in the input.

we grab NET_LOCK()/PF_LOCK() before entering the second loop.

the second loop moves table instances from the auxiliary list
into addq when the table does not exist already. If a table
does exist it is left in auxq and we put the existing table
we've just found into changeq.

If PFR_FLAG_DUMMY is not set the code will execute the
third loop, which processes the addq we've constructed earlier
in the second loop.

The most complicated part is the third loop, which processes
addq. The third loop must handle those situations:

- table which is being inserted is attached to the main ruleset;
  in this case we must move the '->pfrkt_root' table we've
  pre-allocated in the first loop to auxq

- the root table ('->pfrkt_root') for newly introduced table
  exists already. In that case use existing one and move
  pre-allocated table to auxq

- if root table does not exist yet, then initialize it and
  insert it into addq.


OK ?

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index 8315b115474..1f036e1368f 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -2217,12 +2217,8 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
error = ENODEV;
goto fail;
}
-   NET_LOCK();
-   PF_LOCK();
error = pfr_add_tables(io->pfrio_buffer, io->pfrio_size,
&io->pfrio_nadd, io->pfrio_flags | PFR_FLAG_USERIOCTL);
-   PF_UNLOCK();
-   NET_UNLOCK();
break;
}
 
diff --git a/sys/net/pf_table.c b/sys/net/pf_table.c
index fb23bcabe04..79fd3e0447b 100644
--- a/sys/net/pf_table.c
+++ b/sys/net/pf_table.c
@@ -189,6 +189,7 @@ void pfr_clstats_ktable(struct 
pfr_ktable *, time_t, int);
 struct pfr_ktable  *pfr_create_ktable(struct pfr_table *, time_t, int,
int);
 voidpfr_destroy_ktables(struct pfr_ktableworkq *, int);
+voidpfr_destroy_ktables_aux(struct pfr_ktableworkq *);
 voidpfr_destroy_ktable(struct pfr_ktable *, int);
 int pfr_ktable_compare(struct pfr_ktable *,
struct pfr_ktable *);
@@ -1493,14 +1494,16 @@ pfr_clr_tables(struct pfr_table *filter, int *ndel, int 
flags)
 int
 pfr_add_tables(struct pfr_table *tbl, int size, int *nadd, int flags)
 {
-   struct pfr_ktableworkq   addq, changeq;
-   struct pfr_ktable   *p, *q, *r, key;
+   struct pfr_ktableworkq   addq, changeq, auxq;
+   struct pfr_ktable   *p, *q, *r, *n, *w, key;
int  i, rv, xadd = 0;
time_t   tzero = gettime();
 
ACCEPT_FLAGS(flags, PFR_FLAG_DUMMY);
SLIST_INIT(&addq);
SLIST_INIT(&changeq);
+   SLIST_INIT(&auxq);
+   /* pre-allocate all memory outside of locks */
for (i = 0; i < size; i++) {
YIELD(flags & PFR_FLAG_USERIOCTL);
if (COPYIN(tbl+i, &key.pfrkt_t, sizeof(key.pfrkt_t), flags))
@@ -1509,65 +1512,142 @@ pfr_add_tables(struct pfr_table *tbl, int size, int 
*nadd, int flags)
flags & PFR_FLAG_USERIOCTL))
senderr(EINVAL);
key.pfrkt_flags |= PFR_TFLAG_ACTIVE;
-   p = RB_FIND(pfr_ktablehead, &pfr_ktables, &key);
+   p = pfr_create_ktable(&key.pfrkt_t, tzero, 0,
+   !(flags & PFR_FLAG_USERIOCTL));
+   if (p == NULL)
+   senderr(ENOMEM);
+
+   /*
+* Note: we also pre-allocate a root table here. We keep it
+* at ->pfrkt_root, which we must not forget about.
+*/
+   key.pfrkt_flags = 0;
+   memset(key.pfrkt_anchor, 0, sizeof(key.pfrkt_anchor));
+   p->pfrkt_root = pfr_create_ktable(&key.pfrkt_t, 0, 0,
+   !(flags & PFR_FLAG_USERIOCTL));
+   if (p->pfrkt_root == NULL) {
+   pfr_destroy_ktable(p, 0);
+   senderr(ENOMEM);
+   }
+
+   SLIST_FOREACH(q, &auxq, pfrkt_workq) {
+   

Re: [External] : net lock priority

2022-05-08 Thread Alexandr Nedvedicky
Hello,

On Fri, May 06, 2022 at 09:50:30PM +0200, Alexander Bluhm wrote:
> Hi,
> 
> When creating network load by forwarding packets, SSH gets unusable
> and ping time gets above 10 seconds.
> 
> Problem is that while multiple forwarding threads are running with
> shared net lock, the exclusive lock cannot be acquired.  This is
> unfair.
> 
> Diff below prevents that a read lock is granted when another thread
> is waiting for the exclusive lock.  With that ping time stays under
> 300 ms.
> 
> Does this read write lock prio change make sense?

I can confirm this diff helps. And if I understand things right,
we basically let all readers take a slow path via a sleep
if there is a writer waiting (think of it as putting readers
behind the waiting writer in a queue).

I was afraid the change could trade writer starvation for reader
starvation. However I think that is not the case after seeing
rw_do_exit(), where we zero out the lock after the writer does rw_exit().
This basically gives an equal chance to all readers/writers to acquire
the lock.

not my department, however I would say the change looks good to me.

OK sashan@ (given my OK counts in this area)


> 
> bluhm
> 
> Index: kern/kern_rwlock.c
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/kern/kern_rwlock.c,v
> retrieving revision 1.47
> diff -u -p -r1.47 kern_rwlock.c
> --- kern/kern_rwlock.c	8 Feb 2021 08:18:45 -0000	1.47
> +++ kern/kern_rwlock.c	6 May 2022 12:08:01 -0000
> @@ -81,7 +81,7 @@ static const struct rwlock_op {
>   },
>   {   /* RW_READ */
>   RWLOCK_READ_INCR,
> - RWLOCK_WRLOCK,
> + RWLOCK_WRLOCK | RWLOCK_WRWANT,
>   RWLOCK_WAIT,
>   0,
>   PLOCK
> @@ -103,7 +103,7 @@ rw_enter_read(struct rwlock *rwl)
>  {
>   unsigned long owner = rwl->rwl_owner;
>  
> - if (__predict_false((owner & RWLOCK_WRLOCK) ||
> + if (__predict_false((owner & (RWLOCK_WRLOCK | RWLOCK_WRWANT)) ||
>   rw_cas(&rwl->rwl_owner, owner, owner + RWLOCK_READ_INCR)))
>   rw_enter(rwl, RW_READ);
>   else {
> 



Re: [External] : Re: pf.conf(5) clarify ICMP sloppy state handling

2022-05-08 Thread Alexandr Nedvedicky
Hello,

On Sun, May 08, 2022 at 08:06:57PM +0200, Alexander Bluhm wrote:
> On Sun, May 08, 2022 at 06:37:57PM +0200, Alexandr Nedvedicky wrote:
> > this tiny update to pf.conf(5) has been prompted here [1] on
> > pf mailing list. By default only ICMP queries are allowed
> > to create state in pf(4). The sloppy option relaxes that
> > so also ICMP replies can create a state. I think this should
> > be also mentioned in pf.conf(5)
> >
> > OK to my suggestion below?
> 
> I would make it a bit shorter.  pf.conf(5) is very long already.
> 
> With this option ICMP replies can create states.
> 
> Does this describe everything?

yes, it does. I like it. Updated diff below.

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/share/man/man5/pf.conf.5 b/share/man/man5/pf.conf.5
index fe4b117994a..e4af2a37c5e 100644
--- a/share/man/man5/pf.conf.5
+++ b/share/man/man5/pf.conf.5
@@ -2186,6 +2186,7 @@ It cannot be used with
 .Cm modulate state
 or
 .Cm synproxy state .
+With this option ICMP replies can create states.
 .It Ar timeout seconds
 Changes the
 .Ar timeout



Re: [External] : net lock priority

2022-05-08 Thread Alexandr Nedvedicky
Hello,
On Sun, May 08, 2022 at 07:31:46PM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> On Fri, May 06, 2022 at 09:50:30PM +0200, Alexander Bluhm wrote:
> > Hi,
> > 
> > When creating network load by forwarding packets, SSH gets unusable
> > and ping time gets above 10 seconds.
> > 
> > Problem is that while multiple forwarding threads are running with
> > shared net lock, the exclusive lock cannot be acquired.  This is
> > unfair.
> > 
> > Diff below prevents that a read lock is granted when another thread
> > is waiting for the exclusive lock.  With that ping time stays under
> > 300 ms.
> > 
> > Does this read write lock prio change make sense?
> 
> I can confirm this diff helps. And if I understand things right,
> then we basically let all readers to take a slow path via a sleep,
> if there is a writer waiting (think of like putting readers
> behind the waiting writer in queue).
> 
> I was afraid the change can trade writer starvation for reader
> starvation. However I think it is not the case after seeing 
> rw_do_exit(), where we zero out lock after writer does rw_exit().
> this basically gives equal chance to all readers/writers to acquire
> the lock.
> 
> not my department, however I would say change looks good to me.
> 

I was perhaps too fast in my judgement. I think we should also
look more closely at rw_do_exit(): 

335 /* membar_exit_before_atomic() has to precede call of this function. */
336 void
337 rw_do_exit(struct rwlock *rwl, unsigned long wrlock)
338 {
339 unsigned long owner, set;
340
341 do {
342 owner = rwl->rwl_owner;
343 if (wrlock)
344 set = 0;
345 else
346 set = (owner - RWLOCK_READ_INCR) &
347 ~(RWLOCK_WAIT|RWLOCK_WRWANT);
348 } while (__predict_false(rw_cas(&rwl->rwl_owner, owner, set)));
349
350 if (owner & RWLOCK_WAIT)
351 wakeup(rwl);
352

my question is: why do we reset the RWLOCK_WAIT and RWLOCK_WRWANT flags?
I think those flags should only be reset when the last reader is gone.
Perhaps the else branch for the reader requires this:

else {
set = (owner - RWLOCK_READ_INCR) &
~(RWLOCK_WAIT|RWLOCK_WRWANT);
if (set != 0)
set |= (owner & RWLOCK_MASK);
}


thanks and
regards
sashan



Re: [External] : net lock priority

2022-05-08 Thread Alexandr Nedvedicky
Hello,

On Sun, May 08, 2022 at 10:13:20PM +0200, Alexander Bluhm wrote:
> On Sun, May 08, 2022 at 09:19:23PM +0200, Alexander Bluhm wrote:
> > I will run my tests with the diff below.
> 
> With the third chunk reboot hangs during reordering libraries in
> vmmaplk.  So this needs more thought.

   rw_do_exit() looks odd:

335 /* membar_exit_before_atomic() has to precede call of this function. */
336 void
337 rw_do_exit(struct rwlock *rwl, unsigned long wrlock)
338 {
339 unsigned long owner, set;
340 
341 do {
342 owner = rwl->rwl_owner;
343 if (wrlock)
344 set = 0;
345 else
346 set = (owner - RWLOCK_READ_INCR) & ~RWLOCK_WAIT;
347 } while (__predict_false(rw_cas(&rwl->rwl_owner, owner, set)));
348 
349 if (owner & RWLOCK_WAIT)
350 wakeup(rwl);
351 }

what bothers me is the situation where there is
more than one reader. Line 350 is executed by
the first reader which drops the lock, so the
processes woken up by wakeup(rwl) are going to find
the lock still occupied by the remaining readers.

regards
sashan 



Re: [External] : net lock priority

2022-05-09 Thread Alexandr Nedvedicky
Hello,

On Mon, May 09, 2022 at 04:34:00PM +0200, Alexander Bluhm wrote:
> On Sun, May 08, 2022 at 10:54:01PM +0200, Alexandr Nedvedicky wrote:
> > what bothers me is the situation where there are
> > more than one reader. The line 350 is executed by
> > the first reader which drops the lock. So the process
> > woken up by wakeup(rwl) are going to find out the
> > lock is still occupied by remaining readers.
> 
> wakeup() activates all sleepers.  They should check and sleep again.
> Maybe a little bit of wasted resources, but I don't see a severe
> problem.

I came to the same conclusion.

> 
> I did a little digging in history.  In rev 1.3 of kern_rwlock.c
> we had case RW_READ: need_wait = RWLOCK_WRLOCK | RWLOCK_WRWANT;
> and the commet "Let writers through before obtaining read lock."
> 
> The comment
>  * RW_READ  RWLOCK_WRLOCK|RWLOCK_WRWANT may not be set. We increment
>  *  with RWLOCK_READ_INCR. RWLOCK_WAIT while waiting.
> is still there, just the RWLOCK_WRWANT got lost from the condition.
> 
> So I think we should get back the original reader writer priority.
> 
> Regarding the race in rw_do_exit() that sashan@ found I also found
> a comment in rev 1.7.
> 
> /*
>  * Potential MP race here. If the owner had WRWANT set we cleared
>  * it and a reader can sneak in before a writer. Do we care?
>  */
> 
> I do not want to change anything to that behavior now.  There is
> no easy fix and I did not see the problem during testing.  But we
> can put the comment back to clarify the situation.
> 
> ok?

this certainly works for me. 

OK sashan



Re: [External] : Re: move memory allocation in pfr_add_tables() outside of NET_LOCK()/PF_LOCK()

2022-05-09 Thread Alexandr Nedvedicky
Hello,

thanks for taking a look.


> > +   SLIST_FOREACH(q, &auxq, pfrkt_workq) {
> > +   if (!pfr_ktable_compare(p, q)) {
> > +   /*
> > +* We need no lock here, because `p` is empty,
> > +* there are no rules or shadow tables
> > +* attached.
> > +*/
> > +   pfr_destroy_ktable(p->pfrkt_root, 0);
> > +   p->pfrkt_root = NULL;
> > +   pfr_destroy_ktable(p, 0);
> > +   break;
> This break...
> > +   }
> > +   }
> ... end here
> > +
> >
> ... and then we insert a destroyed p

yes. you are right. new diff addresses that with change as follows:

@@ -1542,9 +1542,8 @@ pfr_add_tables(struct pfr_table ...)
pfr_destroy_ktable(p, 0);
break;
}
+   SLIST_INSERT_HEAD(&auxq, p, pfrkt_workq);
}
-
-   SLIST_INSERT_HEAD(&auxq, p, pfrkt_workq);
}
 
/*



> > +   SLIST_REMOVE(&auxq, n, pfr_ktable, pfrkt_workq);
> > +   SLIST_INSERT_HEAD(&addq, n, pfrkt_workq);
> > +   } else if (!(flags & PFR_FLAG_DUMMY)) {
> 
> I compared the old and new code to see if it is equivalent.
> Before the condtion looked like this.

very good point. I think this what needs to be done:

@@ -1558,7 +1558,8 @@ pfr_add_tables(struct pfr_table *tbl, ...)
if (p == NULL) {
SLIST_REMOVE(&auxq, n, pfr_ktable, pfrkt_workq);
SLIST_INSERT_HEAD(&addq, n, pfrkt_workq);
-   } else if (!(flags & PFR_FLAG_DUMMY)) {
+   } else if (!(flags & PFR_FLAG_DUMMY) &&
+   !(p->pfrkt_flags & PFR_TFLAG_ACTIVE)) {
p->pfrkt_nflags = (p->pfrkt_flags &
~PFR_TFLAG_USRMASK) | key.pfrkt_flags;
SLIST_INSERT_HEAD(&changeq, p, pfrkt_workq);



> > +   SLIST_FOREACH(r, &addq, pfrkt_workq) {
> > +   if (!pfr_ktable_compare(r, q)) {
> > +   /*
> > +* `r` is our root table we've found
> > +* earlier, `q` can get dropped.
> > +*/
> > +   p->pfrkt_root = r;
> > +   SLIST_INSERT_HEAD(&auxq, q,
> > +   pfrkt_workq);
> > +   continue;
> This continue goes to the r list, but I think you want to continue p list.
> > }
> > }

yes, exactly. we want to continue with outer loop if we break from inner
one. this is what I want to do:

@@ -1617,9 +1618,12 @@ pfr_add_tables(struct pfr_table *tbl, ...)
p->pfrkt_root = r;
SLIST_INSERT_HEAD(&auxq, q,
pfrkt_workq);
-   continue;
+   break;
}
}
+   if (r != SLIST_END(&addq))
+   continue;
+
q->pfrkt_rs = pf_find_or_create_ruleset(
q->pfrkt_root->pfrkt_anchor);
/*


updated diff is below.

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index 8315b115474..1f036e1368f 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -2217,12 +2217,8 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
error = ENODEV;
goto fail;
}
-   NET_LOCK();
-   PF_LOCK();
error = pfr_add_tables(io->pfrio_buffer, io->pfrio_size,
&io->pfrio_nadd, io->pfrio_flags | PFR_FLAG_USERIOCTL);
-   PF_UNLOCK();
-   NET_UNLOCK();
break;
}
 
diff --git a/sys/net/pf_table.c b/sys/net/pf_table.c
index fb23bcabe04..4dd49ea3550 100644
--- a/sys/net/pf_table.c
+++ b/sys/net/pf_table.c
@@ -189,6 +189,7 @@ void pfr_clstats_ktable(struct 
pfr_ktable *, time_t, int);
 struct pfr_ktable  *pfr_create_ktable(struct pfr_table *, time_t, int,
int);
 voidpfr_destroy_ktables(struct pfr_ktableworkq *, int);
+voidpfr_destroy_ktables_aux(struct pfr_ktableworkq *);
 

Re: [External] : Re: pf.conf(5) clarify ICMP sloppy state handling

2022-05-09 Thread Alexandr Nedvedicky
Hello,

I'm sorry I was too fast with commit. I've just committed
what's been suggested by bluhm@:

@@ -2186,6 +2186,7 @@ It cannot be used with
 .Cm modulate state
 or
 .Cm synproxy state .
+With this option ICMP replies can create states.
 .It Ar timeout seconds
 Changes the
 .Ar timeout


> This is helpful, but because it's so surprising that "pass proto icmp"
> doesn't pass all icmp traffic, I think it would help to mention it where
> "proto icmp" is described too.
> 
> Also, the top of the text about "sloppy" just talks about the sloppy
> TCP connection tracker, I think perhaps it would be better to lead
> with something that suggests it has multiple functions for different
> protocols?

I don't object to any of your enhancements.

reads OK sashan

> 
> Index: man5/pf.conf.5
> ===================================================================
> RCS file: /cvs/src/share/man/man5/pf.conf.5,v
> retrieving revision 1.594
> diff -u -p -r1.594 pf.conf.5
> --- man5/pf.conf.5	9 May 2022 20:29:23 -0000	1.594
> +++ man5/pf.conf.5	9 May 2022 21:05:48 -0000
> @@ -594,6 +594,13 @@ or
>  .Pc
>  must match.
>  .Pp
> +ICMP responses are not permitted unless they either match an
> +existing request, or unless
> +.Cm no state
> +or
> +.Cm keep state (sloppy)
> +is specified.
> +.Pp
>  .It Cm label Ar string
>  Adds a label to the rule, which can be used to identify the rule.
>  For instance,
> @@ -2177,7 +2184,7 @@ States created by this rule are exported
>  .Xr pflow 4
>  interface.
>  .It Cm sloppy
> -Uses a sloppy TCP connection tracker that does not check sequence
> +For TCP, uses a sloppy connection tracker that does not check sequence
>  numbers at all, which makes insertion and ICMP teardown attacks way
>  easier.
>  This is intended to be used in situations where one does not see all
> @@ -2186,7 +2193,8 @@ It cannot be used with
>  .Cm modulate state
>  or
>  .Cm synproxy state .
> -With this option ICMP replies can create states.
> +For ICMP, this option allows states to be created from replies,
> +not just requests.
>  .It Ar timeout seconds
>  Changes the
>  .Ar timeout
> 



Re: [External] : Re: move memory allocation in pfr_add_tables() outside of NET_LOCK()/PF_LOCK()

2022-05-09 Thread Alexandr Nedvedicky
Hello,

On Tue, May 10, 2022 at 12:18:15AM +0200, Alexander Bluhm wrote:
> On Mon, May 09, 2022 at 11:11:03PM +0200, Alexandr Nedvedicky wrote:
> > > ... and then we insert a destroyed p
> >
> > yes. you are right. new diff addresses that with change as follows:
> >
> > @@ -1542,9 +1542,8 @@ pfr_add_tables(struct pfr_table ...)
> > pfr_destroy_ktable(p, 0);
> > break;
> > }
> > +   SLIST_INSERT_HEAD(&auxq, p, pfrkt_workq);
> This inserts p each time you run over q list.
> 
> Should we do it like this?  It is similar to your solution at the
> other loop.
> 
> SLIST_FOREACH(q, &auxq, pfrkt_workq) {
> if (!pfr_ktable_compare(p, q)) {
> /*
>  * We need no lock here, because `p` is empty,
>  * there are no rules or shadow tables
>  * attached.
>  */
> pfr_destroy_ktable(p->pfrkt_root, 0);
> p->pfrkt_root = NULL;
> pfr_destroy_ktable(p, 0);
> break;
> }
> }
>   if (q != NULL)
>   continue;
> SLIST_INSERT_HEAD(&auxq, p, pfrkt_workq);

yes, this is exactly the code I want.
> 
> 
> > > I compared the old and new code to see if it is equivalent.
> > > Before the condtion looked like this.
> >
> > very good point. I think this what needs to be done:
> >
> > @@ -1558,7 +1558,8 @@ pfr_add_tables(struct pfr_table *tbl, ...)
> > if (p == NULL) {
> > SLIST_REMOVE(&auxq, n, pfr_ktable, pfrkt_workq);
> > SLIST_INSERT_HEAD(&addq, n, pfrkt_workq);
> > -   } else if (!(flags & PFR_FLAG_DUMMY)) {
> > +   } else if (!(flags & PFR_FLAG_DUMMY) &&
> I guess PFR_FLAG_DUMMY check is an optimization.

not really. I don't want to alter any flags on existing tables,
when running 'pfctl -n ...'. the '-n' sets PFR_FLAG_DUMMY.

> > +   !(p->pfrkt_flags & PFR_TFLAG_ACTIVE)) {
> Indent should be one tab to the left.
> > p->pfrkt_nflags = (p->pfrkt_flags &
> > ~PFR_TFLAG_USRMASK) | key.pfrkt_flags;
> > SLIST_INSERT_HEAD(&changeq, p, pfrkt_workq);
> 
> Old code had this to avoid duplicate entries.  Do we need it?
> SLIST_FOREACH(q, &changeq, pfrkt_workq)
> if (!pfr_ktable_compare(&key, q))
> goto _skip;

the duplicate entries are removed in the first loop, which copies
data in. So we don't need this check here.

> 
> > > This continue goes to the r list, but I think you want to continue p list.
> > > > }
> > > > }
> >
> > yes, exactly. we want to continue with outer loop if we break from inner
> > one. this is what I want to do:
> >
> > @@ -1617,9 +1618,12 @@ pfr_add_tables(struct pfr_table *tbl, ...)
> > p->pfrkt_root = r;
> > SLIST_INSERT_HEAD(&auxq, q,
> > pfrkt_workq);
> > -   continue;
> > +   break;
> > }
> > }
> > +   if (r != SLIST_END(&addq))
> Could you use if (r != NULL) ?  Noone uses SLIST_END macros.
> > +   continue;
> > +

sure. 

updated diff is below.

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index 8315b115474..1f036e1368f 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -2217,12 +2217,8 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
error = ENODEV;
goto fail;
}
-   NET_LOCK();
-   PF_LOCK();
error = pfr_add_tables(io->pfrio_buffer, io->pfrio_size,
&io->pfrio_nadd, io->pfrio_flags | PFR_FLAG_USERIOCTL);
-   PF_UNLOCK();
-   NET_UNL

Re: [External] : divert packet mutex

2022-05-05 Thread Alexandr Nedvedicky
Hello,

On Wed, May 04, 2022 at 10:50:15PM +0200, Alexander Bluhm wrote:
> Hi,
> 
> I missed that pf divert packet is not MP safe yet.  The problem is
> that divert_packet() is called from pf with shared net lock and
> sbappendaddr() needs exclusive net lock.
> 
> The direct call from pf in IP layer to divert in protocol layer is
> not nice.  I not sure how to address that.

one way around it would be to dispatch diverted packets
to a task. we do the same for route-to.

> 
> As a first step clean up divert_packet():
> - the function never returns an error
> - call variables so and sin like everywhere else
> - use goto bad for error handling
> - fix error counter
> - introduce mutex and refcounting for inp like in the other pcb
>   functions
> 
> Divert packet is still not MP safe, I will fix it later.

change looks good to me.

OK sashan



Re: [External] : Re: another syzkaller problem in pf

2022-05-04 Thread Alexandr Nedvedicky
On Wed, May 04, 2022 at 04:26:18PM +0200, Alexander Bluhm wrote:
> On Wed, May 04, 2022 at 02:21:11PM +0200, Alexandr Nedvedicky wrote:
> > I'm not sure flipping a flag is a right change. In general we don't want
> > to hold NET_LOCK()/PF_LOCK() while waiting for memory.
> 
> - We must not wait for memory when in the packet processing hot path.
>   Drop the packet instead.
> 
> - We should not wait for memory when holding pf lock.  Then we can
>   replace rw lock with a mutex or something else later.
> 
> - Waiting for memory with net lock is unavoidable.  We have to
>   wait when coming from system call.  Doing preallocation may be
>   possible in some cases but code may get too complicated elsewhere.
> 
> If getting the allocation out of the locks is doable here, it could
> be the best solution.

I think it's doable. I'll try to finish the diff for tables and start
sending diffs hopefully next week.

regards
sashan



Re: fix NAT round-robin and random

2022-08-02 Thread Alexandr Nedvedicky
Hello,

On Mon, Aug 01, 2022 at 06:37:58PM +0200, Hrvoje Popovski wrote:
> On 20.7.2022. 22:27, Alexandr Nedvedicky wrote:
> > Hello,
> > 
> > below is a final version of patch for NAT issue discussed at bugs@ [1].
> > Patch below is updated according to feedback I got from Chris, claudio@
> > and hrvoje@.
> > 
> > The summary of changes is as follows:
> > 
> > - prevent infinite loop when packet hits NAT rule as follows:
> > pass out on em0 from 172.16.0.0/16 to any nat-to { 49/27 }
> > the issue has been introduced by my earlier commit [2]. The earlier
> > change makes pf(4) to interpret 49/27 as single IP address 
> > (POOL_NONE)
> > this is wrong, because pool 49/27 actually contains 32 addresses.
> > 
> > - while investigating the issue I've realized 'random' pool should
> >   rather be using arc4random_uniform() with upper limit derived from mask.
> >   also the random number should be turned to netorder.
> > 
> > - also while I was debugging my change I've noticed we should be using
> >   pf_poolmask() to obtain address as a combination of pool address
> >   and result of generator (round-robin or random).
> > 
> > OK to commit?
> > 
> > thanks and
> > regards
> > sashan
> > 
> > 
> > [1] https://marc.info/?t=16581336821=1=2
> > https://marc.info/?t=16573254651=1=2
> > https://marc.info/?l=openbsd-bugs=165817500514813=2
> > 
> > [2] https://marc.info/?l=openbsd-cvs=164500117319660=2
> 
> 
> Hi all,
> 
> I've tested this diff and from what I see NAT behaves as it should and
> it's changing ip addresses quite nicely
> 
> 

thank you Hrvoje for carrying out an independent test. I'll commit the diff
later today if there are no objections.

thanks and
regards
sashan



Re: make kernel build without INET6 again (pf_lb.c)

2022-08-30 Thread Alexandr Nedvedicky
sure, looks good to me.

OK sashan

On Tue, Aug 30, 2022 at 09:45:17PM +0200, Sebastian Benoit wrote:
> ok?
> 
> diff --git sys/net/pf_lb.c sys/net/pf_lb.c
> index 588115cbff7..905af42e463 100644
> --- sys/net/pf_lb.c
> +++ sys/net/pf_lb.c
> @@ -519,13 +519,18 @@ pf_map_addr(sa_family_t af, struct pf_rule *r, struct 
> pf_addr *saddr,
>* fall back to POOL_NONE if there is a single host
>* address in pool.
>*/
> - if ((af == AF_INET &&
> - rmask->addr32[0] == INADDR_BROADCAST) ||
> - (af == AF_INET6 &&
> - IN6_ARE_ADDR_EQUAL(&rmask->v6, &in6mask128))) {
> + if (af == AF_INET &&
> + rmask->addr32[0] == INADDR_BROADCAST) {
>   pf_addrcpy(naddr, raddr, af);
>   break;
>   }
> +#ifdef INET6
> + if (af == AF_INET6 &&
> + IN6_ARE_ADDR_EQUAL(&rmask->v6, &in6mask128)) {
> + pf_addrcpy(naddr, raddr, af);
> + break;
> + }
> +#endif
>   } else if (pf_match_addr(0, raddr, rmask, &rpool->counter, af))
>   return (1);
>  
> 



Re: pf fragment lock

2022-08-22 Thread Alexandr Nedvedicky
Hello,

On Mon, Aug 22, 2022 at 08:45:29PM +0200, Alexander Bluhm wrote:
> Hi,
> 
> Hrvoje managed to crash the kernel in pf fragment reassembly.
> 
> > r620-1# pfctl -e
> > pf enabled
> > r620-1# pfctl -f /etc/pf.conf
> > uvm_fault(0x824b9278, 0xb7, 0, 2) -> e
> > kernel: page fault trap, code=0
> > Stopped at  pf_free_fragment+0x77:  movq%rax,0xb8(%rcx)
> > TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> >  350110  79921  00x13  02K tcpbench
> > *301306  98239  0 0x14000  0x2004  softnet
> >   55791   2358  0 0x14000  0x2003  softnet
> >  176238  13130  0 0x14000  0x2001  softnet
> >   66977  54316  0 0x14000  0x2005  systq
> >  165986  42679  0 0x14000 0x42000  softclock
> > pf_free_fragment(fd83a5022010) at pf_free_fragment+0x77
> > pf_create_fragment(800022d717ce) at pf_create_fragment+0xc8
> > pf_reassemble6(800022d71708,800022d71648,30,0,1,800022d717ce) at
> > pf_reassemble6+0x51
> > pf_normalize_ip6(800022d716c8,800022d717ce) at pf_normalize_ip6+0x8a
> > pf_test(18,1,80095048,800022d71978) at pf_test+0x30d
> > ip6_input_if(800022d71978,800022d71984,29,0,80095048) at
> > ip6_input_if+0x1ae
> > ipv6_input(80095048,fd80a3f1dc00) at ipv6_input+0x39
> > ether_input(80095048,fd80a3f1dc00) at ether_input+0x3b1
> > if_input_process(80095048,800022d71a68) at if_input_process+0x6f
> > ifiq_process(80099900) at ifiq_process+0x69
> > taskq_thread(80032200) at taskq_thread+0x11a
> > end trace frame: 0x0, count: 4
> > https://www.openbsd.org/ddb.html describes the minimum info required in
> > bug reports.  Insufficient info makes it difficult to find and fix bugs.
> > ddb{4}>
> 
> It crashes here in pf_free_fragment()
> 266  TAILQ_REMOVE(&pf_fragqueue, frag, frag_next);
> 
> Putting a pf frag lock into pf_create_fragment() around
> pf_flush_fragments() does not look sufficient.  The pf_nfrents++
> also needs protection.  So I moved the lock around pf_reassemble().
> 
> ok?
> 

fix looks good to me.
thanks for taking care of it.

OK sashan



yet another follow-up to pfr_add_tables()

2022-10-04 Thread Alexandr Nedvedicky
Hello,

diff below fixes a tiny glitch I introduced with
commit [1] back in May. Fortunately the impact is hardly
noticeable (I think).

The current code reads as follows:
1495 pfr_add_tables(struct pfr_table *tbl, int size, int *nadd, int flags)
1496 {

1507 for (i = 0; i < size; i++) {
1508 YIELD(flags & PFR_FLAG_USERIOCTL);
1509 if (COPYIN(tbl+i, &key.pfrkt_t, sizeof(key.pfrkt_t), flags))
1510 senderr(EFAULT);
1511 if (pfr_validate_table(&key.pfrkt_t, PFR_TFLAG_USRMASK,
1512 flags & PFR_FLAG_USERIOCTL))
1513 senderr(EINVAL);
1514 key.pfrkt_flags |= PFR_TFLAG_ACTIVE;
1515 p = pfr_create_ktable(&key.pfrkt_t, tzero, 0,
1516 (flags & PFR_FLAG_USERIOCTL? PR_WAITOK : PR_NOWAIT));
1517 if (p == NULL)
1518 senderr(ENOMEM);
1519 
1520 /*
1521  * Note: we also pre-allocate a root table here. We keep it
1522  * at ->pfrkt_root, which we must not forget about.
1523  */
1524 key.pfrkt_flags = 0;

1551 }
1552 
1553 /*
1554  * auxq contains freshly allocated tables with no dups.
1555  * also note there are no rulesets attached, because
1556  * the attach operation requires PF_LOCK().
1557  */
1558 NET_LOCK();
1559 PF_LOCK();
1560 SLIST_FOREACH_SAFE(n, &auxq, pfrkt_workq, w) {
1561 p = RB_FIND(pfr_ktablehead, &pfr_ktables, n);
1562 if (p == NULL) {
1563 SLIST_REMOVE(&auxq, n, pfr_ktable, pfrkt_workq);
1564 SLIST_INSERT_HEAD(&addq, n, pfrkt_workq);
1565 xadd++;
1566 } else if (!(flags & PFR_FLAG_DUMMY) &&
1567 !(p->pfrkt_flags & PFR_TFLAG_ACTIVE)) {
1568 p->pfrkt_nflags = (p->pfrkt_flags &
1569 ~PFR_TFLAG_USRMASK) | key.pfrkt_flags;
1570 SLIST_INSERT_HEAD(&changeq, p, pfrkt_workq);
1571 }
1572 }

at line 1569 the local variable 'key.pfrkt_flags' is always 0, because
it is set to 0 at line 1524, so the OR has no effect. We should be
using 'n->pfrkt_flags' at line 1569 instead.

It took me a while to figure out when the else branch at line 1568
gets executed. If I understand the code right, the branch handles an
edge case where the table already exists, but the transaction which
inserts/creates the table has not been committed yet.
For example, consider pfctl creating a table 'test' via
pfr_ina_define() (on behalf of a transaction when it loads pf.conf).
If the pfctl process gets terminated for whatever reason before the
transaction is committed, we might be left with an inactive table
hanging in the pf(4) driver. Such a table is never reported by
'pfctl -sT' but is still present in the pfr_ktables tree.

There is a pf_tbltrans.c test program which emulates this theoretical
failure of pfctl. It creates a table via a transaction (like parse.y
does) but never commits the transaction. This is what happens when we
run the commands below on current pf(4):
netlock# ./pf_tbltrans test
netlock# pfctl -t test -T add 192.168.1.0/24
pfctl: Table does not exist
netlock# pfctl -FT
0 tables deleted.
netlock# pfctl -t test -T add 192.168.1.0/24
pfctl: Table does not exist
Here 'pfctl -t test -T add 192.168.1.0/24' fails because it is not able
to add an address to the inactive (not committed) table 'test'. When we
replay the same scenario on a system with the diff applied, the story
looks as follows:
netlock# ./pf_tbltrans test
netlock# pfctl -t test -T add 192.168.1.0/24
1/1 addresses added.
The reason it works is that the fixed line 1569 kicks in and the
ACTIVE flag gets set, which makes the table active ('visible') to
all other operations.

OK to commit?

thanks and
regards
sashan

[1] https://marc.info/?l=openbsd-cvs=16522243003=2

8<---8<-8<---8<-
diff --git a/sys/net/pf_table.c b/sys/net/pf_table.c
index f537aac2387..f6d5c355b1f 100644
--- a/sys/net/pf_table.c
+++ b/sys/net/pf_table.c
@@ -1566,7 +1566,7 @@ pfr_add_tables(struct pfr_table *tbl, int size, int 
*nadd, int flags)
} else if (!(flags & PFR_FLAG_DUMMY) &&
!(p->pfrkt_flags & PFR_TFLAG_ACTIVE)) {
p->pfrkt_nflags = (p->pfrkt_flags &
-   ~PFR_TFLAG_USRMASK) | key.pfrkt_flags;
+   ~PFR_TFLAG_USRMASK) | n->pfrkt_flags;
SLIST_INSERT_HEAD(&changeq, p, pfrkt_workq);
}
}
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

int
main(int argc, const char *argv[])
{
int pf;
 

Re: pf.conf / scrub resulting in invalid checksum

2022-10-10 Thread Alexandr Nedvedicky
Hello,

On Mon, Oct 10, 2022 at 06:52:00AM +0200, Bjorn Ketelaars wrote:

> 
> (reply also send to tech@)
> 
> In 2011 henning@ removed fiddling with the ip checksum of normalised
> packets in sys/net/pf_norm.c (r1.131). Rationale was that the checksum
> is always recalculated in all output paths anyway. In 2016 procter@
> reintroduced checksum modification to preserve end-to-end checksums
> (r1.189 of sys/net/pf_norm.c). Although I'm not sure, it seems as if
> somewhere in that timeslot checksum recalculation of normalised packets
> was broken.
> 
> Issue got caught as net/mcast-proxy strictly adheres to RFC2236, which
> states that "When receiving packets, the checksum MUST be verified
> before processing a packet". After scrubbing a packet the checksum
> becomes invalid thus failing verification by net/mcast-proxy.
> 
> I found two workarounds:
> 1.) rip out checksum verification from net/mcast-proxy;
> 2.) don't scrub packets with, e.g., id-random and/or no-df set.
> 
> However, proposed solution is to fix this in pf. Diff below fixes the
> issue at hand.
> 
> Comments/OK?

diff reads good to me. change makes sense in my opinion.

OK sashan


> 
> 
> Index: sys/net/pf.c
> ===
> RCS file: /cvs/src/sys/net/pf.c,v
> retrieving revision 1.1140
> diff -u -p -r1.1140 pf.c
> --- sys/net/pf.c  3 Sep 2022 19:22:19 -   1.1140
> +++ sys/net/pf.c  10 Oct 2022 03:22:06 -
> @@ -164,7 +164,7 @@ void   pf_add_threshold(struct 
> pf_thres
>  int   pf_check_threshold(struct pf_threshold *);
>  int   pf_check_tcp_cksum(struct mbuf *, int, int,
>   sa_family_t);
> -static __inline void  pf_cksum_fixup(u_int16_t *, u_int16_t, u_int16_t,
> +__inline void pf_cksum_fixup(u_int16_t *, u_int16_t, 
> u_int16_t,
>   u_int8_t);
>  void  pf_cksum_fixup_a(u_int16_t *, const struct pf_addr *,
>   const struct pf_addr *, sa_family_t, u_int8_t);
> @@ -1937,7 +1937,7 @@ pf_addr_wrap_neq(struct pf_addr_wrap *aw
>   * Note: this serves also as a reduction step for at most one add (as the
>   * trailing mod 2^16 prevents further reductions by destroying carries).
>   */
> -static __inline void
> +__inline void
>  pf_cksum_fixup(u_int16_t *cksum, u_int16_t was, u_int16_t now,
>  u_int8_t proto)
>  {
> Index: sys/net/pf_norm.c
> ===
> RCS file: /cvs/src/sys/net/pf_norm.c,v
> retrieving revision 1.224
> diff -u -p -r1.224 pf_norm.c
> --- sys/net/pf_norm.c 22 Aug 2022 20:35:39 -  1.224
> +++ sys/net/pf_norm.c 10 Oct 2022 03:22:06 -
> @@ -1646,14 +1646,21 @@ pf_scrub(struct mbuf *m, u_int16_t flags
>  #ifdef INET6
>   struct ip6_hdr  *h6 = mtod(m, struct ip6_hdr *);
>  #endif   /* INET6 */
> + u_int16_told;
>  
>   /* Clear IP_DF if no-df was requested */
> - if (flags & PFSTATE_NODF && af == AF_INET && h->ip_off & htons(IP_DF))
> + if (flags & PFSTATE_NODF && af == AF_INET && h->ip_off & htons(IP_DF)) {
> + old = h->ip_off;
>   h->ip_off &= htons(~IP_DF);
> + pf_cksum_fixup(&h->ip_sum, old, h->ip_off, 0);
> + }
>  
>   /* Enforce a minimum ttl, may cause endless packet loops */
> - if (min_ttl && af == AF_INET && h->ip_ttl < min_ttl)
> + if (min_ttl && af == AF_INET && h->ip_ttl < min_ttl) {
> + old = h->ip_ttl;
>   h->ip_ttl = min_ttl;
> + pf_cksum_fixup(&h->ip_sum, old, h->ip_ttl, 0);
> + }
>  #ifdef INET6
>   if (min_ttl && af == AF_INET6 && h6->ip6_hlim < min_ttl)
>   h6->ip6_hlim = min_ttl;
> @@ -1661,8 +1668,11 @@ pf_scrub(struct mbuf *m, u_int16_t flags
>  
>   /* Enforce tos */
>   if (flags & PFSTATE_SETTOS) {
> - if (af == AF_INET)
> + if (af == AF_INET) {
> + old = *(u_int16_t *)h;
>   h->ip_tos = tos | (h->ip_tos & IPTOS_ECN_MASK);
> + pf_cksum_fixup(&h->ip_sum, old, *(u_int16_t *)h, 0);
> + }
>  #ifdef INET6
>   if (af == AF_INET6) {
>   /* drugs are unable to explain such idiocy */
> @@ -1674,6 +1684,9 @@ pf_scrub(struct mbuf *m, u_int16_t flags
>  
>   /* random-id, but not for fragments */
>   if (flags & PFSTATE_RANDOMID && af == AF_INET &&
> - !(h->ip_off & ~htons(IP_DF)))
> + !(h->ip_off & ~htons(IP_DF))) {
> + old = h->ip_id;
>   h->ip_id = htons(ip_randomid());
> + pf_cksum_fixup(&h->ip_sum, old, h->ip_id, 0);
> + }
>  }
> Index: sys/net/pfvar.h
> ===
> RCS file: /cvs/src/sys/net/pfvar.h,v
> retrieving revision 1.510
> diff -u -p -r1.510 pfvar.h
> --- sys/net/pfvar.h   3 Sep 2022 14:57:54 -   

Re: store pf rules in a tree

2022-08-10 Thread Alexandr Nedvedicky
Hello,


On Wed, Aug 10, 2022 at 02:38:16PM +, Stefan Butz wrote:
> Hi everyone,
> 
> this mail includes a patch to store pf rules in a red-black tree.
> Currently they are stored in a linked list.
> My system configured with 16000 rules takes about 10 minutes
> to print them out using `pfctl -sr`.
> This patch decreases the time to 4 seconds.
> I was not able to measure a time increase for rule insertion.
> Inserting a lot of labels is still very slow.
> This has to be attacked separately.

took me a while to figure out why it can take so long
to retrieve 16k rules from the kernel. this is the answer:

1480 case DIOCGETRULE: {
1496 if (pr->ticket != ruleset->rules.active.ticket) {
1497 error = EBUSY;
1498 PF_UNLOCK();
1499 NET_UNLOCK();
1500 goto fail;
1501 }
1502 rule = TAILQ_FIRST(ruleset->rules.active.ptr);
1503 while ((rule != NULL) && (rule->nr != pr->nr))
1504 rule = TAILQ_NEXT(rule, entries);
1505 if (rule == NULL) {

so yes, in order to retrieve the next rule in the list we must
always start with the first (HEAD) rule and iterate to rule n+1.

so I understand the motivation to solve it using an rb-tree.

I also (as todd) have some doubts about the code in DIOCCHANGERULE.
the current code is able to insert at the desired location.
I don't understand how your new code does that:


1675 
1676 if (pcr->action == PF_CHANGE_ADD_HEAD)
1677 oldrule = RB_MIN(pf_ruletree,
1678 ruleset->rules.active.ptr);
1679 else if (pcr->action == PF_CHANGE_ADD_TAIL)
1680 oldrule = RB_MAX(pf_ruletree,
1681 ruleset->rules.active.ptr);
1682 else {
1683 oldrule = RB_MIN(pf_ruletree,
1684 ruleset->rules.active.ptr);
1685 while ((oldrule != NULL) && (oldrule->nr != pcr->nr))
1686 oldrule = RB_NEXT(pf_ruletree,
1687 ruleset->rules.active.ptr, oldrule);
1688 if (oldrule == NULL) {
1689 if (newrule != NULL)
1690 pf_rm_rule(NULL, newrule);
1691 error = EINVAL;
1692 PF_UNLOCK();
1693 NET_UNLOCK();
1694 goto fail;
1695 }
1696 }
1697 
1698 if (pcr->action == PF_CHANGE_REMOVE) {
1699 pf_rm_rule(ruleset->rules.active.ptr, oldrule);
1700 ruleset->rules.active.rcount--;
1701 } else {
1702 RB_INSERT(pf_ruletree,
1703 ruleset->rules.active.ptr, newrule);
1704 ruleset->rules.active.rcount++;
1705 newrule->ruleset = ruleset;
1706 }
1707 

how does RB_INSERT() at line 1702 insert the rule at the
desired location? Consider the current rules look like:

[ 1, 2, 3, 4 ]

i'm going to insert rule x, which should follow existing rule nr 3

[ 1, 2, 3, x, 4]

current code just re-indexes/renumbers the rules after insertion

[ 1, 2, 3, 4, 5 ], where x gets assigned 4.

I don't see how that happens with RB_INSERT().

I think what we really need to fix is the iteration to the n-th
retrieved rule we currently have in DIOCGETRULE. I'm working on a
slightly larger change which updates pf(4) transactions. I'll try to
cherry-pick some changes from my in-progress tree and factor out a
smaller diff which will solve that DIOCGETRULE itch for you. please
stay tuned.

thanks and
regards
sashan



Re: ipv4 and ipv6 fragment refactoring

2022-08-12 Thread Alexandr Nedvedicky
Hello,

On Thu, Aug 11, 2022 at 11:59:55AM +0200, Alexander Bluhm wrote:
> Hi,
> 
> ip_fragment() and ip6_fragment() do nearly the same thing, but they
> are implemented differently.
> 
> The idea of this diff is to move things around.  Then only differences
> from the IP standards but not in coding style remain.  This allows
> to compare both functions easily.

don't mind at all if things will be more consistent.
> 
> In IPv4 assume that m_pkthdr is correct and take the length from
> there, like in IPv6.
> 

would it make sense to put an assert there, at least for a while,
just to discover any surprises? something like this:

> -ip_fragment(struct mbuf *m, struct mbuf_list *fml, struct ifnet *ifp,
> +ip_fragment(struct mbuf *m0, struct mbuf_list *fml, struct ifnet *ifp,
>  u_long mtu)
>  {
> - struct ip *ip, *mhip;
> - struct mbuf *m0;
> - int len, hlen, off;
> - int mhlen, firstlen;
> + struct mbuf *m;
> + struct ip *ip;
> + int firstlen, hlen, tlen, len, off;
>   int error;
>  
>   ml_init(fml);
> - ml_enqueue(fml, m);
> + ml_enqueue(fml, m0);
>  
> - ip = mtod(m, struct ip *);
> + ip = mtod(m0, struct ip *);
>   hlen = ip->ip_hl << 2;
> + tlen = m0->m_pkthdr.len;
KASSERT(tlen == ntohs(ip->ip_len));


diff looks good to me (with or without assert).

OK sashan



Re: ip6 routing header 0 offset

2022-08-12 Thread Alexandr Nedvedicky
Hello,

On Thu, Aug 11, 2022 at 09:42:54PM +0200, Alexander Bluhm wrote:
> Hi,
> 
> The IPv6 routing header type 0 check should modify *offp only in
> case of an error, so that the generated icmp6 packet has the correct
> pointer.  After successful return, *offp should not be modified.

makes sense to me.
> 
> ok?

OK sashan

> 
> bluhm
> 
> Index: netinet6/ip6_input.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet6/ip6_input.c,v
> retrieving revision 1.250
> diff -u -p -r1.250 ip6_input.c
> --- netinet6/ip6_input.c  6 Aug 2022 15:57:59 -   1.250
> +++ netinet6/ip6_input.c  11 Aug 2022 19:36:22 -
> @@ -695,21 +695,23 @@ ip6_check_rh0hdr(struct mbuf *m, int *of
>   do {
>   switch (proto) {
>   case IPPROTO_ROUTING:
> - *offp = off;
>   if (rh_cnt++) {
>   /* more than one rh header present */
> + *offp = off;
>   return (1);
>   }
>  
>   if (off + sizeof(rthdr) > lim) {
>   /* packet to short to make sense */
> + *offp = off;
>   return (1);
>   }
>  
> >   m_copydata(m, off, sizeof(rthdr), &rthdr);
>  
>   if (rthdr.ip6r_type == IPV6_RTHDR_TYPE_0) {
> - *offp += offsetof(struct ip6_rthdr, ip6r_type);
> + *offp = off +
> + offsetof(struct ip6_rthdr, ip6r_type);
>   return (1);
>   }
>  
> 



Re: store pf rules in a tree

2022-08-12 Thread Alexandr Nedvedicky
Hello Stefan,

On Wed, Aug 10, 2022 at 02:38:16PM +, Stefan Butz wrote:
> Hi everyone,
> 
> this mail includes a patch to store pf rules in a red-black tree.
> Currently they are stored in a linked list.
> My system configured with 16000 rules takes about 10 minutes
> to print them out using `pfctl -sr`.
> This patch decreases the time to 4 seconds.
> I was not able to measure a time increase for rule insertion.
> Inserting a lot of labels is still very slow.
> This has to be attacked separately.
> 
> 

below is a diff which should make DIOCGETRULE fast enough. It introduces
a transaction (struct pf_trans).  it comes from my 'work-in-progress' tree.
I'd like to introduce transactions to move all memory allocations done on
behalf of ioctl(2) in pf(4) outside of PF_LOCK() scope.  Seeing your diff
I've realized we can use pf_trans also to speed up DIOCGETRULE.


diff below is based on my work-in-progress tree/branch. There might be some
rough edges, but it compiles and seems to work. if people like the idea
I'll polish the change and commit it. It will also help me to make upcoming
diffs smaller.

At this point I'd like to know whether the idea presented in the diff
below makes sense.

there is `struct pf_trans` which keeps context for operations
which involve multiple ioctl(2) calls (such as ADDRULE, GETRULE,...)

ruleset.ticket is renamed to ruleset.version, because the ticket is now
kept in the newly introduced pf_trans.

the workflow for a process which wants to read/modify a ruleset, in
the case of DIOCGETRULES/DIOCGETRULE operations, looks as follows:

transaction (pf_trans) is created by DIOCGETRULES

transaction gets reference to anchor/ruleset

we also remember current version of ruleset in transaction

we also remember a pointer to the first rule in the ruleset/anchor

we return transaction ticket to pfctl(8) process

then pfctl(8) issues DIOCGETRULE, it passes transaction ticket to ioctl(2)

we use ticket to look up a transaction created by DIOCGETRULES

we check if the version kept in the transaction matches the version at
the ruleset. If there is a mismatch, the ruleset got modified and we bail out.

if rule stored in transaction is NULL we return ENOENT
to indicate ruleset is either empty or there are no more rules

if the version matches then we may safely proceed; the rule pointer
must be valid, because a matching version number guarantees nothing
was modified.

we copy rule to buffer

and we store a next rule in transaction.

all transactions created by a process are currently destroyed in pfclose()

thanks for thoughts/feedback.

regards
sashan

8<---8<---8<--8<
diff --git a/sbin/pfctl/pfctl.c b/sbin/pfctl/pfctl.c
index 9f14b955ca7..cf9def71c02 100644
--- a/sbin/pfctl/pfctl.c
+++ b/sbin/pfctl/pfctl.c
@@ -833,7 +833,7 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
 char *anchorname, int depth, int wildcard, long shownr)
 {
struct pfioc_rule pr;
-   u_int32_t nr, mnr, header = 0;
+   u_int32_t header = 0;
int len = strlen(path), ret = 0;
char *npath, *p;
 
@@ -884,24 +884,9 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
goto error;
}
 
-   if (shownr < 0) {
-   mnr = pr.nr;
-   nr = 0;
-   } else if (shownr < pr.nr) {
-   nr = shownr;
-   mnr = shownr + 1;
-   } else {
-   warnx("rule %ld not found", shownr);
-   ret = -1;
-   goto error;
-   }
-   for (; nr < mnr; ++nr) {
-   pr.nr = nr;
-   if (ioctl(dev, DIOCGETRULE, &pr) == -1) {
-   warn("DIOCGETRULE");
-   ret = -1;
-   goto error;
-   }
+   while (ioctl(dev, DIOCGETRULE, &pr) != -1) {
+   if ((shownr != -1) && (shownr != pr.nr))
+   continue;
 
/* anchor is the same for all rules in it */
if (pr.rule.anchor_wildcard == 0)
@@ -956,6 +941,13 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
case PFCTL_SHOW_NOTHING:
break;
}
+   errno = 0;
+   }
+
+   if ((errno != 0) && (errno != ENOENT)) {
+   warn("DIOCGETRULE");
+   ret = -1;
+   goto error;
}
 
/*
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index b540afc6fe2..6ef89d162b3 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -116,6 +116,11 @@ void pf_qid_unref(u_int16_t);
 int pf_states_clr(struct pfioc_state_kill *);
 int pf_states_get(struct pfioc_states *);
 
+struct pf_trans

Re: [External] : inpcb lookup ref counting

2022-08-08 Thread Alexandr Nedvedicky
Hello,

diff looks good to me as far as I can tell.

OK sashan@



patch for 7.2 to fix pfsync panics in pf_state_export()

2023-01-09 Thread Alexandr Nedvedicky
Hello,

if you don't use pfsync on 7.2 stop reading now.

There are two reports already [1] which have the same cause:
NULL pointer dereference in pf_state_export().

dlg@ has fixed this bug in CURRENT back in December [2]. Unfortunately
the patch for 7.2 needs a manual merge, because CURRENT slightly differs.

Below you can find a patch for 7.2. There is positive feedback from Tamas [3].
I'd like to ask more people who run pfsync on 7.2 to give it a try and
come back with a report. Please also keep an eye on memory leaks:

vmstat -m |egrep -e '^Name|^pfst'

the InUse counter should be oscillating around some mean value which depends
on your network load.

thanks for your help
regards
sashan


[1] https://marc.info/?l=openbsd-bugs=167195892512328=2
https://marc.info/?l=openbsd-bugs=167318980023225=2

[2] https://marc.info/?l=openbsd-cvs=167115592605571=2

[3] https://marc.info/?l=openbsd-bugs=167319072223544=2

8<8<---8<-8<

Index: sys/net/pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1140.2.1
diff -u -p -u -r1.1140.2.1 pf.c
--- sys/net/pf.c24 Nov 2022 22:51:23 -  1.1140.2.1
+++ sys/net/pf.c9 Jan 2023 21:55:58 -
@@ -185,6 +185,8 @@ int  pf_translate_icmp_af(struct pf_pd
 voidpf_send_icmp(struct mbuf *, u_int8_t, u_int8_t, int,
sa_family_t, struct pf_rule *, u_int);
 voidpf_detach_state(struct pf_state *);
+struct pf_state_key *pf_state_key_attach(struct pf_state_key *,
+struct pf_state *, int);
 voidpf_state_key_detach(struct pf_state *, int);
 u_int32_t   pf_tcp_iss(struct pf_pdesc *);
 voidpf_rule_to_actions(struct pf_rule *,
@@ -723,17 +725,26 @@ pf_state_compare_id(struct pf_state *a, 
return (0);
 }
 
-int
+/*
+ * on failure, pf_state_key_attach() releases the pf_state_key
+ * reference and returns NULL.
+ */
+struct pf_state_key *
 pf_state_key_attach(struct pf_state_key *sk, struct pf_state *s, int idx)
 {
struct pf_state_item*si;
struct pf_state_key *cur;
struct pf_state *olds = NULL;
 
+   PF_ASSERT_LOCKED();
+
KASSERT(s->key[idx] == NULL);
-   if ((cur = RB_INSERT(pf_state_tree, &pf_statetbl, sk)) != NULL) {
+   sk->removed = 0;
+   cur = RB_INSERT(pf_state_tree, &pf_statetbl, sk);
+   if (cur != NULL) {
+   sk->removed = 1;
/* key exists. check for same kif, if none, add to key */
-   TAILQ_FOREACH(si, &cur->states, entry)
+   TAILQ_FOREACH(si, &cur->states, entry) {
if (si->s->kif == s->kif &&
((si->s->key[PF_SK_WIRE]->af == sk->af &&
 si->s->direction == s->direction) ||
@@ -769,44 +780,53 @@ pf_state_key_attach(struct pf_state_key 
/* remove late or sks can go away */
olds = si->s;
} else {
-   pool_put(&pf_state_key_pl, sk);
-   return (-1);/* collision! */
+   pf_state_key_unref(sk);
+   return (NULL);  /* collision! */
}
}
-   pool_put(&pf_state_key_pl, sk);
-   s->key[idx] = cur;
-   } else
-   s->key[idx] = sk;
+   }
+
+   /* reuse the existing state key */
+   pf_state_key_unref(sk);
+   sk = cur;
+   }
 
if ((si = pool_get(&pf_state_item_pl, PR_NOWAIT)) == NULL) {
-   pf_state_key_detach(s, idx);
-   return (-1);
+   if (TAILQ_EMPTY(&sk->states)) {
+   KASSERT(cur == NULL);
+   RB_REMOVE(pf_state_tree, &pf_statetbl, sk);
+   sk->removed = 1;
+   pf_state_key_unref(sk);
+   }
+
+   return (NULL);
}
-   si->s = s;
+
+   s->key[idx] = pf_state_key_ref(sk); /* give a ref to state */
+   si->s = pf_state_ref(s);
 
/* list is sorted, if-bound states before floating */
if (s->kif == pfi_all)
-   TAILQ_INSERT_TAIL(&s->key[idx]->states, si, entry);
+   TAILQ_INSERT_TAIL(&sk->states, si, entry);
else
-   TAILQ_INSERT_HEAD(&s->key[idx]->states, si, entry);
+   TAILQ_INSERT_HEAD(&sk->states, si, entry);
 
if (olds)
pf_remove_state(olds);
 
-   return (0);
+   /* caller owns the pf_state ref, which owns a pf_state_key ref now */
+   return (sk);
 }
 
 void
 pf_detach_state(struct pf_state *s)
 {
-   if (s->key[PF_SK_WIRE] == 

Re: move the pf_state_tree type around

2023-01-02 Thread Alexandr Nedvedicky
Hello,

On Tue, Jan 03, 2023 at 03:31:42PM +1000, David Gwynne wrote:
> the pf_state_tree type represents the rb tree that pf_state_key structs
> go into. currently pf_state_key is declared in pfvar_priv.h (because
> it's a kernel private data structure) but pf_state_tree was left in
> pfvar.h. this moves it to pfvar_priv.h, because it is also kernel
> private.
> 
> while here, this moves it from the RB tree macros to RBT which shrinks
> pf.o by about 13k on amd64.

diff reads OK to me. Do I assume correctly that the 'tree_id' tree
will also be moved from RB_*() to RBT_*() in another commit?

thanks and OK sashan



Re: more consistently use "st" as the name for struct pf_state variables

2023-01-05 Thread Alexandr Nedvedicky
Hello,

On Wed, Jan 04, 2023 at 09:36:38PM +1000, David Gwynne wrote:
> and "stp" for pf_state ** variables.
> 
I agree with established naming conventions.

I'm also fine with keeping some exceptions such as `a` and `b`
in pf_state_compare_id(), local variables `tail`, `head`
in pf_states_{clr, get}() and pf_purge_expired_states().
I'm also fine with leaving static variable `cur` unchanged.

is there any reason we still keep `pf_state **sm` argument
in pf_test_rule()? the same in pf_create_state(). Is it intended?

otherwise diff reads OK to me.

thanks and
regards
sashan



Re: move the pf_state_tree_id type around

2023-01-04 Thread Alexandr Nedvedicky
Hello,

I agree with the change as-is, though I have some suggestions
for a few finishing touches. see comments inline.

On Wed, Jan 04, 2023 at 01:15:54PM +1000, David Gwynne wrote:
> move the pf_state_tree_id type from pfvar.h to pfvar_priv.h.
> 
> this is like the pf_state_tree change from yesterday. the
> pf_state_tree_id type is private to the kernel, just like the
> pf_state_tree type.
> 
> moving it from RB to RBT saves another 12k in pf.o on amd64.
> 
> i also changed some hand rolled for loops over to RBT_FOREACH_SAFE.
> they look right?

All those look good to me. The notes below are just
suggestions which I don't insist on.

> 
> ok?
> 
> Index: pf.c
> ===
> RCS file: /cvs/src/sys/net/pf.c,v
> retrieving revision 1.1166
> diff -u -p -r1.1166 pf.c
> --- pf.c  4 Jan 2023 02:00:49 -   1.1166
> +++ pf.c  4 Jan 2023 03:08:04 -

> @@ -1054,7 +1053,7 @@ pf_state_insert(struct pfi_kif *kif, str
>   s->id = htobe64(pf_status.stateid++);
>   s->creatorid = pf_status.hostid;
>   }
> - if (RB_INSERT(pf_state_tree_id, &tree_id, s) != NULL) {
> + if (RBT_INSERT(pf_state_tree_id, &tree_id, s) != NULL) {
>   if (pf_status.debug >= LOG_NOTICE) {
>   log(LOG_NOTICE, "pf: state insert failed: "
>   "id: %016llx creatorid: %08x",
> @@ -1085,7 +1084,7 @@ pf_find_state_byid(struct pf_state_cmp *
>  {
>   pf_status.fcounters[FCNT_STATE_SEARCH]++;
>  
> - return (RB_FIND(pf_state_tree_id, &tree_id, (struct pf_state *)key));
> + return (RBT_FIND(pf_state_tree_id, &tree_id, (struct pf_state *)key));
>  }

would it make sense to rename argument `s` to `st`? Just to lean
towards consistency in pf(4). `st` usually denotes state.


> Index: pf_ioctl.c
> ===
> RCS file: /cvs/src/sys/net/pf_ioctl.c,v
> retrieving revision 1.394
> diff -u -p -r1.394 pf_ioctl.c
> --- pf_ioctl.c4 Jan 2023 02:00:49 -   1.394
> +++ pf_ioctl.c4 Jan 2023 03:08:05 -
> @@ -1796,10 +1796,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
>   NET_LOCK();
>   PF_LOCK();
>   PF_STATE_ENTER_WRITE();
> - for (s = RB_MIN(pf_state_tree_id, &tree_id); s;
> - s = nexts) {
> - nexts = RB_NEXT(pf_state_tree_id, &tree_id, s);
> -
> + RBT_FOREACH_SAFE(s, pf_state_tree_id, &tree_id, nexts) {
>   if (s->direction == PF_OUT) {
>   sk = s->key[PF_SK_STACK];
>   srcaddr = >addr[1];

again: the same change as earlier, just rename `s` to `st` while
already there.

> @@ -2828,7 +2825,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
>   NET_LOCK();
>   PF_LOCK();
>   PF_STATE_ENTER_WRITE();
> - RB_FOREACH(state, pf_state_tree_id, &tree_id)
> + RBT_FOREACH(state, pf_state_tree_id, &tree_id)
>   pf_src_tree_remove_state(state);
>   PF_STATE_EXIT_WRITE();
> - RB_FOREACH(n, pf_src_tree, &tree_src_tracking)
> @@ -2861,7 +2858,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
>   if (sn->states != 0) {
>   PF_ASSERT_LOCKED();
>   PF_STATE_ENTER_WRITE();
> - RB_FOREACH(s, pf_state_tree_id,
> + RBT_FOREACH(s, pf_state_tree_id,
>  &tree_id)
>   pf_state_rm_src_node(s, sn);
>   PF_STATE_EXIT_WRITE();

in this branch too, rename `state` to `st`.

> Index: pf_lb.c
> ===
> RCS file: /cvs/src/sys/net/pf_lb.c,v
> retrieving revision 1.72
> diff -u -p -r1.72 pf_lb.c
> --- pf_lb.c   31 Aug 2022 11:29:12 -  1.72
> +++ pf_lb.c   4 Jan 2023 03:08:05 -
> @@ -311,10 +311,8 @@ pf_map_addr_sticky(sa_family_t af, struc
>   }
>   if (sns[type]->states != 0) {
>   /* XXX expensive */
> - RB_FOREACH(s, pf_state_tree_id,
> -&tree_id)
> - pf_state_rm_src_node(s,
> - sns[type]);
> + RBT_FOREACH(s, pf_state_tree_id, &tree_id)
> + pf_state_rm_src_node(s, sns[type]);
>   }

rename `s` to `st`


>   sns[type]->expire = 1;
>   pf_remove_src_node(sns[type]);
> Index: pfvar.h
> ===
> RCS file: /cvs/src/sys/net/pfvar.h,v
> retrieving revision 1.526
> diff -u -p -r1.526 pfvar.h
> --- pfvar.h   4 Jan 2023 02:00:49 -   1.526
> +++ pfvar.h   4 Jan 2023 03:08:05 

Re: help pfsync by extending pf_state_key lifetimes on pf_states

2022-12-04 Thread Alexandr Nedvedicky
Hello,


On Sat, Dec 03, 2022 at 09:53:45AM +1000, David Gwynne wrote:
> we (mostly sashan@ and me) have a problem where pfsync can be holding a
> reference to a pf_state that pf has decided to purge, and then pfsync
> crashes because it tries to read the pf_state_key parts of the state,
> but they been removed by the purge process.
> 
> the background to this is that pf_state structures don't contain ip
> addresses, protocols, ports, etc. that information is stored in a
> pf_state_key struct, which is used to wire a state into the state table.
> when pfsync wants to export information about a state, particularly the
> addresses on it, it needs the pf_state_key struct to read from.
> 
> the pf purge code current assumes that when a state is removed from the
> state table, the pf_state_key structs that are used to wire it in can
> also be removed and freed. this has obvious bad consequences for pfsync,
> which could still be holding a reference to the pf_state struct with the
> intention of reading from it.
> 
> this diff tweaks the lifetime of pf_state_structs so they at least match
> the lifetime of pf_states.
> 
> just so i'm clear, the main idea is that once pf_state_insert succeeds,
> a pf_state struct will always have pf_state_key structs hanging off it.
> it's only after the pf_state struct itself is being torn down that its
> references to pf_state_keys are released. if you're holding a
> pf_state reference, you can use that to access the pf_state_key
> structs hanging off it.
> 
> this is largely accomplished by taking more advantage of the pf_state
> and pf_state_key refcnts. pf_states get properly accounted references to
> the pf_state_keys, and the state table gets its own reference counts to
> the pf_state_keys too. when a state is removed from the state table, the
> state table reference counts drop. however, it's not until all pf_state
> references drop that pf_state will give up its references to the
> pf_states.
> 
> this is a very new diff, but i've been kicking it around with pfsync a
> bit and the pf_state_export crashes i used to get are gone now.
> 
> if you're going to test the things to look for are if NAT still works,
> and if the pfstate and pfstkey pool numbers still look right.

all above makes sense to me. I'm fine with the approach.
I think I spotted just one single glitch here in pf_state_key_attach():

> 
> Index: pf.c
> ===
> RCS file: /cvs/src/sys/net/pf.c,v
> retrieving revision 1.1156
> diff -u -p -r1.1156 pf.c
> --- pf.c  25 Nov 2022 20:27:53 -  1.1156
> +++ pf.c  2 Dec 2022 23:20:36 -

> @@ -766,44 +777,52 @@ pf_state_key_attach(struct pf_state_key 
>   /* remove late or sks can go away */
>   olds = si->s;
>   } else {
> - pool_put(&pf_state_key_pl, sk);
> - return (-1);/* collision! */
> + pf_state_key_unref(sk);
> + return (NULL);  /* collision! */
>   }
>   }
> - pool_put(&pf_state_key_pl, sk);
> - s->key[idx] = cur;
> - } else
> - s->key[idx] = sk;
> + }
> +
> + /* reuse the existing state key */
> + pf_state_key_unref(sk);
> + sk = cur;
> + }
>  
>   if ((si = pool_get(_state_item_pl, PR_NOWAIT)) == NULL) {
> - pf_state_key_detach(s, idx);
> - return (-1);
> + if (TAILQ_EMPTY(&sk->states)) {
> + RB_REMOVE(pf_state_tree, &pf_statetbl, sk);
> + sk->removed = 1;
> + }
> +
> + pf_state_key_unref(sk);

I think pf_state_key_unref() above needs to happen if and only if
sk != cur.

otherwise the diff looks good to me.

thanks and
regards
sashan



Re: help pfsync by extending pf_state_key lifetimes on pf_states

2022-12-05 Thread Alexandr Nedvedicky
Hello,
On Mon, Dec 05, 2022 at 11:46:07AM +1000, David Gwynne wrote:
> 
> yes, you're right. the diff below includes the simple fix to that.
> 
> this demonstrates how subtle the reference counting around the trees
> is though. maybe it would be more obvious to take another refcnt
> before giving the pf_state_key to the tree(s), and then give the
> pf_state_key_attach callers reference to the pf_state struct. at
> the moment we give the callers reference to the tree while holding
> the PF_LOCK, and borrow it to give another one to the state while
> the lock is still held.

Either way is fine in my opinion. 


> @@ -766,44 +777,53 @@ pf_state_key_attach(struct pf_state_key 
>   /* remove late or sks can go away */
>   olds = si->s;
>   } else {
> - pool_put(&pf_state_key_pl, sk);
> - return (-1);/* collision! */
> + pf_state_key_unref(sk);
> + return (NULL);  /* collision! */
>   }
>   }
> - pool_put(&pf_state_key_pl, sk);
> - s->key[idx] = cur;
> - } else
> - s->key[idx] = sk;
> + }
> +
> + /* reuse the existing state key */
> + pf_state_key_unref(sk);
> + sk = cur;
> + }
>  
>   if ((si = pool_get(&pf_state_item_pl, PR_NOWAIT)) == NULL) {
> - pf_state_key_detach(s, idx);
> - return (-1);
> + if (TAILQ_EMPTY(&sk->states)) {
> + KASSERT(cur == NULL);
> + RB_REMOVE(pf_state_tree, &pf_statetbl, sk);
> + sk->removed = 1;
> + pf_state_key_unref(sk);
> + }
> +
> + return (NULL);
>   }

I like your fix above.

I also wonder if pf_state_insert() should always set *skwp and *sksp to
NULL, just to indicate to the caller that reference ownership got
transferred and there is nothing left to worry about. This can be done
right at the beginning of pf_state_insert(), after we assign to the local
variables skw/sks. Just a thought/nit.

the recent diff is OK by me.

sashan



Re: help pfsync by extending pf_state_key lifetimes on pf_states

2022-12-05 Thread Alexandr Nedvedicky
Hello,

> 
> pf_test_rule uses the skw and sks pointers after pf_state_insert sets
> them via pf_create_state. i would happily change pf_test rule so it
> reads the pf_state_key pointers out of pf_state rather than carry them
> around on the stack like that, but i figured the diff was big enough as
> it was.

I forgot about pf_test_rule(). Perhaps just setting skw/sks
to NULL on failure (with the reference gone) should be sufficient
at this point.
> 
> > the recent diff is OK by me.
> 
> thanks. i'll wait until i can be around to look after it before i commit
> it.

thanks and
regards
sashan



Re: move pf_state_key and pf_state_item structs from pfvar.h to pfvar_priv.h

2022-12-15 Thread Alexandr Nedvedicky
Hello,

On Fri, Dec 16, 2022 at 01:53:42PM +1000, David Gwynne wrote:
> both struct pf_state_key and pf_state_item are kernel private data
> structures, and do not need to be visible to userland.
> 
> i would like to move them to pfvar_priv.h to make it more obvious that
> they are and should remain kernel private data structures, which in turn
> will make it less scary to tweak them in the future.
> 
> this has survived a make build.
> 
> ok?

no objection, diff reads OK to me.

thanks and
regards
sashan



Re: rename pf_state_key and pf_state_item members

2022-12-19 Thread Alexandr Nedvedicky
Hello,

I don't object. The diff improves things. Just a few bikeshedding nits
further below.
On Mon, Dec 19, 2022 at 04:48:57PM +1000, David Gwynne wrote:

> 
> i hope i didn't mess up the (sk_)states list traversals.

could not spot anything wrong there.


> 
> ok?
> 
> Index: pf.c
> ===
> RCS file: /cvs/src/sys/net/pf.c,v
> retrieving revision 1.1157
> diff -u -p -r1.1157 pf.c
> --- pf.c  16 Dec 2022 02:05:44 -  1.1157
> +++ pf.c  19 Dec 2022 06:39:45 -
> @@ -315,7 +315,7 @@ struct pf_state_tree_id tree_id;
>  struct pf_state_list pf_state_list = PF_STATE_LIST_INITIALIZER(pf_state_list);
>  
>  RB_GENERATE(pf_src_tree, pf_src_node, entry, pf_src_compare);
> -RB_GENERATE(pf_state_tree, pf_state_key, entry, pf_state_compare_key);
> +RB_GENERATE(pf_state_tree, pf_state_key, sk_entry, pf_state_compare_key);
>  RB_GENERATE(pf_state_tree_id, pf_state,
>  entry_id, pf_state_compare_id);

I think the pf_state_tree prototype should be moved to net/pfvar_priv.h;
perhaps others should be moved as well, because those trees are private to
pf(4). This can be done in a follow-up commit.

>  
> @@ -736,24 +736,25 @@ pf_state_key_attach(struct pf_state_key 
>   PF_ASSERT_LOCKED();
>  
>   KASSERT(s->key[idx] == NULL);
> - sk->removed = 0;
> + sk->sk_removed = 0;
>   cur = RB_INSERT(pf_state_tree, &pf_statetbl, sk);
>   if (cur != NULL) {
> - sk->removed = 1;
> + sk->sk_removed = 1;
>   /* key exists. check for same kif, if none, add to key */
> - TAILQ_FOREACH(si, &sk->states, entry) {
> - if (si->s->kif == s->kif &&
> - ((si->s->key[PF_SK_WIRE]->af == sk->af &&
> -  si->s->direction == s->direction) ||
> - (si->s->key[PF_SK_WIRE]->af !=
> -  si->s->key[PF_SK_STACK]->af &&
> -  sk->af == si->s->key[PF_SK_STACK]->af &&
> -  si->s->direction != s->direction))) {
> + TAILQ_FOREACH(si, &sk->sk_states, si_entry) {
> + struct pf_state *tst = si->si_st;

appreciate the consistency in your diff. it uses 'tst = si->si_st;'
however going for 'sis' instead of 'tst' would remind us that the data
here comes from the state item. This is just a nit. Don't feel strongly
about it.

diff reads OK to me.

thanks and
regards
sashan



Re: rename pf_state_key and pf_state_item members

2022-12-19 Thread Alexandr Nedvedicky
Hello,

On Tue, Dec 20, 2022 at 08:10:30AM +1000, David Gwynne wrote:

> > > -  si->s->direction != s->direction))) {
> > > + TAILQ_FOREACH(si, &sk->sk_states, si_entry) {
> > > + struct pf_state *tst = si->si_st;
> > 
> > appreciate the consistency in your diff. it uses 'tst = si->si_st;'
> > however going for 'sis' instead of 'tst' would remind us that the data
> > here comes from the state item. This is just a nit. Don't feel strongly
> > about it.
> 
> that makes a lot of sense to me. how about 'sist' instead? 's' for state
> feels weird to me after years of 's = splfol()', 'st' seems more
> explicit.

'sist' is perfect.

> 
> > diff reads OK to me.
> 
> how about this one?

Still reads OK and I like it.

OK sashan



Re: interface hooks to pf(4) should be using PF_LOCK()/PF_UNLOCK()

2022-11-22 Thread Alexandr Nedvedicky
Hello,

this change is required to unhook pf(4) from NET_LOCK().
therefore I'd like to get it in.


On Mon, Nov 07, 2022 at 04:51:59AM +1000, David Gwynne wrote:
> 
> 
> > On 7 Nov 2022, at 4:12 am, Alexandr Nedvedicky  wrote:
> > 
> > Hello,
> > 
> > diff below is the first step to make pfioctl() _without_ NET_LOCK().
> > Currently pf_if.c seems to be the only blocker which prevents us
> > from removing all NET_LOCK()/NET_UNLOCK() calls we have in pf(4).
> > 
> > diff below passed very basic smoke test. OK to commit?
> > 
> > 
> > thanks and
> > regards
> > sashan
> > 
> > 8<---8<---8<--8<
> > diff --git a/sys/net/pf_if.c b/sys/net/pf_if.c
> > index e23c14e6769..e86d85aa2c4 100644
> > --- a/sys/net/pf_if.c
> > +++ b/sys/net/pf_if.c
> > @@ -53,10 +53,17 @@
> > 
> > #include 
> > 
> > +#include 
> > +#include 
> > +#include 
> 
> do we need this chunk?
> 

The chunk above is required. It got pulled in as a dependency of
pfvar_priv.h. We need pfvar_priv.h because it provides the definition
of PF_LOCK(). With PF_LOCK() we also get the definition of pf_pdesc,
which requires the header files above.

thanks and
regards
sashan



Re: splassert on boot

2022-11-23 Thread Alexandr Nedvedicky
Hello,

On Thu, Nov 24, 2022 at 08:51:51AM +1000, David Gwynne wrote:
> im ok with this, but you need sashan@ to ok it too. he's been working up to 
> this anyway.
> 
> dlg
> 
> > On 24 Nov 2022, at 06:18, Vitaliy Makkoveev  wrote:
> > 
> > On Wed, Nov 23, 2022 at 02:59:05PM -0500, David Hill wrote:
> >> Hello -
> >> 
> >> I am seeing splasserts on boot (before kern.splassert=2 can be set) with
> >> -current.
> >> 
> >> 
> >> 
> >> spdmem0 at iic0 addr 0x50: 8GB DDR3 SDRAM PC3-12800 SO-DIMM
> >> isa0 at pcib0
> >> isadma0 at isa0
> >> vga0 at isa0 port 0x3b0/48 iomem 0xa/131072
> >> wsdisplay at vga0 not configured
> >> pcppi0 at isa0 port 0x61
> >> spkr0 at pcppi0
> >> vmm0 at mainbus0: VMX/EPT (using slow L1TF mitigation)
> >> splassert: pfi_attach_ifgroup: want 2 have 0
> >> splassert: pfi_group_addmember: want 2 have 0
> >> splassert: pfi_attach_ifgroup: want 2 have 0
> >> splassert: pfi_attach_ifgroup: want 2 have 0
> >> splassert: pfi_group_addmember: want 2 have 0
> >> 
> >> 
> >> - David
> >> 
> > 
> > The netlock assertion within PF_LOCK() looks wrong. The netlock should
> > be taken first, but only if both locks taken.
> > 
> > Index: sys/net/pfvar_priv.h
> > ===
> > RCS file: /cvs/src/sys/net/pfvar_priv.h,v
> > retrieving revision 1.21
> > diff -u -p -r1.21 pfvar_priv.h
> > --- sys/net/pfvar_priv.h11 Nov 2022 17:12:30 -  1.21
> > +++ sys/net/pfvar_priv.h23 Nov 2022 20:14:13 -
> > @@ -278,7 +278,6 @@ extern struct rwlock pf_lock;
> > extern struct rwlock pf_state_lock;
> > 
> > #define PF_LOCK()   do {\
> > -   NET_ASSERT_LOCKED();\
> > rw_enter_write(&pf_lock);   \
> > } while (0)
> > 
> 

this diff is OK, please go ahead and commit it.

sorry, I have not hit this assert on my boxes when testing this diff.

OK sashan




Re: pfsync slpassert on boot and panic

2022-11-25 Thread Alexandr Nedvedicky
Hello Hrvoje,


On Fri, Nov 25, 2022 at 09:57:15AM +0100, Hrvoje Popovski wrote:
> Hi,
> 
> I think that this is similar problem as what David Hill send on tech@
> with subject "splassert on boot"
> 
> I've checkout tree few minutes ago and in there should be
> mvs@ "Remove netlock assertion within PF_LOCK()" and
> dlg@ "get rid of NET_LOCK in the pf purge work" diffs.
> 
> on boot I'm getting this splassert
> 
> splassert: pfsync_delete_state: want 2 have 256
> Starting stack trace...
> pfsync_delete_state(fd83a66644d8) at pfsync_delete_state+0x58
> pf_remove_state(fd83a66644d8) at pf_remove_state+0x14b
> pf_purge_expired_states(1fdb,40) at pf_purge_expired_states+0x202
> pf_purge_states(0) at pf_purge_states+0x1c
> taskq_thread(822f69c8) at taskq_thread+0x11a
> end trace frame: 0x0, count: 252
> End of stack trace.

I sent a diff to David Hill yesterday [1]. It looks like
I forgot to cc tech@.


> splassert: pfsync_delete_state: want 2 have 0
> Starting stack trace...
> pfsync_delete_state(fd83a6676628) at pfsync_delete_state+0x58
> pf_remove_state(fd83a6676628) at pf_remove_state+0x14b
> pf_purge_expired_states(1f9c,40) at pf_purge_expired_states+0x202
> pf_purge_states(0) at pf_purge_states+0x1c
> taskq_thread(822f69c8) at taskq_thread+0x11a
> end trace frame: 0x0, count: 252
> End of stack trace.
> 
> 
> and if i destroy pfsync interface and then sh /etc/netstart box panic
> 
> uvm_fault(0x823d3250, 0x810, 0, 1) -> e
> kernel: page fault trap, code=0
> Stopped at  pfsync_q_ins+0x1a:  movq0x810(%r13),%rsi
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> * 68977  95532  0 0x14000  0x2003K systqmp
> pfsync_q_ins(fd83a6676628,2) at pfsync_q_ins+0x1a
> pf_remove_state(fd83a6676628) at pf_remove_state+0x14b
> pf_purge_expired_states(1f9c,40) at pf_purge_expired_states+0x202
> pf_purge_states(0) at pf_purge_states+0x1c
> taskq_thread(822f69c8) at taskq_thread+0x11a
> end trace frame: 0x0, count: 10
> https://www.openbsd.org/ddb.html describes the minimum info required in
> bug reports.  Insufficient info makes it difficult to find and fix bugs.
> ddb{3}>
> 

Looks like we need to synchronize pfsync destroy with timer thread.

thanks for great testing.

regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
index f69790ee98d..24963a546de 100644
--- a/sys/net/if_pfsync.c
+++ b/sys/net/if_pfsync.c
@@ -1865,8 +1865,6 @@ pfsync_undefer(struct pfsync_deferral *pd, int drop)
 {
struct pfsync_softc *sc = pfsyncif;
 
-   NET_ASSERT_LOCKED();
-
if (sc == NULL)
return;
 
@@ -2128,8 +2126,6 @@ pfsync_delete_state(struct pf_state *st)
 {
struct pfsync_softc *sc = pfsyncif;
 
-   NET_ASSERT_LOCKED();
-
if (sc == NULL || !ISSET(sc->sc_if.if_flags, IFF_RUNNING))
return;
 
@@ -2188,8 +2184,6 @@ pfsync_clear_states(u_int32_t creatorid, const char *ifname)
struct pfsync_clr clr;
} __packed r;
 
-   NET_ASSERT_LOCKED();
-
if (sc == NULL || !ISSET(sc->sc_if.if_flags, IFF_RUNNING))
return;
 



make state key dereference safe for pfsync(4)

2022-11-16 Thread Alexandr Nedvedicky
Hello,

with the state key mutex in the tree [1], I'd like to add yet another diff.
during h2k22 David and I split the original change [2] into two chunks.

OK to commit diff below?

thanks and
regards
sashan

[1] https://marc.info/?l=openbsd-cvs=166817856414079=2

[2] https://marc.info/?l=openbsd-bugs=166006758231954=2

8<---8<---8<--8<
diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
index f69790ee98d..e1a006a8faf 100644
--- a/sys/net/if_pfsync.c
+++ b/sys/net/if_pfsync.c
@@ -157,16 +157,16 @@ const struct {
 };
 
 struct pfsync_q {
-   void(*write)(struct pf_state *, void *);
+   int (*write)(struct pf_state *, void *);
size_t  len;
u_int8_taction;
 };
 
 /* we have one of these for every PFSYNC_S_ */
-void   pfsync_out_state(struct pf_state *, void *);
-void   pfsync_out_iack(struct pf_state *, void *);
-void   pfsync_out_upd_c(struct pf_state *, void *);
-void   pfsync_out_del(struct pf_state *, void *);
+int	pfsync_out_state(struct pf_state *, void *);
+int	pfsync_out_iack(struct pf_state *, void *);
+int	pfsync_out_upd_c(struct pf_state *, void *);
+int	pfsync_out_del(struct pf_state *, void *);
 
 struct pfsync_q pfsync_qs[] = {
{ pfsync_out_iack,  sizeof(struct pfsync_ins_ack), PFSYNC_ACT_INS_ACK },
@@ -1301,24 +1301,26 @@ pfsyncioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
return (0);
 }
 
-void
+int
 pfsync_out_state(struct pf_state *st, void *buf)
 {
struct pfsync_state *sp = buf;
 
pf_state_export(sp, st);
+   return (0);
 }
 
-void
+int
 pfsync_out_iack(struct pf_state *st, void *buf)
 {
struct pfsync_ins_ack *iack = buf;
 
iack->id = st->id;
iack->creatorid = st->creatorid;
+   return (0);
 }
 
-void
+int
 pfsync_out_upd_c(struct pf_state *st, void *buf)
 {
struct pfsync_upd_c *up = buf;
@@ -1329,9 +1331,10 @@ pfsync_out_upd_c(struct pf_state *st, void *buf)
	pf_state_peer_hton(&st->dst, &up->dst);
up->creatorid = st->creatorid;
up->timeout = st->timeout;
+   return (0);
 }
 
-void
+int
 pfsync_out_del(struct pf_state *st, void *buf)
 {
struct pfsync_del_c *dp = buf;
@@ -1340,6 +1343,7 @@ pfsync_out_del(struct pf_state *st, void *buf)
dp->creatorid = st->creatorid;
 
SET(st->state_flags, PFSTATE_NOSYNC);
+   return (0);
 }
 
 void
@@ -1671,8 +1675,8 @@ pfsync_sendout(void)
KASSERT(st->snapped == 1);
st->sync_state = PFSYNC_S_NONE;
st->snapped = 0;
-   pfsync_qs[q].write(st, m->m_data + offset);
-   offset += pfsync_qs[q].len;
+   if (pfsync_qs[q].write(st, m->m_data + offset) == 0)
+   offset += pfsync_qs[q].len;
 
pf_state_unref(st);
count++;
diff --git a/sys/net/if_pfsync.h b/sys/net/if_pfsync.h
index ff26ac3669e..cc1d78d47a3 100644
--- a/sys/net/if_pfsync.h
+++ b/sys/net/if_pfsync.h
@@ -326,7 +326,7 @@ int	pfsync_sysctl(int *, u_int, void *, size_t *,
 #definePFSYNC_SI_CKSUM 0x02
 #definePFSYNC_SI_ACK   0x04
 int	pfsync_state_import(struct pfsync_state *, int);
-void   pfsync_state_export(struct pfsync_state *,
+int	pfsync_state_export(struct pfsync_state *,
struct pf_state *);
 
 void   pfsync_insert_state(struct pf_state *);
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 4203911ea60..3653e1359cf 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -181,7 +181,8 @@ int	pf_translate_icmp_af(struct pf_pdesc *, int, void *);
 void	pf_send_icmp(struct mbuf *, u_int8_t, u_int8_t, int,
	    sa_family_t, struct pf_rule *, u_int);
 void	pf_detach_state(struct pf_state *);
-void	pf_state_key_detach(struct pf_state *, int);
+void	pf_state_key_detach(struct pf_state *,
+	    struct pf_state_key *);
 u_int32_t	pf_tcp_iss(struct pf_pdesc *);
 void	pf_rule_to_actions(struct pf_rule *,
struct pf_rule_actions *);
@@ -256,6 +257,9 @@ void	pf_state_key_unlink_inpcb(struct pf_state_key *);
 void	pf_inpcb_unlink_state_key(struct inpcb *);
 void	pf_pktenqueue_delayed(void *);
 int32_t	pf_state_expires(const struct pf_state *, uint8_t);
+void	pf_state_keys_take(struct pf_state *,
+	    struct pf_state_key **);
+void	pf_state_keys_rele(struct pf_state_key **);
 
 #if NPFLOG > 0
 void	pf_log_matches(struct pf_pdesc *, struct pf_rule *,
@@ -772,7 +776,8 @@ 

Re: pfsync panic after pf_purge backout

2022-11-28 Thread Alexandr Nedvedicky
Hello,


> 
> 
> Hi,
> 
> here's panic with WITNESS and this diff on tech@
> https://www.mail-archive.com/tech@openbsd.org/msg72582.html
> 
> I will stop now because I'm not sure what I'm doing and which diffs I'm
> testing...
> 
> 
> r620-1# uvm_fault(0x8248ea28, 0x17, 0, 2) -> e
> kernel: page fault trap, code=0
> Stopped at  pfsync_q_del+0x96:  movq%rdx,0x8(%rax)
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> *300703  35643  0 0x14000  0x2001K systq
>  237790  10061  0 0x14000 0x42000  softclock
> pfsync_q_del(fd8323dc3900) at pfsync_q_del+0x96
> pfsync_delete_state(fd8323dc3900) at pfsync_delete_state+0x118
> pf_remove_state(fd8323dc3900) at pf_remove_state+0x14e
> pf_purge_expired_states(c3501) at pf_purge_expired_states+0x1b3
> pf_purge(823ae080) at pf_purge+0x28
> taskq_thread(822cbe30) at taskq_thread+0x11a
> end trace frame: 0x0, count: 9
> https://www.openbsd.org/ddb.html describes the minimum info required in
> bug reports.  Insufficient info makes it difficult to find and fix bugs.
> ddb{1}>
> 
> 

The diff below should avoid the panic above (and similar panics in
pfsync_q_del()). It also prints some diagnostics into the system message
buffer (a.k.a. dmesg).

We panic because we attempt to remove a state from a pfsync queue which is
already empty. pfsync_q_del() must be acting on some stale information
kept in the `st` argument (the state).

The information we print into the dmesg buffer should help us understand
what's going on. At the moment I cannot explain how it comes about that
a state claims presence on a sync queue while the queue in question is
empty.

I'd like to ask you to give the diff below a try and repeat your test.
Let it run for some time and collect the 'dmesg' output for me after the
usual uptime-to-panic interval elapses during a test run.

thanks a lot
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
index f69790ee98d..d4be84a1f57 100644
--- a/sys/net/if_pfsync.c
+++ b/sys/net/if_pfsync.c
@@ -2254,6 +2254,74 @@ pfsync_q_ins(struct pf_state *st, int q)
} while (0);
 }
 
+const char *
+pfsync_qname(int q)
+{
+   switch (q) {
+   case PFSYNC_S_IACK:
+   return ("PFSYNC_S_IACK");
+   
+   case PFSYNC_S_UPD_C:
+   return ("PFSYNC_S_UPD_C");
+
+   case PFSYNC_S_DEL:
+   return ("PFSYNC_S_DEL");
+
+   case PFSYNC_S_INS:
+   return ("PFSYNC_S_INS");
+
+   case PFSYNC_S_UPD:
+   return ("PFSYNC_S_UPD");
+
+   case PFSYNC_S_COUNT:
+   return ("PFSYNC_S_COUNT");
+
+   case PFSYNC_S_DEFER:
+   return ("PFSYNC_S_DEFER");
+
+   case PFSYNC_S_NONE:
+   return ("PFSYNC_S_NONE");
+
+   default:
+   return ("???");
+   }
+}
+
+const char *
+pfsync_timeout_name(unsigned int timeout)
+{
+   const char *timeout_name[] = {
+   "PFTM_TCP_FIRST_PACKET",
+   "PFTM_TCP_OPENING",
+   "PFTM_TCP_ESTABLISHED",
+   "PFTM_TCP_CLOSING",
+   "PFTM_TCP_FIN_WAIT",
+   "PFTM_TCP_CLOSED",
+   "PFTM_UDP_FIRST_PACKET",
+   "PFTM_UDP_SINGLE",
+   "PFTM_UDP_MULTIPLE",
+   "PFTM_ICMP_FIRST_PACKET",
+   "PFTM_ICMP_ERROR_REPLY",
+   "PFTM_OTHER_FIRST_PACKET",
+   "PFTM_OTHER_SINGLE",
+   "PFTM_OTHER_MULTIPLE",
+   "PFTM_FRAG",
+   "PFTM_INTERVAL",
+   "PFTM_ADAPTIVE_START",
+   "PFTM_ADAPTIVE_END",
+   "PFTM_SRC_NODE",
+   "PFTM_TS_DIFF",
+   "PFTM_MAX",
+   "PFTM_PURGE",
+   "PFTM_UNLINKED"
+   };
+
+   if (timeout > PFTM_UNLINKED)
+   return ("???");
+   else
+   return (timeout_name[timeout]);
+}
+
 void
 pfsync_q_del(struct pf_state *st)
 {
@@ -2273,6 +2341,19 @@ pfsync_q_del(struct pf_state *st)
	mtx_leave(&sc->sc_st_mtx);
return;
}
+
+   if (TAILQ_EMPTY(&sc->sc_qs[q])) {
+   mtx_leave(&sc->sc_st_mtx);
+   DPFPRINTF(LOG_ERR,
+   "%s: stale queue (%s) in state (id %llu)"
+   "nosync: %s unlinked: %s timeout: %s", __func__,
+   pfsync_qname(q), st->id,
+   (st->state_flags & PFSTATE_NOSYNC) ? "yes" : "no",
+   (st->timeout == PFTM_UNLINKED) ? "yes" : "no",
+   pfsync_timeout_name(st->timeout));
+   return;
+   }
+
	atomic_sub_long(&sc->sc_len, pfsync_qs[q].len);
	TAILQ_REMOVE(&sc->sc_qs[q], st, sync_list);
	if (TAILQ_EMPTY(&sc->sc_qs[q]))



Re: pfsync panic after pf_purge backout

2022-11-26 Thread Alexandr Nedvedicky
Hello,

On Sat, Nov 26, 2022 at 08:33:28PM +0100, Hrvoje Popovski wrote:

> I just need to say that with all pf, pfsync and with pf_purge diffs
> after hackaton + this diff on tech@
> https://www.mail-archive.com/tech@openbsd.org/msg72582.html
> my production firewall seems stable and it wasn't without that diff

this diff is still waiting for an OK. It makes pfsync use the
state mutex to safely dereference keys.

> 
> I'm not sure if we have same diffs but even Josmar Pierri on bugs@
> https://www.mail-archive.com/bugs@openbsd.org/msg18994.html
> who had panics quite regularly with that diff on tech@ seems to have
> stable firewall now.
> 
> 
> 
> r620-1# uvm_fault(0x82374288, 0x17, 0, 2) -> e
> kernel: page fault trap, code=0
> Stopped at  pfsync_q_del+0x96:  movq%rdx,0x8(%rax)
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> *192892  19920  0 0x14000  0x2005K softnet
> pfsync_q_del(fd82e8a4ce20) at pfsync_q_del+0x96
> pf_remove_state(fd82e8a4ce20) at pf_remove_state+0x14b
> pfsync_in_del_c(fd8006d843b8,c,79,0) at pfsync_in_del_c+0x6f
> pfsync_input(800022d60ad8,800022d60ae4,f0,2) at pfsync_input+0x33c
> ip_deliver(800022d60ad8,800022d60ae4,f0,2) at ip_deliver+0x113
> ipintr() at ipintr+0x69
> if_netisr(0) at if_netisr+0xea
> taskq_thread(8003) at taskq_thread+0x100
> end trace frame: 0x0, count: 7
> https://www.openbsd.org/ddb.html describes the minimum info required in
> bug reports.  Insufficient info makes it difficult to find and fix bugs.
> ddb{5}>
> 

Those panics are causing me headaches. This got most likely uncovered
by the diff which adds a mutex. The mutex makes pfsync stable enough
that you can trigger previously unknown bugs.

thanks and
regards
sashan



Re: more consistently use "st" as the name for struct pf_state variables

2023-01-05 Thread Alexandr Nedvedicky
On Thu, Jan 05, 2023 at 08:32:44PM +1000, David Gwynne wrote:
> 
> 
> > On 5 Jan 2023, at 18:56, Alexandr Nedvedicky  wrote:
> > 
> > Hello,
> > 
> > On Wed, Jan 04, 2023 at 09:36:38PM +1000, David Gwynne wrote:
> >> and "stp" for pf_state ** variables.
> >> 
> >I agree with established naming conventions.
> > 
> >I'm also fine with keeping some exceptions such as `a` and `b`
> >in pf_state_compare_id(), local variables `tail`, `head`
> >in pf_states_{clr, get}() and pf_purge_expired_states().
> >I'm also fine with leaving static variable `cur` unchanged.
> > 
> > is there any reason we still keep `pf_state **sm` argument
> > in pf_test_rule()? the same in pf_create_state(). Is it intended?
> 
> there were a bunch of other arguments that ended with m. happy to change it
> to **stp though. we can always do another sweep for other types.
> 

I would change those occurrences to stp. My point is that we don't
expect pf_test_rule() to return a matching state.

with those tweaks diff is OK

thanks and
regards
sashan



Re: unlock pf_purge

2022-11-06 Thread Alexandr Nedvedicky
Hello,


> 
> claudio had some questions about numbers used by the diff. some of the
> maths here is set up so that every pf state purge run is guaranteed to
> do a MINIMUM amount of work. by default pf is configured with a purge
> interval of 10 seconds, but the pf state purge processing runs every
> second and tries to process 1/interval (ie 1/10th) of the states. if we
> have 4 states, then 4/10 is 0, so no states would be purged. by having a
> minimum of 64 states processed every second, we guarantee this small
> number of states gets processed.
> 
> i would like to commit this diff now (at the start of h2k22) so i can
> keep an eye on it. ive been running it for a while without issues.
> 
> the obvious next step will be to remove the NET_LOCKs.

I think this can go in as-is. I'm fine with the change.
found just a small nit, but it's more a matter of personal taste.
leaving it up to David to go for it or not. Details are below.

OK sashan

> 
> Index: pf.c
> ===
> RCS file: /cvs/src/sys/net/pf.c,v
> retrieving revision 1.1141
> diff -u -p -r1.1141 pf.c
> --- pf.c  10 Oct 2022 16:43:12 -  1.1141
> +++ pf.c  6 Nov 2022 13:08:15 -


> @@ -1559,28 +1619,38 @@ pf_purge_expired_states(u_int32_t maxche
>  
>   now = getuptime();
>  
> - do {
> + for (scanned = 0; scanned < limit; scanned++) {
>   uint8_t stimeout = cur->timeout;
> + unsigned int limited = 0;
>  
>   if ((stimeout == PFTM_UNLINKED) ||
>   (pf_state_expires(cur, stimeout) <= now)) {
>   st = pf_state_ref(cur);
>   SLIST_INSERT_HEAD(&gcl, st, gc_list);
> +
> + if (++collected >= collect)
> + limited = 1;
>   }

I think we can get away without introducing a 'limited' local
variable.

>  
>   /* don't iterate past the end of our view of the list */
>   if (cur == tail) {
> + scanned = limit;
>   cur = NULL;
>   break;
>   }
>  
>   cur = TAILQ_NEXT(cur, entry_list);
> - } while (maxcheck--);
> +
> + /* don't spend too much time here. */
> + if (ISSET(READ_ONCE(curcpu()->ci_schedstate.spc_schedflags),
> +  SPCF_SHOULDYIELD) || limited)
> + break;
> + }
>  

and of course the condition above needs to be changed to:
 SPCF_SHOULDYIELD) || (collected >= collect))

As I've said, it's just a nit.



Re: make /dev/pf a clonable device

2022-11-06 Thread Alexandr Nedvedicky
On Sun, Nov 06, 2022 at 10:42:49PM +1000, David Gwynne wrote:
> this is a small chunk to help sashan@ out with some of the pf ioctl work
> he is doing.
> 
> he is looking at allocating config over multiple ioctls, and would like
> to be able to throw it away in situations like if the userland program
> creating the state goes away. with the current vnode and device special
> semantics, only the last close will call pfclose, which is a nice place
> to do cleanup. if a long running process has /dev/pf open, then he'll
> never be able to clean up.
> 
> cloning also turns the dev_t into a nice identifier to use to
> associate these allocations with, which makes the cleanup more robust.
> using something like the pid or curproc allows for userland to confuse
> pf too easily.
> 
> ok?

yes, please.

OK sashan



Re: MAKEDEV: there's only one pf(4) device

2022-11-06 Thread Alexandr Nedvedicky
On Sun, Nov 06, 2022 at 01:06:02PM +, Klemens Nanni wrote:
> Just like bpf(4)... except bpf actually also creates bpf0 still.
> 
>   $ ls -1 /dev/{b,}pf*
>   /dev/bpf
>   /dev/bpf0
>   /dev/pf
> 
> OK?

looks ok to me, I think no one expects to create /dev/pf0, /dev/pf1 ...
Would it also make sense to update MAKEDEV(8) too? It currently
reads as:

 Special purpose devices
 apm Power Management Interface, see apm(4).
 audio*  Audio devices, see audio(4).
 bio ioctl tunnel pseudo-device, see bio(4).
 ...
 pctr*   PC Performance Tuning Register access device, see pctr(4).
 pf* Packet Filter, see pf(4).
 pppx*   PPP Multiplexer, see pppx(4).


thanks and
regards
sashan



Re: MAKEDEV: there's only one pf(4) device

2022-11-06 Thread Alexandr Nedvedicky
On Sun, Nov 06, 2022 at 04:48:11PM +, Klemens Nanni wrote:
> On Sun, Nov 06, 2022 at 05:43:15PM +0100, Alexandr Nedvedicky wrote:
> > On Sun, Nov 06, 2022 at 01:06:02PM +, Klemens Nanni wrote:
> > > Just like bpf(4)... except bpf actually also creates bpf0 still.
> > > 
> > >   $ ls -1 /dev/{b,}pf*
> > >   /dev/bpf
> > >   /dev/bpf0
> > >   /dev/pf
> > > 
> > > OK?
> > 
> > looks ok to me, I think no one expects to create /dev/pf0, /dev/pf1 
> > ...
> > Would it also make sense to update MAKEDEV(8) too. It currently
> > reads as:
> 
> Scripts and manuals are auto-generated for all architectures, so once
> the MAKEDEV.common template is committed, one has to regenerate and
> commit etc/ and usr/share/man/man8/.

I did not know that. Then it should be committed.

OK sashan



interface hooks to pf(4) should be using PF_LOCK()/PF_UNLOCK()

2022-11-06 Thread Alexandr Nedvedicky
Hello,

diff below is the first step to make pfioctl() _without_ NET_LOCK().
Currently pf_if.c seems to be the only blocker which prevents us
from removing all NET_LOCK()/NET_UNLOCK() calls we have in pf(4).

diff below passed very basic smoke test. OK to commit?


thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf_if.c b/sys/net/pf_if.c
index e23c14e6769..e86d85aa2c4 100644
--- a/sys/net/pf_if.c
+++ b/sys/net/pf_if.c
@@ -53,10 +53,17 @@
 
 #include 
 
+#include 
+#include 
+#include 
+
 #ifdef INET6
 #include 
+#include 
 #endif /* INET6 */
 
+#include 
+
 #define isupper(c) ((c) >= 'A' && (c) <= 'Z')
 #define islower(c) ((c) >= 'a' && (c) <= 'z')
 #define isalpha(c) (isupper(c)||islower(c))
@@ -296,6 +303,7 @@ pfi_attach_ifnet(struct ifnet *ifp)
struct pfi_kif  *kif;
struct task *t;
 
+   PF_LOCK();
pfi_initialize();
pfi_update++;
if ((kif = pfi_kif_get(ifp->if_xname, NULL)) == NULL)
@@ -310,6 +318,7 @@ pfi_attach_ifnet(struct ifnet *ifp)
kif->pfik_ah_cookie = t;
 
pfi_kif_update(kif);
+   PF_UNLOCK();
 }
 
 void
@@ -321,6 +330,7 @@ pfi_detach_ifnet(struct ifnet *ifp)
if ((kif = (struct pfi_kif *)ifp->if_pf_kif) == NULL)
return;
 
+   PF_LOCK();
pfi_update++;
t = kif->pfik_ah_cookie;
kif->pfik_ah_cookie = NULL;
@@ -332,6 +342,7 @@ pfi_detach_ifnet(struct ifnet *ifp)
kif->pfik_ifp = NULL;
ifp->if_pf_kif = NULL;
pfi_kif_unref(kif, PFI_KIF_REF_NONE);
+   PF_UNLOCK();
 }
 
 void
@@ -339,6 +350,7 @@ pfi_attach_ifgroup(struct ifg_group *ifg)
 {
struct pfi_kif  *kif;
 
+   PF_LOCK();
pfi_initialize();
pfi_update++;
if ((kif = pfi_kif_get(ifg->ifg_group, NULL)) == NULL)
@@ -346,6 +358,7 @@ pfi_attach_ifgroup(struct ifg_group *ifg)
 
kif->pfik_group = ifg;
ifg->ifg_pf_kif = (caddr_t)kif;
+   PF_UNLOCK();
 }
 
 void
@@ -356,11 +369,13 @@ pfi_detach_ifgroup(struct ifg_group *ifg)
if ((kif = (struct pfi_kif *)ifg->ifg_pf_kif) == NULL)
return;
 
+   PF_LOCK();
pfi_update++;
 
kif->pfik_group = NULL;
ifg->ifg_pf_kif = NULL;
pfi_kif_unref(kif, PFI_KIF_REF_NONE);
+   PF_UNLOCK();
 }
 
 void
@@ -378,15 +393,19 @@ pfi_group_change(const char *group)
 void
 pfi_group_delmember(const char *group)
 {
+   PF_LOCK();
pfi_group_change(group);
pfi_xcommit();
+   PF_UNLOCK();
 }
 
 void
 pfi_group_addmember(const char *group)
 {
+   PF_LOCK();
pfi_group_change(group);
pfi_xcommit();
+   PF_UNLOCK();
 }
 
 int
@@ -686,8 +705,10 @@ pfi_kifaddr_update(void *v)
 
NET_ASSERT_LOCKED();
 
+   PF_LOCK();
pfi_update++;
pfi_kif_update(kif);
+   PF_UNLOCK();
 }
 
 int



Re: tweak pfsync status output in ifconfig

2022-11-08 Thread Alexandr Nedvedicky
On Wed, Nov 09, 2022 at 12:09:50AM +1000, David Gwynne wrote:
> i'm hacking on pfsync(4) at the moment, and i wasted way too much time
> wondering how i broke the pfsync ioctls after i didn't see the pfsync_status
> output. turns out if you don't have a sync interface set, it skips
> output.
> 
> i think it's useful to show that the sync interface is not set, so i
> came up with this.
> 
> an unconfigured pfsync interface looks like this with my diff:
> 
> pfsync0: flags=0<> mtu 1500
>   index 5 priority 0 llprio 3
>   pfsync: syncdev none maxupd 128 defer off
>   groups: carp pfsync

looks useful to me. I certainly don't object.

OK sashan



Re: adding a mutex to pf_state

2022-11-10 Thread Alexandr Nedvedicky
Hello,

David (dlg@) asked me to shrink the scope of the change.  The new diff
introduces a mutex to pf_state and adjusts pf_detach_state() to utilize the
newly introduced mutex.  I've also added annotations to the pf_state members.
The members marked with '?' need our attention: we need to verify that they
are either protected by the global state lock, or covered by the newly
introduced mutex.

The new diff also adds mtx_init() to pf_state_import(), which was missing in
my earlier diff. It was found and reported by Hrvoje@.

thanks and
regards
sashan


8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 2c13c99baac..c6022867dd4 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -185,7 +185,8 @@ int  pf_translate_icmp_af(struct pf_pdesc*, 
int, void *);
 void    pf_send_icmp(struct mbuf *, u_int8_t, u_int8_t, int,
sa_family_t, struct pf_rule *, u_int);
 void    pf_detach_state(struct pf_state *);
-void    pf_state_key_detach(struct pf_state *, int);
+void    pf_state_key_detach(struct pf_state *,
+   struct pf_state_key *);
 u_int32_t   pf_tcp_iss(struct pf_pdesc *);
 void    pf_rule_to_actions(struct pf_rule *,
struct pf_rule_actions *);
@@ -779,7 +780,8 @@ pf_state_key_attach(struct pf_state_key *sk, struct 
pf_state *s, int idx)
s->key[idx] = sk;
 
if ((si = pool_get(&pf_state_item_pl, PR_NOWAIT)) == NULL) {
-   pf_state_key_detach(s, idx);
+   pf_state_key_detach(s, s->key[idx]);
+   s->key[idx] = NULL;
return (-1);
}
si->s = s;
@@ -799,42 +801,50 @@ pf_state_key_attach(struct pf_state_key *sk, struct 
pf_state *s, int idx)
 void
 pf_detach_state(struct pf_state *s)
 {
-   if (s->key[PF_SK_WIRE] == s->key[PF_SK_STACK])
-   s->key[PF_SK_WIRE] = NULL;
+   struct pf_state_key *key[2];
+
+   mtx_enter(&s->mtx);
+   key[PF_SK_WIRE] = s->key[PF_SK_WIRE];
+   key[PF_SK_STACK] = s->key[PF_SK_STACK];
+   s->key[PF_SK_WIRE] = NULL;
+   s->key[PF_SK_STACK] = NULL;
+   mtx_leave(&s->mtx);
+
+   if (key[PF_SK_WIRE] == key[PF_SK_STACK])
+   key[PF_SK_WIRE] = NULL;
 
-   if (s->key[PF_SK_STACK] != NULL)
-   pf_state_key_detach(s, PF_SK_STACK);
+   if (key[PF_SK_STACK] != NULL)
+   pf_state_key_detach(s, key[PF_SK_STACK]);
 
-   if (s->key[PF_SK_WIRE] != NULL)
-   pf_state_key_detach(s, PF_SK_WIRE);
+   if (key[PF_SK_WIRE] != NULL)
+   pf_state_key_detach(s, key[PF_SK_WIRE]);
 }
 
 void
-pf_state_key_detach(struct pf_state *s, int idx)
+pf_state_key_detach(struct pf_state *s, struct pf_state_key *key)
 {
struct pf_state_item*si;
-   struct pf_state_key *sk;
 
-   if (s->key[idx] == NULL)
+   PF_STATE_ASSERT_LOCKED();
+
+   if (key == NULL)
return;
 
-   si = TAILQ_FIRST(&s->key[idx]->states);
+   si = TAILQ_FIRST(&key->states);
while (si && si->s != s)
si = TAILQ_NEXT(si, entry);
 
if (si) {
-   TAILQ_REMOVE(&s->key[idx]->states, si, entry);
+   TAILQ_REMOVE(&key->states, si, entry);
pool_put(&pf_state_item_pl, si);
}
 
-   sk = s->key[idx];
-   s->key[idx] = NULL;
-   if (TAILQ_EMPTY(&sk->states)) {
-   RB_REMOVE(pf_state_tree, &pf_statetbl, sk);
-   sk->removed = 1;
-   pf_state_key_unlink_reverse(sk);
-   pf_state_key_unlink_inpcb(sk);
-   pf_state_key_unref(sk);
+   if (TAILQ_EMPTY(&key->states)) {
+   RB_REMOVE(pf_state_tree, &pf_statetbl, key);
+   key->removed = 1;
+   pf_state_key_unlink_reverse(key);
+   pf_state_key_unlink_inpcb(key);
+   pf_state_key_unref(key);
}
 }
 
@@ -997,7 +1007,9 @@ pf_state_insert(struct pfi_kif *kif, struct pf_state_key 
**skw,
}
*skw = s->key[PF_SK_WIRE];
if (pf_state_key_attach(*sks, s, PF_SK_STACK)) {
-   pf_state_key_detach(s, PF_SK_WIRE);
+   pf_state_key_detach(s, s->key[PF_SK_WIRE]);
+   s->key[PF_SK_WIRE] = NULL;
+   *skw = NULL;
PF_STATE_EXIT_WRITE();
return (-1);
}
@@ -1429,6 +1441,7 @@ pf_state_import(const struct pfsync_state *sp, int flags)
st->sync_state = PFSYNC_S_NONE;
 
refcnt_init(&st->refcnt);
+   mtx_init(&st->mtx, IPL_NET);
 
/* XXX when we have anchors, use STATE_INC_COUNTERS */
r->states_cur++;
@@ -4348,6 +4361,7 @@ pf_create_state(struct pf_pdesc *pd, struct pf_rule *r, 
struct pf_rule *a,
 * pf_state_inserts() grabs reference for pfsync!
 */

relax KASSERT() to if () in pfsync_grab_snapshot()

2022-11-11 Thread Alexandr Nedvedicky
Hello,

The diff below changes a KASSERT() to an if (). We have to prevent packets
from inserting a state into the snapshot queue multiple times.  Hrvoje@ can
trigger a situation where state updates to the pfsync peer are more frequent
than we are able to send them out.

OK to go with this simple fix/workaround?

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
index d279ede9cd6..fdff3d7a509 100644
--- a/sys/net/if_pfsync.c
+++ b/sys/net/if_pfsync.c
@@ -1362,10 +1362,17 @@ pfsync_grab_snapshot(struct pfsync_snapshot *sn, struct 
pfsync_softc *sc)
TAILQ_INIT(&sn->sn_qs[q]);

while ((st = TAILQ_FIRST(&sc->sc_qs[q])) != NULL) {
-   KASSERT(st->snapped == 0);
TAILQ_REMOVE(&sc->sc_qs[q], st, sync_list);
-   TAILQ_INSERT_TAIL(&sn->sn_qs[q], st, sync_snap);
-   st->snapped = 1;
+   if (st->snapped == 0) {
+   TAILQ_INSERT_TAIL(&sn->sn_qs[q], st, sync_snap);
+   st->snapped = 1;
+   } else {
+   /*
+* item is on snapshot list already, just give
+* up this attempt to update it.
+*/
+   pf_state_unref(st);
+   }
}
}
 



simplify handling of 'once' rules in pf(4)

2022-11-07 Thread Alexandr Nedvedicky
Hello,

The diff below simplifies the handling of 'once' rules in pf(4).
Currently a matching packet marks the 'once' rule as expired
and puts it on a garbage collection list, where the rule
waits to be removed from its ruleset by a timer.

The diff below simplifies that: a matching packet marks the
once rule as expired and leaves the rule in the ruleset.
The 'expired' flag prevents other packets from matching
such a rule. The expired rule is kept in the ruleset until
the ruleset itself is either destroyed or updated by a DIOCXCOMMIT
operation.

OK ?

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 4c70b08571e..1c06e2f6eeb 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -317,9 +317,6 @@ RB_GENERATE(pf_state_tree, pf_state_key, entry, 
pf_state_compare_key);
 RB_GENERATE(pf_state_tree_id, pf_state,
 entry_id, pf_state_compare_id);
 
-SLIST_HEAD(pf_rule_gcl, pf_rule)   pf_rule_gcl =
-   SLIST_HEAD_INITIALIZER(pf_rule_gcl);
-
 __inline int
 pf_addr_compare(struct pf_addr *a, struct pf_addr *b, sa_family_t af)
 {
@@ -1263,23 +1260,6 @@ pf_state_export(struct pfsync_state *sp, struct pf_state 
*st)
 
 /* END state table stuff */
 
-void
-pf_purge_expired_rules(void)
-{
-   struct pf_rule  *r;
-
-   PF_ASSERT_LOCKED();
-
-   if (SLIST_EMPTY(&pf_rule_gcl))
-   return;
-
-   while ((r = SLIST_FIRST(&pf_rule_gcl)) != NULL) {
-   SLIST_REMOVE(&pf_rule_gcl, r, pf_rule, gcle);
-   KASSERT(r->rule_flag & PFRULE_EXPIRED);
-   pf_purge_rule(r);
-   }
-}
-
 void
 pf_purge_timeout(void *unused)
 {
@@ -1307,10 +1287,8 @@ pf_purge(void *xnloops)
 
PF_LOCK();
/* purge other expired types every PFTM_INTERVAL seconds */
-   if (++(*nloops) >= pf_default_rule.timeout[PFTM_INTERVAL]) {
+   if (++(*nloops) >= pf_default_rule.timeout[PFTM_INTERVAL])
pf_purge_expired_src_nodes();
-   pf_purge_expired_rules();
-   }
PF_UNLOCK();
 
/*
@@ -3625,8 +3603,11 @@ pf_match_rule(struct pf_test_ctx *ctx, struct pf_ruleset 
*ruleset)
struct pf_rule  *save_a;
struct pf_ruleset   *save_aruleset;
 
+retry:
r = TAILQ_FIRST(ruleset->rules.active.ptr);
while (r != NULL) {
+   PF_TEST_ATTRIB(r->rule_flag & PFRULE_EXPIRED,
+   TAILQ_NEXT(r, entries));
r->evaluations++;
PF_TEST_ATTRIB(
(pfi_kif_match(r->kif, ctx->pd->kif) == r->ifnot),
@@ -3751,6 +3732,19 @@ pf_match_rule(struct pf_test_ctx *ctx, struct pf_ruleset 
*ruleset)
if (r->tag)
ctx->tag = r->tag;
if (r->anchor == NULL) {
+
+   if (r->rule_flag & PFRULE_ONCE) {
+   u_int32_t   rule_flag;
+
+   rule_flag = r->rule_flag;
+   if (((rule_flag & PFRULE_EXPIRED) == 0) &&
+   atomic_cas_uint(&r->rule_flag, rule_flag,
+   rule_flag | PFRULE_EXPIRED) == rule_flag)
+   r->exptime = gettime();
+   else
+   goto retry;
+   }
+
if (r->action == PF_MATCH) {
if ((ctx->ri = pool_get(&pf_rule_item_pl,
PR_NOWAIT)) == NULL) {
@@ -3962,13 +3956,6 @@ pf_test_rule(struct pf_pdesc *pd, struct pf_rule **rm, 
struct pf_state **sm,
if (r->action == PF_DROP)
goto cleanup;
 
-   /*
-* If an expired "once" rule has not been purged, drop any new matching
-* packets.
-*/
-   if (r->rule_flag & PFRULE_EXPIRED)
-   goto cleanup;
-
pf_tag_packet(pd->m, ctx.tag, ctx.act.rtableid);
if (ctx.act.rtableid >= 0 &&
rtable_l2(ctx.act.rtableid) != pd->rdomain)
@@ -4039,22 +4026,6 @@ pf_test_rule(struct pf_pdesc *pd, struct pf_rule **rm, 
struct pf_state **sm,
m_copyback(pd->m, pd->off, pd->hdrlen, &pd->hdr, M_NOWAIT);
}
 
-   if (r->rule_flag & PFRULE_ONCE) {
-   u_int32_t   rule_flag;
-
-   /*
-* Use atomic_cas() to determine a clear winner, which will
-* insert an expired rule to gcl.
-*/
-   rule_flag = r->rule_flag;
-   if (((rule_flag & PFRULE_EXPIRED) == 0) &&
-   atomic_cas_uint(&r->rule_flag, rule_flag,
-   rule_flag | PFRULE_EXPIRED) == rule_flag) {
-   r->exptime = gettime();
-   SLIST_INSERT_HEAD(&pf_rule_gcl, r, gcle);
-   }
-   }
-
 #if NPFSYNC > 0
if (*sm != NULL && !ISSET((*sm)->state_flags, PFSTATE_NOSYNC) &&
pd->dir == PF_OUT && pfsync_up()) {
diff --git 

Re: simplify handling of 'once' rules in pf(4)

2022-11-07 Thread Alexandr Nedvedicky
Hello,

resending the same diff, just updated to current.
(pointed out by dlg@)

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 2c6124e74f2..6295b4eb9d7 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -313,9 +313,6 @@ RB_GENERATE(pf_state_tree, pf_state_key, entry, 
pf_state_compare_key);
 RB_GENERATE(pf_state_tree_id, pf_state,
 entry_id, pf_state_compare_id);
 
-SLIST_HEAD(pf_rule_gcl, pf_rule)   pf_rule_gcl =
-   SLIST_HEAD_INITIALIZER(pf_rule_gcl);
-
 __inline int
 pf_addr_compare(struct pf_addr *a, struct pf_addr *b, sa_family_t af)
 {
@@ -1478,23 +1475,6 @@ pf_state_import(const struct pfsync_state *sp, int flags)
 
 /* END state table stuff */
 
-void
-pf_purge_expired_rules(void)
-{
-   struct pf_rule  *r;
-
-   PF_ASSERT_LOCKED();
-
-   if (SLIST_EMPTY(&pf_rule_gcl))
-   return;
-
-   while ((r = SLIST_FIRST(&pf_rule_gcl)) != NULL) {
-   SLIST_REMOVE(&pf_rule_gcl, r, pf_rule, gcle);
-   KASSERT(r->rule_flag & PFRULE_EXPIRED);
-   pf_purge_rule(r);
-   }
-}
-
void    pf_purge_states(void *);
 struct task pf_purge_states_task =
 TASK_INITIALIZER(pf_purge_states, NULL);
@@ -1588,7 +1568,6 @@ pf_purge(void *null)
PF_LOCK();
 
pf_purge_expired_src_nodes();
-   pf_purge_expired_rules();
 
PF_UNLOCK();
 
@@ -3916,8 +3895,11 @@ pf_match_rule(struct pf_test_ctx *ctx, struct pf_ruleset 
*ruleset)
struct pf_rule  *save_a;
struct pf_ruleset   *save_aruleset;
 
+retry:
r = TAILQ_FIRST(ruleset->rules.active.ptr);
while (r != NULL) {
+   PF_TEST_ATTRIB(r->rule_flag & PFRULE_EXPIRED,
+   TAILQ_NEXT(r, entries));
r->evaluations++;
PF_TEST_ATTRIB(
(pfi_kif_match(r->kif, ctx->pd->kif) == r->ifnot),
@@ -4042,6 +4024,19 @@ pf_match_rule(struct pf_test_ctx *ctx, struct pf_ruleset 
*ruleset)
if (r->tag)
ctx->tag = r->tag;
if (r->anchor == NULL) {
+
+   if (r->rule_flag & PFRULE_ONCE) {
+   u_int32_t   rule_flag;
+
+   rule_flag = r->rule_flag;
+   if (((rule_flag & PFRULE_EXPIRED) == 0) &&
+   atomic_cas_uint(&r->rule_flag, rule_flag,
+   rule_flag | PFRULE_EXPIRED) == rule_flag)
+   r->exptime = gettime();
+   else
+   goto retry;
+   }
+
if (r->action == PF_MATCH) {
if ((ctx->ri = pool_get(&pf_rule_item_pl,
PR_NOWAIT)) == NULL) {
@@ -4253,13 +4248,6 @@ pf_test_rule(struct pf_pdesc *pd, struct pf_rule **rm, 
struct pf_state **sm,
if (r->action == PF_DROP)
goto cleanup;
 
-   /*
-* If an expired "once" rule has not been purged, drop any new matching
-* packets.
-*/
-   if (r->rule_flag & PFRULE_EXPIRED)
-   goto cleanup;
-
pf_tag_packet(pd->m, ctx.tag, ctx.act.rtableid);
if (ctx.act.rtableid >= 0 &&
rtable_l2(ctx.act.rtableid) != pd->rdomain)
@@ -4330,22 +4318,6 @@ pf_test_rule(struct pf_pdesc *pd, struct pf_rule **rm, 
struct pf_state **sm,
m_copyback(pd->m, pd->off, pd->hdrlen, &pd->hdr, M_NOWAIT);
}
 
-   if (r->rule_flag & PFRULE_ONCE) {
-   u_int32_t   rule_flag;
-
-   /*
-* Use atomic_cas() to determine a clear winner, which will
-* insert an expired rule to gcl.
-*/
-   rule_flag = r->rule_flag;
-   if (((rule_flag & PFRULE_EXPIRED) == 0) &&
-   atomic_cas_uint(&r->rule_flag, rule_flag,
-   rule_flag | PFRULE_EXPIRED) == rule_flag) {
-   r->exptime = gettime();
-   SLIST_INSERT_HEAD(&pf_rule_gcl, r, gcle);
-   }
-   }
-
 #if NPFSYNC > 0
if (*sm != NULL && !ISSET((*sm)->state_flags, PFSTATE_NOSYNC) &&
pd->dir == PF_OUT && pfsync_up()) {
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index b5d3261f8d1..6744fc022fb 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -348,27 +348,6 @@ pf_rm_rule(struct pf_rulequeue *rulequeue, struct pf_rule 
*rule)
pool_put(_rule_pl, rule);
 }
 
-void
-pf_purge_rule(struct pf_rule *rule)
-{
-   u_int32_tnr = 0;
-   struct pf_ruleset   *ruleset;
-
-   KASSERT((rule != NULL) && (rule->ruleset != NULL));
-   ruleset = rule->ruleset;
-
-   pf_rm_rule(ruleset->rules.active.ptr, rule);
-   ruleset->rules.active.rcount--;
-   

Re: simplify handling of 'once' rules in pf(4)

2022-11-07 Thread Alexandr Nedvedicky
Hello,

Updated diff. It incorporates suggestions from Klemens:

The pf.conf(5) section which covers once rules now reads as follows:

 once    Creates a one shot rule. The first matching packet marks the rule
         as expired.  The expired rule is never evaluated again.  pfctl(8)
         does not report expired rules unless run in verbose mode ('-vv').
         In verbose mode pfctl(8) appends '# expired' to mark a once rule
         which has already been hit by a packet.

This is how pfctl reports expired once rules now:

pf# pfctl -a test/foo/bar -vv -sr 
@0 pass out proto tcp from any to any port = 80 flags S/SA once # 
expired
  [ Evaluations: 25Packets: 5489  Bytes: 4936817 
States: 0 ]
  [ Inserted: uid 0 pid 2768 State Creations: 1 ]

updated diff is below.

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sbin/pfctl/pfctl_parser.c b/sbin/pfctl/pfctl_parser.c
index 6f39ad72384..fbe19747fc6 100644
--- a/sbin/pfctl/pfctl_parser.c
+++ b/sbin/pfctl/pfctl_parser.c
@@ -1148,6 +1148,9 @@ print_rule(struct pf_rule *r, const char *anchor_call, 
int opts)
printf(" ");
print_pool(&r->route, 0, 0, r->af, PF_POOL_ROUTE, verbose);
}
+
+   if (r->rule_flag & PFRULE_EXPIRED)
+   printf(" # expired");
 }
 
 void
diff --git a/share/man/man5/pf.conf.5 b/share/man/man5/pf.conf.5
index 3e5a17acb95..b98b9f2fd9c 100644
--- a/share/man/man5/pf.conf.5
+++ b/share/man/man5/pf.conf.5
@@ -661,10 +661,14 @@ When the rate is exceeded, all ICMP is blocked until the 
rate falls below
 100 per 10 seconds again.
 .Pp
 .It Cm once
-Creates a one shot rule that will remove itself from an active ruleset after
-the first match.
-In case this is the only rule in the anchor, the anchor will be destroyed
-automatically after the rule is matched.
+Creates a one shot rule. The first matching packet marks the rule as expired.
+The expired rule is never evaluated again.
+.Xr pfctl 8
+does not report expired rules unless run in verbose mode ('-vv'). In verbose
+mode
+.Xr pfctl 8
+appends '# expired' to mark a once rule which has already been hit by a
+packet.
 .Pp
 .It Cm probability Ar number Ns %
 A probability attribute can be attached to a rule,
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 2c6124e74f2..6295b4eb9d7 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -313,9 +313,6 @@ RB_GENERATE(pf_state_tree, pf_state_key, entry, 
pf_state_compare_key);
 RB_GENERATE(pf_state_tree_id, pf_state,
 entry_id, pf_state_compare_id);
 
-SLIST_HEAD(pf_rule_gcl, pf_rule)   pf_rule_gcl =
-   SLIST_HEAD_INITIALIZER(pf_rule_gcl);
-
 __inline int
 pf_addr_compare(struct pf_addr *a, struct pf_addr *b, sa_family_t af)
 {
@@ -1478,23 +1475,6 @@ pf_state_import(const struct pfsync_state *sp, int flags)
 
 /* END state table stuff */
 
-void
-pf_purge_expired_rules(void)
-{
-   struct pf_rule  *r;
-
-   PF_ASSERT_LOCKED();
-
-   if (SLIST_EMPTY(&pf_rule_gcl))
-   return;
-
-   while ((r = SLIST_FIRST(&pf_rule_gcl)) != NULL) {
-   SLIST_REMOVE(&pf_rule_gcl, r, pf_rule, gcle);
-   KASSERT(r->rule_flag & PFRULE_EXPIRED);
-   pf_purge_rule(r);
-   }
-}
-
void    pf_purge_states(void *);
 struct task pf_purge_states_task =
 TASK_INITIALIZER(pf_purge_states, NULL);
@@ -1588,7 +1568,6 @@ pf_purge(void *null)
PF_LOCK();
 
pf_purge_expired_src_nodes();
-   pf_purge_expired_rules();
 
PF_UNLOCK();
 
@@ -3916,8 +3895,11 @@ pf_match_rule(struct pf_test_ctx *ctx, struct pf_ruleset 
*ruleset)
struct pf_rule  *save_a;
struct pf_ruleset   *save_aruleset;
 
+retry:
r = TAILQ_FIRST(ruleset->rules.active.ptr);
while (r != NULL) {
+   PF_TEST_ATTRIB(r->rule_flag & PFRULE_EXPIRED,
+   TAILQ_NEXT(r, entries));
r->evaluations++;
PF_TEST_ATTRIB(
(pfi_kif_match(r->kif, ctx->pd->kif) == r->ifnot),
@@ -4042,6 +4024,19 @@ pf_match_rule(struct pf_test_ctx *ctx, struct pf_ruleset 
*ruleset)
if (r->tag)
ctx->tag = r->tag;
if (r->anchor == NULL) {
+
+   if (r->rule_flag & PFRULE_ONCE) {
+   u_int32_t   rule_flag;
+
+   rule_flag = r->rule_flag;
+   if (((rule_flag & PFRULE_EXPIRED) == 0) &&
+   atomic_cas_uint(&r->rule_flag, rule_flag,
+   rule_flag | PFRULE_EXPIRED) == rule_flag)
+   r->exptime = gettime();
+   else
+   goto retry;
+   }
+
if (r->action == PF_MATCH) {
if 

adding a mutex to pf_state

2022-11-09 Thread Alexandr Nedvedicky
hello,

The diff below adds a mutex to pf_state. It fixes a NULL pointer dereference
panic reported by Hrvoje some time ago [1].  Besides adding a mutex to the
state, the diff addresses a race between pfsync and the state purge thread.
What happened in this particular case was that a state expired and its state
keys got detached while it was waiting to be processed by pfsync. Once pfsync
got to it, it found the state keys detached and tripped on a NULL pointer
dereference. This is the race the change below fixes.

I'm not too worried about contention on the newly introduced mutex:
it is not a global mutex but a per-state (per-object) mutex. I don't
expect to see two CPUs updating the same state very often.

thanks and
regards
sashan

[1] https://marc.info/?l=openbsd-bugs&m=166006758231954&w=2

8<---8<---8<--8<
diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
index d279ede9cd6..5f92ae6ec45 100644
--- a/sys/net/if_pfsync.c
+++ b/sys/net/if_pfsync.c
@@ -157,16 +157,16 @@ const struct {
 };
 
 struct pfsync_q {
-   void(*write)(struct pf_state *, void *);
+   int (*write)(struct pf_state *, void *);
size_t  len;
u_int8_taction;
 };
 
 /* we have one of these for every PFSYNC_S_ */
-void   pfsync_out_state(struct pf_state *, void *);
-void   pfsync_out_iack(struct pf_state *, void *);
-void   pfsync_out_upd_c(struct pf_state *, void *);
-void   pfsync_out_del(struct pf_state *, void *);
+int    pfsync_out_state(struct pf_state *, void *);
+int    pfsync_out_iack(struct pf_state *, void *);
+int    pfsync_out_upd_c(struct pf_state *, void *);
+int    pfsync_out_del(struct pf_state *, void *);
 
 struct pfsync_q pfsync_qs[] = {
{ pfsync_out_iack,  sizeof(struct pfsync_ins_ack), PFSYNC_ACT_INS_ACK },
@@ -1301,24 +1301,26 @@ pfsyncioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
return (0);
 }
 
-void
+int
 pfsync_out_state(struct pf_state *st, void *buf)
 {
struct pfsync_state *sp = buf;
 
pf_state_export(sp, st);
+   return (0);
 }
 
-void
+int
 pfsync_out_iack(struct pf_state *st, void *buf)
 {
struct pfsync_ins_ack *iack = buf;
 
iack->id = st->id;
iack->creatorid = st->creatorid;
+   return (0);
 }
 
-void
+int
 pfsync_out_upd_c(struct pf_state *st, void *buf)
 {
struct pfsync_upd_c *up = buf;
@@ -1329,9 +1331,10 @@ pfsync_out_upd_c(struct pf_state *st, void *buf)
pf_state_peer_hton(&st->dst, &up->dst);
up->creatorid = st->creatorid;
up->timeout = st->timeout;
+   return (0);
 }
 
-void
+int
 pfsync_out_del(struct pf_state *st, void *buf)
 {
struct pfsync_del_c *dp = buf;
@@ -1340,6 +1343,7 @@ pfsync_out_del(struct pf_state *st, void *buf)
dp->creatorid = st->creatorid;
 
SET(st->state_flags, PFSTATE_NOSYNC);
+   return (0);
 }
 
 void
@@ -1664,8 +1668,8 @@ pfsync_sendout(void)
KASSERT(st->snapped == 1);
st->sync_state = PFSYNC_S_NONE;
st->snapped = 0;
-   pfsync_qs[q].write(st, m->m_data + offset);
-   offset += pfsync_qs[q].len;
+   if (pfsync_qs[q].write(st, m->m_data + offset) == 0)
+   offset += pfsync_qs[q].len;
 
pf_state_unref(st);
count++;
diff --git a/sys/net/if_pfsync.h b/sys/net/if_pfsync.h
index 5447b829d74..30c1df0de9c 100644
--- a/sys/net/if_pfsync.h
+++ b/sys/net/if_pfsync.h
@@ -326,7 +326,7 @@ int pfsync_sysctl(int *, u_int,  void *, 
size_t *,
 #definePFSYNC_SI_CKSUM 0x02
 #definePFSYNC_SI_ACK   0x04
 intpfsync_state_import(struct pfsync_state *, int);
-void   pfsync_state_export(struct pfsync_state *,
+int    pfsync_state_export(struct pfsync_state *,
struct pf_state *);
 
 void   pfsync_insert_state(struct pf_state *);
diff --git a/sys/net/pf.c b/sys/net/pf.c
index c42f76dbc67..1083ee95b9a 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -185,7 +185,8 @@ int  pf_translate_icmp_af(struct pf_pdesc*, 
int, void *);
 void    pf_send_icmp(struct mbuf *, u_int8_t, u_int8_t, int,
sa_family_t, struct pf_rule *, u_int);
 void    pf_detach_state(struct pf_state *);
-void    pf_state_key_detach(struct pf_state *, int);
+void    pf_state_key_detach(struct pf_state *,
+   struct pf_state_key *);
 u_int32_t   pf_tcp_iss(struct pf_pdesc *);
 void    pf_rule_to_actions(struct pf_rule *,
struct pf_rule_actions *);
@@ -260,6 +261,9 @@ void 
pf_state_key_unlink_inpcb(struct pf_state_key *);
 void

pf(4) drops valid IGMP/MLD messages

2023-02-25 Thread Alexandr Nedvedicky
Hello,

The issue has been discovered and analyzed by Luca di Gregorio
on bugs@ [1]. The thread [1] covers all the details. People who
wish to read a brief summary can skip to Luca's email [2].

To wrap it up, the current handling of IGMP/MLD messages in pf(4)
is the exact opposite of what it should be. I failed to translate the
English words of the RFC standards into correct C code. The patch below
consists of two one-liners which make multicast for IPv4/IPv6 work again
with pf(4) enabled.

Let me quote Luca's summary for IPv6  here:

If the IP Destination Address is multicast, then the TTL
should be 1.

If the IP Destination Address is not multicast, then
no restrictions on TTL.

and Luca's summary for IPv4:

If the IP Destination Address is 224.0.0.0/4, then the TTL
should be 1.

If the IP Destination Address is not 224.0.0.0/4, then no
restrictions on TTL.

As I've said all details and references to RFCs can be found
in Luca's emails on bugs@ [1].

Diff below to fix IGMP/MLD handling has been essentially proposed
by Luca [3]. OK to commit?

thanks and
regards
sashan

[1] https://marc.info/?t=16771405962&r=1&w=2

[2] https://marc.info/?l=openbsd-bugs&m=167727244015783&w=2

[3] https://marc.info/?l=openbsd-bugs&m=167722460220004&w=2

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 8cb1326a160..c328109026c 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -6847,7 +6847,7 @@ pf_walk_header(struct pf_pdesc *pd, struct ip *h, u_short 
*reason)
/* IGMP packets have router alert options, allow them */
if (pd->proto == IPPROTO_IGMP) {
/* According to RFC 1112 ttl must be set to 1. */
-   if ((h->ip_ttl != 1) || !IN_MULTICAST(h->ip_dst.s_addr)) {
+   if ((h->ip_ttl != 1) && IN_MULTICAST(h->ip_dst.s_addr)) {
DPFPRINTF(LOG_NOTICE, "Invalid IGMP");
REASON_SET(reason, PFRES_IPOPTIONS);
return (PF_DROP);
@@ -7101,8 +7101,8 @@ pf_walk_header6(struct pf_pdesc *pd, struct ip6_hdr *h, 
u_short *reason)
 * missing then MLD message is invalid and
 * should be discarded.
 */
-   if ((h->ip6_hlim != 1) ||
-   !IN6_IS_ADDR_LINKLOCAL(&h->ip6_src)) {
+   if ((h->ip6_hlim != 1) &&
+   IN6_IS_ADDR_LINKLOCAL(&h->ip6_src)) {
DPFPRINTF(LOG_NOTICE, "Invalid MLD");
REASON_SET(reason, PFRES_IPOPTIONS);
return (PF_DROP);



Re: pf(4) drops valid IGMP/MLD messages

2023-02-26 Thread Alexandr Nedvedicky
Hello,


> > 8<---8<---8<--8<
> > diff --git a/sys/net/pf.c b/sys/net/pf.c
> > index 8cb1326a160..c328109026c 100644
> > --- a/sys/net/pf.c
> > +++ b/sys/net/pf.c
> > @@ -6847,7 +6847,7 @@ pf_walk_header(struct pf_pdesc *pd, struct ip *h, 
> > u_short *reason)
> > /* IGMP packets have router alert options, allow them */
> > if (pd->proto == IPPROTO_IGMP) {
> > /* According to RFC 1112 ttl must be set to 1. */
> > -   if ((h->ip_ttl != 1) || !IN_MULTICAST(h->ip_dst.s_addr)) {
> > +   if ((h->ip_ttl != 1) && IN_MULTICAST(h->ip_dst.s_addr)) {
> > DPFPRINTF(LOG_NOTICE, "Invalid IGMP");
> > REASON_SET(reason, PFRES_IPOPTIONS);
> > return (PF_DROP);
> > @@ -7101,8 +7101,8 @@ pf_walk_header6(struct pf_pdesc *pd, struct ip6_hdr 
> > *h, u_short *reason)
> >  * missing then MLD message is invalid and
> >  * should be discarded.
> >  */
> > -   if ((h->ip6_hlim != 1) ||
> > -   !IN6_IS_ADDR_LINKLOCAL(&h->ip6_src)) {
> > +   if ((h->ip6_hlim != 1) &&
> > +   IN6_IS_ADDR_LINKLOCAL(&h->ip6_src)) {
> > DPFPRINTF(LOG_NOTICE, "Invalid MLD");
> > REASON_SET(reason, PFRES_IPOPTIONS);
> > return (PF_DROP);
> > 
> 
> Unless I'm missing more context, this hunk looks wrong to me. Valid
> MLD packets must have a ttl of 1 *and* come from a LL address. The
> initial logic seems ok to me.
> 

yes you are right. Your comment made me to take better look
at RFC 1112 [1]. Section 'Informal Protocol Description'
reads as follows:

   Multicast routers send Host Membership Query messages (hereinafter
   called Queries) to discover which host groups have members on their
   attached local networks.  Queries are addressed to the all-hosts
   group (address 224.0.0.1), and carry an IP time-to-live of 1.

I think I've confused the all-hosts group (224.0.0.1) with any multicast
address (any class-D address, 224.0.0.0/4). I think the diff below is what we
actually need to get IPv4 IGMP working again:

[1] https://www.ietf.org/rfc/rfc1112.txt

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 8cb1326a160..c50173186da 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -6846,8 +6846,12 @@ pf_walk_header(struct pf_pdesc *pd, struct ip *h, 
u_short *reason)
pd->proto = h->ip_p;
/* IGMP packets have router alert options, allow them */
if (pd->proto == IPPROTO_IGMP) {
-   /* According to RFC 1112 ttl must be set to 1. */
-   if ((h->ip_ttl != 1) || !IN_MULTICAST(h->ip_dst.s_addr)) {
+   /*
+* According to RFC 1112 ttl must be set to 1 in all IGMP
+* packets sent to 224.0.0.1
+*/
+   if ((h->ip_ttl != 1) &&
+   (h->ip_dst.s_addr == INADDR_ALLHOSTS_GROUP)) {
DPFPRINTF(LOG_NOTICE, "Invalid IGMP");
REASON_SET(reason, PFRES_IPOPTIONS);
return (PF_DROP);



Re: assert fail in pfsync_grab_snapshot()

2023-02-21 Thread Alexandr Nedvedicky
Hello Lyndon,

this assert has been removed in current (revision 1.310). The complete diff
reads as follows:

8<---8<---8<--8<
diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
index d279ede9cd6..64a2da195ab 100644
--- a/sys/net/if_pfsync.c
+++ b/sys/net/if_pfsync.c
@@ -1,4 +1,4 @@
-/* $OpenBSD: if_pfsync.c,v 1.309 2022/11/06 21:34:01 kn Exp $  */
+/* $OpenBSD: if_pfsync.c,v 1.310 2022/11/11 11:22:48 sashan Exp $  */
 
 /*
  * Copyright (c) 2002 Michael Shalayeff
@@ -1362,10 +1362,17 @@ pfsync_grab_snapshot(struct pfsync_snapshot *sn, struct 
pfsync_softc *sc)
TAILQ_INIT(&sn->sn_qs[q]);

while ((st = TAILQ_FIRST(&sc->sc_qs[q])) != NULL) {
-   KASSERT(st->snapped == 0);
TAILQ_REMOVE(&sc->sc_qs[q], st, sync_list);
-   TAILQ_INSERT_TAIL(&sn->sn_qs[q], st, sync_snap);
-   st->snapped = 1;
+   if (st->snapped == 0) {
+   TAILQ_INSERT_TAIL(&sn->sn_qs[q], st, sync_snap);
+   st->snapped = 1;
+   } else {
+   /*
+* item is on snapshot list already, so we can
+* skip it now.
+*/
+   pf_state_unref(st);
+   }
}
}
 
8<---8<---8<--8<

The commit changes the assert to a regular condition. Not sure if the diff
applies cleanly to 7.2.

regards
sashan

On Tue, Feb 21, 2023 at 11:58:51AM -0800, Lyndon Nerenberg (VE7TFX/VE6BBM) 
wrote:
> Perhaps related to the recent discussion about pf errors?  This
> happened this morning, on a fully patched 7.2 amd64. dmesg and other
> info available on request.
> 
> ddb{0}> bt
> db_enter() at db_enter+0x10
> panic(81f30a19) at panic+0xbf
> __assert(81fa0761,81fd3154,637,81f6b8f7) at 
> __assert+0x
> 25
> pfsync_grab_snapshot(8000211e0ee0,80b57000) at 
> pfsync_grab_snapshot
> +0x308
> pfsync_sendout() at pfsync_sendout+0x89
> pfsync_update_state(fd905d068020) at pfsync_update_state+0x15b
> pf_test(2,3,80b65000,8000211e1258) at pf_test+0x117a
> ip_output(fd806e7ce500,0,8000211e13e8,1,0,0,e42ab41b1c818e9c) at 
> ip_out
> put+0x6b7
> ip_forward(fd806e7ce500,80b53800,fd8d52297390,0) at 
> ip_forward+
> 0x2da
> ip_input_if(8000211e1528,8000211e1534,4,0,80b53800) at 
> ip_input
> _if+0x35c
> ipv4_input(80b53800,fd806e7ce500) at ipv4_input+0x39
> ether_input(80b53800,fd806e7ce500) at ether_input+0x3b1
> carp_input(80b67800,fd806e7ce500,5e000106) at carp_input+0x196
> ether_input(80b67800,fd806e7ce500) at ether_input+0x1d9
> vlan_input(80b39000,fd806e7ce500,8000211e175c) at 
> vlan_input+0x
> 23d
> ether_input(80b39000,fd806e7ce500) at ether_input+0x85
> if_input_process(802c5048,8000211e17f8) at if_input_process+0x6f
> ifiq_process(802c8600) at ifiq_process+0x69
> taskq_thread(80033200) at taskq_thread+0x100
> end trace frame: 0x0, count: -19
> ddb{0}> show panic
> *cpu0: kernel diagnostic assertion "st->snapped == 0" failed: file 
> "/usr/src/sy
> s/net/if_pfsync.c", line 1591
>  cpu3: kernel diagnostic assertion "st->snapped == 0" failed: file 
> "/usr/src/sy
> s/net/if_pfsync.c", line 1591
>  cpu1: kernel diagnostic assertion "st->snapped == 0" failed: file 
> "/usr/src/sy
> s/net/if_pfsync.c", line 1591
> ddb{0}> 
> 
> --lyndon
> 



Re: pf(4) drops valid IGMP/MLD messages

2023-03-03 Thread Alexandr Nedvedicky
Hello,
On Fri, Mar 03, 2023 at 09:14:38PM +0100, Alexander Bluhm wrote:
> On Fri, Mar 03, 2023 at 08:54:51AM +0100, Luca Di Gregorio wrote:
> > Hi, just another bit of info about this issue.
> 
> Instead of implementing more and more details of RFC, we should
> discuss what the goal is.
> 
> We had a pf that dropped valid IGMP packets due to router-alert
> option.  So I added code to pass them, but still block other options.
> This is neccessary for working multicast.
> 
> Then sashan@ commited a diff that blocks packets with bad TTL or
> destination address.  This was too strict for some use cases as we
> see now.  But it may improve security.
> 
> What are the IGMP illegal packets that an attacker might use?  We
> should drop them.  This IGMP logic is deep down in pf that a user
> cannot override.  So we should be careful not to destroy valid use
> cases.  Replacing || with && somehow in the if condition looks
> reasonalbe to me.
> 

My commit was wrong. It makes pf(4) drop all IGMP packets which either
have a TTL other than 1 or are not sent to a multicast address.

This is what we currently have in pf_walk_header():

6849 /* IGMP packets have router alert options, allow them */
6850 if (pd->proto == IPPROTO_IGMP) {
6851 /* According to RFC 1112 ttl must be set to 1. */
6852 if ((h->ip_ttl != 1) || !IN_MULTICAST(h->ip_dst.s_addr)) {
6853 DPFPRINTF(LOG_NOTICE, "Invalid IGMP");
6854 REASON_SET(reason, PFRES_IPOPTIONS);
6855 return (PF_DROP);
6856 }
6857 CLR(pd->badopts, PF_OPT_ROUTER_ALERT);
6858 }

I think Luca is right: pf(4) is too paranoid and breaks IGMP. This is what
the 'Informal Protocol Description' section of RFC 1112 says:

   Multicast routers send Host Membership Query messages (hereinafter
   called Queries) to discover which host groups have members on their
   attached local networks.  Queries are addressed to the all-hosts
   group (address 224.0.0.1), and carry an IP time-to-live of 1.

In my commit I confused any multicast address with the 'all-hosts' group
(224.0.0.1).

So we either want to revert that commit or fix it. I think the condition at
line 6850 should read as follows:

6847 /* IGMP packets have router alert options, allow them */
6848 if (pd->proto == IPPROTO_IGMP) {
6849 /*
6850  * According to RFC 1112 ttl must be set to 1 in all IGMP
6851  * packets sent to 224.0.0.1
6852  */
6853 if ((h->ip_ttl != 1) &&
6854 (h->ip_dst.s_addr == INADDR_ALLHOSTS_GROUP)) {
6855 DPFPRINTF(LOG_NOTICE, "Invalid IGMP");
6856 REASON_SET(reason, PFRES_IPOPTIONS);
6857 return (PF_DROP);
6858 }
6859 CLR(pd->badopts, PF_OPT_ROUTER_ALERT);

This change should make pf(4) reasonably paranoid while keeping IGMP working.

thanks and
regards
sashan



Re: pf(4) drops valid IGMP/MLD messages

2023-02-28 Thread Alexandr Nedvedicky
Hello Matthieu,



On Tue, Feb 28, 2023 at 02:05:50PM +0100, Matthieu Herrb wrote:
> > --- a/sys/net/pf.c
> > +++ b/sys/net/pf.c
> > @@ -6846,8 +6846,12 @@ pf_walk_header(struct pf_pdesc *pd, struct ip *h, 
> > u_short *reason)
> > pd->proto = h->ip_p;
> > /* IGMP packets have router alert options, allow them */
> > if (pd->proto == IPPROTO_IGMP) {
> > -   /* According to RFC 1112 ttl must be set to 1. */
> > -   if ((h->ip_ttl != 1) || !IN_MULTICAST(h->ip_dst.s_addr)) {
> > +   /*
> > +* According to RFC 1112 ttl must be set to 1 in all IGMP
> > +* packets sent to 224.0.0.1
> > +*/
> > +   if ((h->ip_ttl != 1) &&
> > +   (h->ip_dst.s_addr == INADDR_ALLHOSTS_GROUP)) {
> > DPFPRINTF(LOG_NOTICE, "Invalid IGMP");
> > REASON_SET(reason, PFRES_IPOPTIONS);
> > return (PF_DROP);
> > 
> 
> 
> Hi,
> 
> The expression looks wrong to me again.
> I read in RFC1112 that correct packets should have:
> 
>   (h->ip_ttl == 1 && h->ip_dst.s_addr == INADDR_ALLHOSTS_GROUP)

The expression above is true for Query/Report messages,
which are a subset of the IGMP protocol. See Luca's emails with
references to the RFCs which further extend IGMP.

Here in this particular place (pf_walk_header()) I think we really
want to simply discard all IGMP datagrams sent to 224.0.0.1 with a TTL
different from 1.  Anything else should be passed further up, letting the
pf(4) rules make a decision on those cases.

thanks and
regards
sashan



Re: DIOCGETRULE is slow for large rulesets (2/3)

2023-04-26 Thread Alexandr Nedvedicky
Hello,

On Thu, Apr 27, 2023 at 06:51:52AM +1000, David Gwynne wrote:

> > Is that kind of check in a KASSERT() something you have in mind?
> > Perhaps I can trade the KASSERT() for regular code:
> > 
> > if (t->pft_unit != minor(dev))
> > return (EPERM);
> 
> i would pass the dev/minor/unit to pf_find_trans() and compare along
> with the ticket value as a matter of course. returning a different
> errno if the minor is different is unnecessary.
> 

Something like this?

struct pf_trans *
pf_find_trans(uint32_t unit, uint64_t ticket)
{
	struct pf_trans *t;

	rw_assert_anylock(&pfioctl_rw);

	LIST_FOREACH(t, &pf_ioctl_trans, pft_entry) {
		if (t->pft_ticket == ticket)
			break;
	}

	if (t->pft_unit != unit)
		return (NULL);

	return (t);
}

Just return NULL on unit mismatch. Updated diff is below.


thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index 7ea22050506..ebe1b912766 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -117,6 +117,11 @@ voidpf_qid_unref(u_int16_t);
 int pf_states_clr(struct pfioc_state_kill *);
 int pf_states_get(struct pfioc_states *);
 
+struct pf_trans*pf_open_trans(uint32_t);
+struct pf_trans*pf_find_trans(uint32_t, uint64_t);
+voidpf_free_trans(struct pf_trans *);
+voidpf_rollback_trans(struct pf_trans *);
+
 struct pf_rule  pf_default_rule, pf_default_rule_new;
 
 struct {
@@ -168,6 +173,8 @@ int  pf_rtlabel_add(struct pf_addr_wrap *);
 voidpf_rtlabel_remove(struct pf_addr_wrap *);
 voidpf_rtlabel_copyout(struct pf_addr_wrap *);
 
+uint64_t trans_ticket = 1;
+LIST_HEAD(, pf_trans)  pf_ioctl_trans = LIST_HEAD_INITIALIZER(pf_trans);
 
 void
 pfattach(int num)
@@ -293,6 +300,25 @@ pfopen(dev_t dev, int flags, int fmt, struct proc *p)
 int
 pfclose(dev_t dev, int flags, int fmt, struct proc *p)
 {
+   struct pf_trans *w, *s;
+   LIST_HEAD(, pf_trans)   tmp_list;
+   uint32_t unit = minor(dev);
+
+   LIST_INIT(&tmp_list);
+   rw_enter_write(&pfioctl_rw);
+   LIST_FOREACH_SAFE(w, &pf_ioctl_trans, pft_entry, s) {
+   if (w->pft_unit == unit) {
+   LIST_REMOVE(w, pft_entry);
+   LIST_INSERT_HEAD(&tmp_list, w, pft_entry);
+   }
+   }
+   rw_exit_write(&pfioctl_rw);
+
+   while ((w = LIST_FIRST(&tmp_list)) != NULL) {
+   LIST_REMOVE(w, pft_entry);
+   pf_free_trans(w);
+   }
+
return (0);
 }
 
@@ -522,7 +548,7 @@ pf_qid_unref(u_int16_t qid)
 }
 
 int
-pf_begin_rules(u_int32_t *version, const char *anchor)
+pf_begin_rules(u_int32_t *ticket, const char *anchor)
 {
struct pf_ruleset   *rs;
struct pf_rule  *rule;
@@ -533,20 +559,20 @@ pf_begin_rules(u_int32_t *version, const char *anchor)
pf_rm_rule(rs->rules.inactive.ptr, rule);
rs->rules.inactive.rcount--;
}
-   *version = ++rs->rules.inactive.version;
+   *ticket = ++rs->rules.inactive.ticket;
rs->rules.inactive.open = 1;
return (0);
 }
 
 void
-pf_rollback_rules(u_int32_t version, char *anchor)
+pf_rollback_rules(u_int32_t ticket, char *anchor)
 {
struct pf_ruleset   *rs;
struct pf_rule  *rule;
 
rs = pf_find_ruleset(anchor);
if (rs == NULL || !rs->rules.inactive.open ||
-   rs->rules.inactive.version != version)
+   rs->rules.inactive.ticket != ticket)
return;
while ((rule = TAILQ_FIRST(rs->rules.inactive.ptr)) != NULL) {
pf_rm_rule(rs->rules.inactive.ptr, rule);
@@ -825,7 +851,7 @@ pf_hash_rule(MD5_CTX *ctx, struct pf_rule *rule)
 }
 
 int
-pf_commit_rules(u_int32_t version, char *anchor)
+pf_commit_rules(u_int32_t ticket, char *anchor)
 {
struct pf_ruleset   *rs;
struct pf_rule  *rule;
@@ -834,7 +860,7 @@ pf_commit_rules(u_int32_t version, char *anchor)
 
rs = pf_find_ruleset(anchor);
if (rs == NULL || !rs->rules.inactive.open ||
-   version != rs->rules.inactive.version)
+   ticket != rs->rules.inactive.ticket)
return (EBUSY);
 
if (rs == &pf_main_ruleset)
@@ -849,7 +875,7 @@ pf_commit_rules(u_int32_t version, char *anchor)
rs->rules.inactive.ptr = old_rules;
rs->rules.inactive.rcount = old_rcount;
 
-   rs->rules.active.version = rs->rules.inactive.version;
+   rs->rules.active.ticket = rs->rules.inactive.ticket;
pf_calc_skip_steps(rs->rules.active.ptr);
 
 
@@ -1142,10 +1168,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)

Re: DIOCGETRULE is slow for large rulesets (3/3)

2023-04-26 Thread Alexandr Nedvedicky
Hello,

The updated diff below adapts to the recent changes to pf_find_trans() in
diff 2/3 [1].

In case DIOCGETRULE does not find the desired transaction, the ioctl(2) call
to /dev/pf returns ENXIO.

thanks and
regards
sashan

[1] https://marc.info/?l=openbsd-tech&m=168254555211083&w=2

8<---8<---8<--8<
diff --git a/sbin/pfctl/pfctl.c b/sbin/pfctl/pfctl.c
index 8cbd9d77b4f..3787e97a6b1 100644
--- a/sbin/pfctl/pfctl.c
+++ b/sbin/pfctl/pfctl.c
@@ -837,7 +837,7 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
 char *anchorname, int depth, int wildcard, long shownr)
 {
struct pfioc_rule pr;
-   u_int32_t nr, mnr, header = 0;
+   u_int32_t header = 0;
int len = strlen(path), ret = 0;
char *npath, *p;
 
@@ -893,24 +893,9 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
goto error;
}
 
-   if (shownr < 0) {
-   mnr = pr.nr;
-   nr = 0;
-   } else if (shownr < pr.nr) {
-   nr = shownr;
-   mnr = shownr + 1;
-   } else {
-   warnx("rule %ld not found", shownr);
-   ret = -1;
-   goto error;
-   }
-   for (; nr < mnr; ++nr) {
-   pr.nr = nr;
-   if (ioctl(dev, DIOCGETRULE, &pr) == -1) {
-   warn("DIOCGETRULE");
-   ret = -1;
-   goto error;
-   }
+   while (ioctl(dev, DIOCGETRULE, &pr) != -1) {
+   if ((shownr != -1) && (shownr != pr.nr))
+   continue;
 
/* anchor is the same for all rules in it */
if (pr.rule.anchor_wildcard == 0)
@@ -968,6 +953,13 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
case PFCTL_SHOW_NOTHING:
break;
}
+   errno = 0;
+   }
+
+   if ((errno != 0) && (errno != ENOENT)) {
+   warn("DIOCGETRULE");
+   ret = -1;
+   goto error;
}
 
/*
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index ebe1b912766..0f78c5b12ac 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -122,6 +122,10 @@ struct pf_trans*pf_find_trans(uint32_t, 
uint64_t);
 voidpf_free_trans(struct pf_trans *);
 voidpf_rollback_trans(struct pf_trans *);
 
+voidpf_init_tgetrule(struct pf_trans *,
+   struct pf_anchor *, uint32_t, struct pf_rule *);
+voidpf_cleanup_tgetrule(struct pf_trans *t);
+
 struct pf_rule  pf_default_rule, pf_default_rule_new;
 
 struct {
@@ -548,7 +552,7 @@ pf_qid_unref(u_int16_t qid)
 }
 
 int
-pf_begin_rules(u_int32_t *ticket, const char *anchor)
+pf_begin_rules(u_int32_t *version, const char *anchor)
 {
struct pf_ruleset   *rs;
struct pf_rule  *rule;
@@ -559,20 +563,20 @@ pf_begin_rules(u_int32_t *ticket, const char *anchor)
pf_rm_rule(rs->rules.inactive.ptr, rule);
rs->rules.inactive.rcount--;
}
-   *ticket = ++rs->rules.inactive.ticket;
+   *version = ++rs->rules.inactive.version;
rs->rules.inactive.open = 1;
return (0);
 }
 
 void
-pf_rollback_rules(u_int32_t ticket, char *anchor)
+pf_rollback_rules(u_int32_t version, char *anchor)
 {
struct pf_ruleset   *rs;
struct pf_rule  *rule;
 
rs = pf_find_ruleset(anchor);
if (rs == NULL || !rs->rules.inactive.open ||
-   rs->rules.inactive.ticket != ticket)
+   rs->rules.inactive.version != version)
return;
while ((rule = TAILQ_FIRST(rs->rules.inactive.ptr)) != NULL) {
pf_rm_rule(rs->rules.inactive.ptr, rule);
@@ -851,7 +855,7 @@ pf_hash_rule(MD5_CTX *ctx, struct pf_rule *rule)
 }
 
 int
-pf_commit_rules(u_int32_t ticket, char *anchor)
+pf_commit_rules(u_int32_t version, char *anchor)
 {
struct pf_ruleset   *rs;
struct pf_rule  *rule;
@@ -860,7 +864,7 @@ pf_commit_rules(u_int32_t ticket, char *anchor)
 
rs = pf_find_ruleset(anchor);
if (rs == NULL || !rs->rules.inactive.open ||
-   ticket != rs->rules.inactive.ticket)
+   version != rs->rules.inactive.version)
return (EBUSY);
 
if (rs == &pf_main_ruleset)
@@ -875,7 +879,7 @@ pf_commit_rules(u_int32_t ticket, char *anchor)
rs->rules.inactive.ptr = old_rules;
rs->rules.inactive.rcount = old_rcount;
 
-   rs->rules.active.ticket = rs->rules.inactive.ticket;
+   rs->rules.active.version = rs->rules.inactive.version;
pf_calc_skip_steps(rs->rules.active.ptr);
 
 
@@ -1214,7 +1218,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
 
NET_LOCK();
PF_LOCK();
-  

Re: DIOCGETRULE is slow for large rulesets (2/3)

2023-04-27 Thread Alexandr Nedvedicky
Hello,

On Thu, Apr 27, 2023 at 10:41:53AM +1000, David Gwynne wrote:

> 
> t could be NULL here. just do the unit check inside the loop?
> 
> > 
> > if (t->pft_unit != unit)
> > return (NULL);
> > 
> > return (t);
> > }
> > 
> > just return NULL on unit mismatch.  updated diff is below.
> 
> ok once you fix the nit above.
> 

Well spotted. This is the change I made to get the 2/3 patch
fixed:

diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index ebe1b912766..b527610737b 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -3291,13 +3291,10 @@ pf_find_trans(uint32_t unit, uint64_t ticket)
rw_assert_anylock(&pfioctl_rw);
 
LIST_FOREACH(t, &pf_ioctl_trans, pft_entry) {
-   if (t->pft_ticket == ticket)
+   if (t->pft_ticket == ticket && t->pft_unit == unit)
break;
}
 
-   if (t->pft_unit != unit)
-   return (NULL);
-
return (t);
 }


thanks and
regards
sashan



DIOCGETRULE is slow for large rulesets (1/3)

2023-04-25 Thread Alexandr Nedvedicky
Hello,

Below is a diff which renames the ruleset member `ticket` to `version`.
The reason for this is to keep things clean: the word `ticket` will be
used to identify a transaction, while the newly introduced `version`
identifies a change to a particular pf object (the ruleset).

The diff below is a simple find/replace. It changes `ticket` to `version`
in the pf_ruleset used by the kernel; I don't want to drag this change
into pfctl.

OK?

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index 1141069dcf6..7ea22050506 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -522,7 +522,7 @@ pf_qid_unref(u_int16_t qid)
 }
 
 int
-pf_begin_rules(u_int32_t *ticket, const char *anchor)
+pf_begin_rules(u_int32_t *version, const char *anchor)
 {
struct pf_ruleset   *rs;
struct pf_rule  *rule;
@@ -533,20 +533,20 @@ pf_begin_rules(u_int32_t *ticket, const char *anchor)
pf_rm_rule(rs->rules.inactive.ptr, rule);
rs->rules.inactive.rcount--;
}
-   *ticket = ++rs->rules.inactive.ticket;
+   *version = ++rs->rules.inactive.version;
rs->rules.inactive.open = 1;
return (0);
 }
 
 void
-pf_rollback_rules(u_int32_t ticket, char *anchor)
+pf_rollback_rules(u_int32_t version, char *anchor)
 {
struct pf_ruleset   *rs;
struct pf_rule  *rule;
 
rs = pf_find_ruleset(anchor);
if (rs == NULL || !rs->rules.inactive.open ||
-   rs->rules.inactive.ticket != ticket)
+   rs->rules.inactive.version != version)
return;
while ((rule = TAILQ_FIRST(rs->rules.inactive.ptr)) != NULL) {
pf_rm_rule(rs->rules.inactive.ptr, rule);
@@ -825,7 +825,7 @@ pf_hash_rule(MD5_CTX *ctx, struct pf_rule *rule)
 }
 
 int
-pf_commit_rules(u_int32_t ticket, char *anchor)
+pf_commit_rules(u_int32_t version, char *anchor)
 {
struct pf_ruleset   *rs;
struct pf_rule  *rule;
@@ -834,7 +834,7 @@ pf_commit_rules(u_int32_t ticket, char *anchor)
 
rs = pf_find_ruleset(anchor);
if (rs == NULL || !rs->rules.inactive.open ||
-   ticket != rs->rules.inactive.ticket)
+   version != rs->rules.inactive.version)
return (EBUSY);
 
if (rs == &pf_main_ruleset)
@@ -849,7 +849,7 @@ pf_commit_rules(u_int32_t ticket, char *anchor)
rs->rules.inactive.ptr = old_rules;
rs->rules.inactive.rcount = old_rcount;
 
-   rs->rules.active.ticket = rs->rules.inactive.ticket;
+   rs->rules.active.version = rs->rules.inactive.version;
pf_calc_skip_steps(rs->rules.active.ptr);
 
 
@@ -1191,7 +1191,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
 
NET_LOCK();
PF_LOCK();
-   pq->ticket = pf_main_ruleset.rules.active.ticket;
+   pq->ticket = pf_main_ruleset.rules.active.version;
 
/* save state to not run over them all each time? */
qs = TAILQ_FIRST(pf_queues_active);
@@ -1212,7 +1212,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
 
NET_LOCK();
PF_LOCK();
-   if (pq->ticket != pf_main_ruleset.rules.active.ticket) {
+   if (pq->ticket != pf_main_ruleset.rules.active.version) {
error = EBUSY;
PF_UNLOCK();
NET_UNLOCK();
@@ -1243,7 +1243,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
 
NET_LOCK();
PF_LOCK();
-   if (pq->ticket != pf_main_ruleset.rules.active.ticket) {
+   if (pq->ticket != pf_main_ruleset.rules.active.version) {
error = EBUSY;
PF_UNLOCK();
NET_UNLOCK();
@@ -1290,7 +1290,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
 
NET_LOCK();
PF_LOCK();
-   if (q->ticket != pf_main_ruleset.rules.inactive.ticket) {
+   if (q->ticket != pf_main_ruleset.rules.inactive.version) {
error = EBUSY;
PF_UNLOCK();
NET_UNLOCK();
@@ -1386,7 +1386,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
pf_rule_free(rule);
goto fail;
}
-   if (pr->ticket != ruleset->rules.inactive.ticket) {
+   if (pr->ticket != ruleset->rules.inactive.version) {
error = EBUSY;
PF_UNLOCK();
NET_UNLOCK();
@@ -1464,7 +1464,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
pr->nr = tail->nr + 1;
else
pr->nr = 0;
-  

DIOCGETRULE is slow for large rulesets (2/3)

2023-04-25 Thread Alexandr Nedvedicky
Hello,

This is the second diff. It introduces a transaction (pf_trans).
It's more or less a diff with dead code.

It's still worth noting these two chunks in this diff:

@@ -1142,10 +1172,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
return (EACCES);
}
 
-   if (flags & FWRITE)
-   rw_enter_write(&pfioctl_rw);
-   else
-   rw_enter_read(&pfioctl_rw);
+   rw_enter_write(&pfioctl_rw);
 
switch (cmd) {
 
@@ -3022,10 +3049,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
break;
}
 fail:
-   if (flags & FWRITE)
-   rw_exit_write(&pfioctl_rw);
-   else
-   rw_exit_read(&pfioctl_rw);
+   rw_exit_write(&pfioctl_rw);
 
`pfioctl_rw` serializes processes which perform ioctl(2) on pf(4).
I'd like to also use this lock to serialize access to the table of
transactions. To keep things simple for now, I'd like to make every
process perform ioctl(2) on pf(4) exclusively. I plan to revisit this
later when I push the pfioctl_rw operations further down to the
individual ioctl operations.

The operations pf_open_trans(), pf_find_trans() and pf_rollback_trans()
introduced in this diff are unused. They will be brought to life in
the 3rd diff.

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index 7ea22050506..7e4c7d2a2ab 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -117,6 +117,11 @@ voidpf_qid_unref(u_int16_t);
 int pf_states_clr(struct pfioc_state_kill *);
 int pf_states_get(struct pfioc_states *);
 
+struct pf_trans*pf_open_trans(pid_t);
+struct pf_trans*pf_find_trans(uint64_t);
+voidpf_free_trans(struct pf_trans *);
+voidpf_rollback_trans(struct pf_trans *);
+
 struct pf_rule  pf_default_rule, pf_default_rule_new;
 
 struct {
@@ -168,6 +173,8 @@ int  pf_rtlabel_add(struct pf_addr_wrap *);
 voidpf_rtlabel_remove(struct pf_addr_wrap *);
 voidpf_rtlabel_copyout(struct pf_addr_wrap *);
 
+uint64_t trans_ticket = 1;
+LIST_HEAD(, pf_trans)  pf_ioctl_trans = LIST_HEAD_INITIALIZER(pf_trans);
 
 void
 pfattach(int num)
@@ -293,6 +300,29 @@ pfopen(dev_t dev, int flags, int fmt, struct proc *p)
 int
 pfclose(dev_t dev, int flags, int fmt, struct proc *p)
 {
+   struct pf_trans *w, *s;
+   LIST_HEAD(, pf_trans)   tmp_list;
+
+   if (minor(dev) >= 1)
+   return (ENXIO);
+
+   if (flags & FWRITE) {
+   LIST_INIT(&tmp_list);
+   rw_enter_write(&pfioctl_rw);
+   LIST_FOREACH_SAFE(w, &pf_ioctl_trans, pft_entry, s) {
+   if (w->pft_pid == p->p_p->ps_pid) {
+   LIST_REMOVE(w, pft_entry);
+   LIST_INSERT_HEAD(&tmp_list, w, pft_entry);
+   }
+   }
+   rw_exit_write(&pfioctl_rw);
+
+   while ((w = LIST_FIRST(&tmp_list)) != NULL) {
+   LIST_REMOVE(w, pft_entry);
+   pf_free_trans(w);
+   }
+   }
+
return (0);
 }
 
@@ -1142,10 +1172,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
return (EACCES);
}
 
-   if (flags & FWRITE)
-   rw_enter_write(&pfioctl_rw);
-   else
-   rw_enter_read(&pfioctl_rw);
+   rw_enter_write(&pfioctl_rw);
 
switch (cmd) {
 
@@ -3022,10 +3049,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
break;
}
 fail:
-   if (flags & FWRITE)
-   rw_exit_write(&pfioctl_rw);
-   else
-   rw_exit_read(&pfioctl_rw);
+   rw_exit_write(&pfioctl_rw);
 
return (error);
 }
@@ -3244,3 +3268,52 @@ pf_sysctl(void *oldp, size_t *oldlenp, void *newp, 
size_t newlen)
 
return sysctl_rdstruct(oldp, oldlenp, newp, &pfs, sizeof(pfs));
 }
+
+struct pf_trans *
+pf_open_trans(pid_t pid)
+{
+   static uint64_t ticket = 1;
+   struct pf_trans *t;
+
+   rw_assert_wrlock(&pfioctl_rw);
+
+   t = malloc(sizeof(*t), M_TEMP, M_WAITOK);
+   memset(t, 0, sizeof(struct pf_trans));
+   t->pft_pid = pid;
+   t->pft_ticket = ticket++;
+
+   LIST_INSERT_HEAD(&pf_ioctl_trans, t, pft_entry);
+
+   return (t);
+}
+
+struct pf_trans *
+pf_find_trans(uint64_t ticket)
+{
+   struct pf_trans *t;
+
+   rw_assert_anylock(&pfioctl_rw);
+
+   LIST_FOREACH(t, &pf_ioctl_trans, pft_entry) {
+   if (t->pft_ticket == ticket)
+   break;
+   }
+
+   return (t);
+}
+
+void
+pf_free_trans(struct pf_trans *t)
+{
+   free(t, M_TEMP, sizeof(*t));
+}
+
+void
+pf_rollback_trans(struct pf_trans *t)
+{
+   if (t != NULL) {
+   rw_assert_wrlock(&pfioctl_rw);
+   

DIOCGETRULE is slow for large rulesets (complete diff)

2023-04-25 Thread Alexandr Nedvedicky
Hello,

Below is a complete diff which makes DIOCGETRULE a bit more sane.
This is the complete change I'd like to commit. It is also more
convenient for people who want to test the diff.

When running 'pfctl -sr' to show rules, pfctl performs several
ioctl(2) calls. The first call is DIOCGETRULES, to find out
the number of rules in the ruleset. Then it performs 'n' DIOCGETRULE
commands to retrieve all rules from the ruleset one-by-one. Let's
look at the DIOCGETRULE code to see how bad things are for extremely
large rulesets:

1489 if (pr->ticket != ruleset->rules.active.ticket) {
1490 error = EBUSY;
1491 PF_UNLOCK();
1492 NET_UNLOCK();
1493 goto fail;
1494 }
1495 rule = TAILQ_FIRST(ruleset->rules.active.ptr);
1496 while ((rule != NULL) && (rule->nr != pr->nr))
1497 rule = TAILQ_NEXT(rule, entries);
1498 if (rule == NULL) {
1499 error = EBUSY;
1500 PF_UNLOCK();
1501 NET_UNLOCK();
1502 goto fail;
1503 }

To retrieve the n-th rule, the DIOCGETRULE ioctl iterates over the ruleset
from the beginning. This is OK for conventional rulesets. For large
rulesets the delay to retrieve all rules becomes noticeable.
To measure it I've created an extreme ruleset with 26200 rules.
It took ~10 seconds on -current:

current:
pf# time pfctl -sr |wc -l
   26200
0m10.80s real 0m00.38s user 0m10.44s system

When I apply the diff below I get a significantly better result:

fixed:
pf# time pfctl.test -sr |wc -l
   26200
0m00.49s real 0m00.15s user 0m00.31s system

The proposed fix introduces a transaction (pf_trans) to the pf(4)
ioctl(2) subsystem. I use transactions in my tree to update
the pf(4) configuration, but that part is not ready for review yet.

A process which opens /dev/pf may create one or more transactions.
A transaction is closed either when the process is done with the change
or when it closes /dev/pf. This should work, because /dev/pf
has been cloneable since last November [1].

The DIOCGETRULE operation uses the transaction to keep a reference
to the ruleset, the ruleset version and a pointer to the next rule to
retrieve, so we no longer need to iterate from the beginning to retrieve
the n-th rule.

When 'pfctl -sr' retrieves the ruleset it first issues the DIOCGETRULES
ioctl(2). It receives a ticket (transaction id) which must be passed
as a parameter to subsequent DIOCGETRULE requests. It continues
to read rules one-by-one using DIOCGETRULE until ENOENT is seen,
which indicates all rules have been retrieved.

The diff below is the complete change. I'm going to send partial
diffs for review, asking for OKs.

thanks and
regards
sashan

[1] https://marc.info/?l=openbsd-cvs&m=166773945126422&w=2

diff --git a/sbin/pfctl/pfctl.c b/sbin/pfctl/pfctl.c
index 8cbd9d77b4f..3787e97a6b1 100644
--- a/sbin/pfctl/pfctl.c
+++ b/sbin/pfctl/pfctl.c
@@ -837,7 +837,7 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
 char *anchorname, int depth, int wildcard, long shownr)
 {
struct pfioc_rule pr;
-   u_int32_t nr, mnr, header = 0;
+   u_int32_t header = 0;
int len = strlen(path), ret = 0;
char *npath, *p;
 
@@ -893,24 +893,9 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
goto error;
}
 
-   if (shownr < 0) {
-   mnr = pr.nr;
-   nr = 0;
-   } else if (shownr < pr.nr) {
-   nr = shownr;
-   mnr = shownr + 1;
-   } else {
-   warnx("rule %ld not found", shownr);
-   ret = -1;
-   goto error;
-   }
-   for (; nr < mnr; ++nr) {
-   pr.nr = nr;
-   if (ioctl(dev, DIOCGETRULE, &pr) == -1) {
-   warn("DIOCGETRULE");
-   ret = -1;
-   goto error;
-   }
+   while (ioctl(dev, DIOCGETRULE, &pr) != -1) {
+   if ((shownr != -1) && (shownr != pr.nr))
+   continue;
 
/* anchor is the same for all rules in it */
if (pr.rule.anchor_wildcard == 0)
@@ -968,6 +953,13 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
case PFCTL_SHOW_NOTHING:
break;
}
+   errno = 0;
+   }
+
+   if ((errno != 0) && (errno != ENOENT)) {
+   warn("DIOCGETRULE");
+   ret = -1;
+   goto error;
}
 
/*
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index 1141069dcf6..50286dbcb96 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -117,6 +117,15 @@ voidpf_qid_unref(u_int16_t);
 int pf_states_clr(struct pfioc_state_kill *);
 int 

DIOCGETRULE is slow for large rulesets (3/3)

2023-04-25 Thread Alexandr Nedvedicky
Hello,

This is the 3rd of three diffs to improve the DIOCGETRULE ioctl operation.
The summary of changes done here is as follows:
- add new members to pf_trans structure so it can
  hold data for DIOCGETRULE operation

- DIOCGETRULES opens transaction and returns ticket/transaction id
  to pfctl. pfctl uses ticket in DIOCGETRULE to identify transaction
  when it retrieves the rule

- the change introduces a new reference counter to pf_anchor, which is
  used by pf_trans to keep a reference to the pf_anchor object.

- DIOCGETRULES opens transaction and stores a reference to anchor
  with current ruleset version and pointer to the first rule

- DIOCGETRULE looks up the transaction and verifies the version number is
  the same, so it can safely access the next rule.

- pfctl, when handling 'pfctl -sr -R n', must iterate to the
  n-th rule one-by-one.

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sbin/pfctl/pfctl.c b/sbin/pfctl/pfctl.c
index 8cbd9d77b4f..3787e97a6b1 100644
--- a/sbin/pfctl/pfctl.c
+++ b/sbin/pfctl/pfctl.c
@@ -837,7 +837,7 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
 char *anchorname, int depth, int wildcard, long shownr)
 {
struct pfioc_rule pr;
-   u_int32_t nr, mnr, header = 0;
+   u_int32_t header = 0;
int len = strlen(path), ret = 0;
char *npath, *p;
 
@@ -893,24 +893,9 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
goto error;
}
 
-   if (shownr < 0) {
-   mnr = pr.nr;
-   nr = 0;
-   } else if (shownr < pr.nr) {
-   nr = shownr;
-   mnr = shownr + 1;
-   } else {
-   warnx("rule %ld not found", shownr);
-   ret = -1;
-   goto error;
-   }
-   for (; nr < mnr; ++nr) {
-   pr.nr = nr;
-   if (ioctl(dev, DIOCGETRULE, &pr) == -1) {
-   warn("DIOCGETRULE");
-   ret = -1;
-   goto error;
-   }
+   while (ioctl(dev, DIOCGETRULE, &pr) != -1) {
+   if ((shownr != -1) && (shownr != pr.nr))
+   continue;
 
/* anchor is the same for all rules in it */
if (pr.rule.anchor_wildcard == 0)
@@ -968,6 +953,13 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
case PFCTL_SHOW_NOTHING:
break;
}
+   errno = 0;
+   }
+
+   if ((errno != 0) && (errno != ENOENT)) {
+   warn("DIOCGETRULE");
+   ret = -1;
+   goto error;
}
 
/*
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index 7e4c7d2a2ab..50286dbcb96 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -122,6 +122,10 @@ struct pf_trans*pf_find_trans(uint64_t);
 voidpf_free_trans(struct pf_trans *);
 voidpf_rollback_trans(struct pf_trans *);
 
+voidpf_init_tgetrule(struct pf_trans *,
+   struct pf_anchor *, uint32_t, struct pf_rule *);
+voidpf_cleanup_tgetrule(struct pf_trans *t);
+
 struct pf_rule  pf_default_rule, pf_default_rule_new;
 
 struct {
@@ -1474,7 +1478,9 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
case DIOCGETRULES: {
struct pfioc_rule   *pr = (struct pfioc_rule *)addr;
struct pf_ruleset   *ruleset;
-   struct pf_rule  *tail;
+   struct pf_rule  *rule;
+   struct pf_trans *t;
+   u_int32_truleset_version;
 
NET_LOCK();
PF_LOCK();
@@ -1486,14 +1492,21 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
NET_UNLOCK();
goto fail;
}
-   tail = TAILQ_LAST(ruleset->rules.active.ptr, pf_rulequeue);
-   if (tail)
-   pr->nr = tail->nr + 1;
+   rule = TAILQ_LAST(ruleset->rules.active.ptr, pf_rulequeue);
+   if (rule)
+   pr->nr = rule->nr + 1;
else
pr->nr = 0;
-   pr->ticket = ruleset->rules.active.version;
+   ruleset_version = ruleset->rules.active.version;
+   pf_anchor_take(ruleset->anchor);
+   rule = TAILQ_FIRST(ruleset->rules.active.ptr);
PF_UNLOCK();
NET_UNLOCK();
+
+   t = pf_open_trans(p->p_p->ps_pid);
+   pf_init_tgetrule(t, ruleset->anchor, ruleset_version, rule);
+   pr->ticket = t->pft_ticket;
+
break;
}
 
@@ -1501,29 +1514,29 @@ pfioctl(dev_t dev, u_long cmd, 

Re: DIOCGETRULE is slow for large rulesets (2/3)

2023-04-26 Thread Alexandr Nedvedicky
Hello,

On Wed, Apr 26, 2023 at 11:51:26AM +, Gerhard Roth wrote:
> On Wed, 2023-04-26 at 13:47 +0200, Alexandr Nedvedicky wrote:

> > @@ -293,6 +300,28 @@ pfopen(dev_t dev, int flags, int fmt, struct proc *p)
> >  int
> >  pfclose(dev_t dev, int flags, int fmt, struct proc *p)
> >  {
> > +	struct pf_trans *w, *s;
> > +	LIST_HEAD(, pf_trans)	tmp_list;
> > +	uint32_t unit = minor(dev);
> > +
> > +	if (minor(dev) >= 1)
> > +		return (ENXIO);
> 
> If you want to use the minor dev-id to release the transaction,
> the above two lines should go away.
> 
> 

Yes, you are right. Thanks for spotting that.

Updated diff is below.

regards
sashan

8<---8<---8<--8<

diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index 1141069dcf6..74af2b74ecb 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -117,6 +117,11 @@ voidpf_qid_unref(u_int16_t);
 int pf_states_clr(struct pfioc_state_kill *);
 int pf_states_get(struct pfioc_states *);
 
+struct pf_trans*pf_open_trans(uint32_t);
+struct pf_trans*pf_find_trans(uint64_t);
+voidpf_free_trans(struct pf_trans *);
+voidpf_rollback_trans(struct pf_trans *);
+
 struct pf_rule  pf_default_rule, pf_default_rule_new;
 
 struct {
@@ -168,6 +173,8 @@ int  pf_rtlabel_add(struct pf_addr_wrap *);
 voidpf_rtlabel_remove(struct pf_addr_wrap *);
 voidpf_rtlabel_copyout(struct pf_addr_wrap *);
 
+uint64_t trans_ticket = 1;
+LIST_HEAD(, pf_trans)  pf_ioctl_trans = LIST_HEAD_INITIALIZER(pf_trans);
 
 void
 pfattach(int num)
@@ -293,6 +300,25 @@ pfopen(dev_t dev, int flags, int fmt, struct proc *p)
 int
 pfclose(dev_t dev, int flags, int fmt, struct proc *p)
 {
+   struct pf_trans *w, *s;
+   LIST_HEAD(, pf_trans)   tmp_list;
+   uint32_t unit = minor(dev);
+
+   LIST_INIT(&tmp_list);
+   rw_enter_write(&pfioctl_rw);
+   LIST_FOREACH_SAFE(w, &pf_ioctl_trans, pft_entry, s) {
+   if (w->pft_unit == unit) {
+   LIST_REMOVE(w, pft_entry);
+   LIST_INSERT_HEAD(&tmp_list, w, pft_entry);
+   }
+   }
+   rw_exit_write(&pfioctl_rw);
+
+   while ((w = LIST_FIRST(&tmp_list)) != NULL) {
+   LIST_REMOVE(w, pft_entry);
+   pf_free_trans(w);
+   }
+
return (0);
 }
 
@@ -1142,10 +1168,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
return (EACCES);
}
 
-   if (flags & FWRITE)
-   rw_enter_write(&pfioctl_rw);
-   else
-   rw_enter_read(&pfioctl_rw);
+   rw_enter_write(&pfioctl_rw);
 
switch (cmd) {
 
@@ -3022,10 +3045,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
break;
}
 fail:
-   if (flags & FWRITE)
-   rw_exit_write(&pfioctl_rw);
-   else
-   rw_exit_read(&pfioctl_rw);
+   rw_exit_write(&pfioctl_rw);
 
return (error);
 }
@@ -3244,3 +3264,52 @@ pf_sysctl(void *oldp, size_t *oldlenp, void *newp, 
size_t newlen)
 
return sysctl_rdstruct(oldp, oldlenp, newp, &pfs, sizeof(pfs));
 }
+
+struct pf_trans *
+pf_open_trans(uint32_t unit)
+{
+   static uint64_t ticket = 1;
+   struct pf_trans *t;
+
+   rw_assert_wrlock(&pfioctl_rw);
+
+   t = malloc(sizeof(*t), M_TEMP, M_WAITOK);
+   memset(t, 0, sizeof(struct pf_trans));
+   t->pft_unit = unit;
+   t->pft_ticket = ticket++;
+
+   LIST_INSERT_HEAD(&pf_ioctl_trans, t, pft_entry);
+
+   return (t);
+}
+
+struct pf_trans *
+pf_find_trans(uint64_t ticket)
+{
+   struct pf_trans *t;
+
+   rw_assert_anylock(&pfioctl_rw);
+
+   LIST_FOREACH(t, &pf_ioctl_trans, pft_entry) {
+   if (t->pft_ticket == ticket)
+   break;
+   }
+
+   return (t);
+}
+
+void
+pf_free_trans(struct pf_trans *t)
+{
+   free(t, M_TEMP, sizeof(*t));
+}
+
+void
+pf_rollback_trans(struct pf_trans *t)
+{
+   if (t != NULL) {
+   rw_assert_wrlock(&pfioctl_rw);
+   LIST_REMOVE(t, pft_entry);
+   pf_free_trans(t);
+   }
+}
diff --git a/sys/net/pfvar_priv.h b/sys/net/pfvar_priv.h
index 38fff6473aa..5af2027733a 100644
--- a/sys/net/pfvar_priv.h
+++ b/sys/net/pfvar_priv.h
@@ -322,6 +322,17 @@ enum {
 
 extern struct cpumem *pf_anchor_stack;
 
+enum pf_trans_type {
+   PF_TRANS_NONE,
+   PF_TRANS_MAX
+};
+
+struct pf_trans {
+	LIST_ENTRY(pf_trans)	pft_entry;
+	uint32_t		pft_unit;	/* minor(dev) */
+	uint64_t		pft_ticket;
+	enum pf_trans_type	pft_type;
+};
 extern struct task pf_purge_task;
 extern struct timeout  pf_purge_to;
 



Re: DIOCGETRULE is slow for large rulesets (2/3)

2023-04-26 Thread Alexandr Nedvedicky
Hello,


On Wed, Apr 26, 2023 at 11:37:58AM +1000, David Gwynne wrote:
> >  fail:
> > -   if (flags & FWRITE)
> > -   rw_exit_write(&pfioctl_rw);
> > -   else
> > -   rw_exit_read(&pfioctl_rw);
> > +   rw_exit_write(&pfioctl_rw);
> 
> i dont think having the open mode flags affect whether you take a shared
> or exclusive lock makes sense anyway, so im happy with these bits.

While updating my diff I've noticed pfclose() needs the
same change: dropping the 'flags & FWRITE' test. This is
fixed in the new diff below.


> 
>   int unit = minor(dev);
> 
>   if (unit & ((1 << CLONE_SHIFT) - 1))
>   return (ENXIO);
> 
> this has some ties into the second issue, which is that we shouldn't
> use the pid as an identifier for state across syscalls like this
> in the kernel. the reason is that userland could be weird or buggy
> (or hostile) and fork in between creating a transaction and ending
> a transaction, but it will still have the fd it got from open(/dev/pf).
> instead, we should be using the whole dev_t or the minor number,
> as long as it includes the bits that the cloning infrastructure
> uses, as the identifier.

in the new diff I'm using minor(dev) to identify the transaction owner.
> 
> i would also check that the process^Wminor number that created a
> transaction is the same when looking up a transaction in pf_find_trans.

the pf_find_trans() is dead code in the diff below, as it is
not used there. Just to make sure I got things right
so I can update the '3/3' diff in the stack accordingly, let's
assume the snippet below comes from the pfioctl() function:

int
pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, struct proc *p)
{
...
t = pf_find_trans(ticket);
if (t == NULL)
return (ENXIO);

KASSERT(t->pft_unit == minor(dev));

is that kind of check in KASSERT() something you have in mind?
perhaps I can trade the KASSERT() for regular code:

if (t->pft_unit != minor(dev))
return (EPERM);


thank you for the pointers to bpf.c. Updated diff is below.

regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index 1141069dcf6..b9904c545c5 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -117,6 +117,11 @@ voidpf_qid_unref(u_int16_t);
 int pf_states_clr(struct pfioc_state_kill *);
 int pf_states_get(struct pfioc_states *);
 
+struct pf_trans*pf_open_trans(uint32_t);
+struct pf_trans*pf_find_trans(uint64_t);
+voidpf_free_trans(struct pf_trans *);
+voidpf_rollback_trans(struct pf_trans *);
+
 struct pf_rule  pf_default_rule, pf_default_rule_new;
 
 struct {
@@ -168,6 +173,8 @@ int  pf_rtlabel_add(struct pf_addr_wrap *);
 voidpf_rtlabel_remove(struct pf_addr_wrap *);
 voidpf_rtlabel_copyout(struct pf_addr_wrap *);
 
+uint64_t trans_ticket = 1;
+LIST_HEAD(, pf_trans)  pf_ioctl_trans = LIST_HEAD_INITIALIZER(pf_trans);
 
 void
 pfattach(int num)
@@ -293,6 +300,28 @@ pfopen(dev_t dev, int flags, int fmt, struct proc *p)
 int
 pfclose(dev_t dev, int flags, int fmt, struct proc *p)
 {
+   struct pf_trans *w, *s;
+   LIST_HEAD(, pf_trans)   tmp_list;
+   uint32_t unit = minor(dev);
+
+   if (minor(dev) >= 1)
+   return (ENXIO);
+
+   LIST_INIT(&tmp_list);
+   rw_enter_write(&pf_rw);
+   LIST_FOREACH_SAFE(w, &pf_ioctl_trans, pft_entry, s) {
+   if (w->pft_unit == unit) {
+   LIST_REMOVE(w, pft_entry);
+   LIST_INSERT_HEAD(&tmp_list, w, pft_entry);
+   }
+   }
+   rw_exit_write(&pf_rw);
+
+   while ((w = LIST_FIRST(&tmp_list)) != NULL) {
+   LIST_REMOVE(w, pft_entry);
+   pf_free_trans(w);
+   }
+
return (0);
 }
 
@@ -1142,10 +1171,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
return (EACCES);
}
 
-   if (flags & FWRITE)
-   rw_enter_write(&pf_rw);
-   else
-   rw_enter_read(&pf_rw);
+   rw_enter_write(&pf_rw);
 
switch (cmd) {
 
@@ -3022,10 +3048,7 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
break;
}
 fail:
-   if (flags & FWRITE)
-   rw_exit_write(&pf_rw);
-   else
-   rw_exit_read(&pf_rw);
+   rw_exit_write(&pf_rw);
 
return (error);
 }
@@ -3244,3 +3267,52 @@ pf_sysctl(void *oldp, size_t *oldlenp, void *newp, 
size_t newlen)
 
return sysctl_rdstruct(oldp, oldlenp, newp, &pfs, sizeof(pfs));
 }
+

Re: DIOCGETRULE is slow for large rulesets (3/3)

2023-04-26 Thread Alexandr Nedvedicky
Hello,

On Wed, Apr 26, 2023 at 01:25:45AM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> this is 3rd of three diffs to improve DIOCGETRULE ioctl operation.
> the summary of changes done here is as follows:
> - add new members to pf_trans structure so it can
>   hold data for DIOCGETRULE operation
> 
> - DIOCGETRULES opens transaction and returns ticket/transaction id
>   to pfctl. pfctl uses ticket in DIOCGETRULE to identify transaction
>   when it retrieves the rule
> 
> - change introduces new reference counter to pf_anchor which is used
>   by pf_trans to keep reference to pf_anchor object.
> 
> - DIOCGETRULES opens transaction and stores a reference to anchor
>   with current ruleset version and pointer to the first rule
> 
> - DIOCGETRULE looks up transaction, verifies version number is
>   same so it can safely access next rule.
> 
> - pfctl when handling 'pfctl -sr -R n' must iterate to
>   n-th rule one-by-one.
> 

updating diff to accommodate changes in diff 2/3 [1] in the stack
of changes.

thanks and
regards
sashan

[1] https://marc.info/?l=openbsd-tech=168251129621089=2

8<---8<---8<--8<
diff --git a/sbin/pfctl/pfctl.c b/sbin/pfctl/pfctl.c
index 8cbd9d77b4f..3787e97a6b1 100644
--- a/sbin/pfctl/pfctl.c
+++ b/sbin/pfctl/pfctl.c
@@ -837,7 +837,7 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
 char *anchorname, int depth, int wildcard, long shownr)
 {
struct pfioc_rule pr;
-   u_int32_t nr, mnr, header = 0;
+   u_int32_t header = 0;
int len = strlen(path), ret = 0;
char *npath, *p;
 
@@ -893,24 +893,9 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
goto error;
}
 
-   if (shownr < 0) {
-   mnr = pr.nr;
-   nr = 0;
-   } else if (shownr < pr.nr) {
-   nr = shownr;
-   mnr = shownr + 1;
-   } else {
-   warnx("rule %ld not found", shownr);
-   ret = -1;
-   goto error;
-   }
-   for (; nr < mnr; ++nr) {
-   pr.nr = nr;
-   if (ioctl(dev, DIOCGETRULE, &pr) == -1) {
-   warn("DIOCGETRULE");
-   ret = -1;
-   goto error;
-   }
+   while (ioctl(dev, DIOCGETRULE, &pr) != -1) {
+   if ((shownr != -1) && (shownr != pr.nr))
+   continue;
 
/* anchor is the same for all rules in it */
if (pr.rule.anchor_wildcard == 0)
@@ -968,6 +953,13 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
pfctl_show format,
case PFCTL_SHOW_NOTHING:
break;
}
+   errno = 0;
+   }
+
+   if ((errno != 0) && (errno != ENOENT)) {
+   warn("DIOCGETRULE");
+   ret = -1;
+   goto error;
}
 
/*
diff --git a/sys/net/pf_ioctl.c b/sys/net/pf_ioctl.c
index 8ce0af53a89..d294a8a3b3c 100644
--- a/sys/net/pf_ioctl.c
+++ b/sys/net/pf_ioctl.c
@@ -122,6 +122,10 @@ struct pf_trans*pf_find_trans(uint64_t);
 voidpf_free_trans(struct pf_trans *);
 voidpf_rollback_trans(struct pf_trans *);
 
+voidpf_init_tgetrule(struct pf_trans *,
+   struct pf_anchor *, uint32_t, struct pf_rule *);
+voidpf_cleanup_tgetrule(struct pf_trans *t);
+
 struct pf_rule  pf_default_rule, pf_default_rule_new;
 
 struct {
@@ -1470,7 +1474,9 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
case DIOCGETRULES: {
struct pfioc_rule   *pr = (struct pfioc_rule *)addr;
struct pf_ruleset   *ruleset;
-   struct pf_rule  *tail;
+   struct pf_rule  *rule;
+   struct pf_trans *t;
+   u_int32_truleset_version;
 
NET_LOCK();
PF_LOCK();
@@ -1482,14 +1488,21 @@ pfioctl(dev_t dev, u_long cmd, caddr_t addr, int flags, 
struct proc *p)
NET_UNLOCK();
goto fail;
}
-   tail = TAILQ_LAST(ruleset->rules.active.ptr, pf_rulequeue);
-   if (tail)
-   pr->nr = tail->nr + 1;
+   rule = TAILQ_LAST(ruleset->rules.active.ptr, pf_rulequeue);
+   if (rule)
+   pr->nr = rule->nr + 1;
else
pr->nr = 0;
-   pr->ticket = ruleset->rules.active.version;
+   ruleset_version = ruleset->rules.active.version;
+

Re: pfioctl: DIOCGETQUEUE: drop net lock

2023-04-28 Thread Alexandr Nedvedicky
Hello,

On Fri, Apr 28, 2023 at 09:02:35PM +, Klemens Nanni wrote:
> Same logic and argument as for the parent *S ioctl, might as well have
> committed them together:
> ---
> Remove net lock from DIOCGETQUEUES
> 
> Both ticket and number of queues stem from the pf_queues_active list which
> is effectively static to pf_ioctl.c and fully protected by the pf lock.
> ---
> 
> OK?

looks good

OK sashan
> 
> 
> Index: pf_ioctl.c
> ===
> RCS file: /cvs/src/sys/net/pf_ioctl.c,v
> retrieving revision 1.401
> diff -u -p -U5 -r1.401 pf_ioctl.c
> --- pf_ioctl.c28 Apr 2023 14:08:38 -  1.401
> +++ pf_ioctl.c28 Apr 2023 20:52:54 -
> @@ -1229,32 +1229,28 @@ pfioctl(dev_t dev, u_long cmd, caddr_t a
>   case DIOCGETQUEUE: {
>   struct pfioc_queue  *pq = (struct pfioc_queue *)addr;
>   struct pf_queuespec *qs;
>   u_int32_tnr = 0;
>  
> - NET_LOCK();
>   PF_LOCK();
>   if (pq->ticket != pf_main_ruleset.rules.active.version) {
>   error = EBUSY;
>   PF_UNLOCK();
> - NET_UNLOCK();
>   goto fail;
>   }
>  
>   /* save state to not run over them all each time? */
>   qs = TAILQ_FIRST(pf_queues_active);
>   while ((qs != NULL) && (nr++ < pq->nr))
>   qs = TAILQ_NEXT(qs, entries);
>   if (qs == NULL) {
>   error = EBUSY;
>   PF_UNLOCK();
> - NET_UNLOCK();
>   goto fail;
>   }
>   memcpy(&pq->queue, qs, sizeof(pq->queue));
>   PF_UNLOCK();
> - NET_UNLOCK();
>   break;
>   }
>  
>   case DIOCGETQSTATS: {
>   struct pfioc_qstats *pq = (struct pfioc_qstats *)addr;
> 



Re: pfctl + bgpd for loop ugliness

2023-04-21 Thread Alexandr Nedvedicky
Hello,

On Tue, Apr 18, 2023 at 02:43:26PM +0200, Theo Buehler wrote:
> On Tue, Apr 18, 2023 at 02:06:46PM +0200, Claudio Jeker wrote:
> > This and the others are IIRC streight from pfctl. So if someone wants a
> > free commit :) 
> 
> How about this. pfctl and bgpd are the same, except that the bgpd one
> has a bsearch() nitems on top.  pfctl regress is happy.
> 
> Index: sbin/pfctl/pfctl_parser.c
> ===
> RCS file: /cvs/src/sbin/pfctl/pfctl_parser.c,v
> retrieving revision 1.347
> diff -u -p -r1.347 pfctl_parser.c
> --- sbin/pfctl/pfctl_parser.c 9 Nov 2022 23:00:00 -   1.347
> +++ sbin/pfctl/pfctl_parser.c 18 Apr 2023 12:37:19 -
> @@ -62,6 +62,10 @@
>  #include "pfctl_parser.h"
>  #include "pfctl.h"
>  
> +#ifndef nitems
> +#define nitems(_a)   (sizeof((_a)) / sizeof((_a)[0]))
> +#endif
> +

I like it.

OK sashan



Re: divert packet checksum

2023-04-04 Thread Alexandr Nedvedicky
Hello,

On Mon, Apr 03, 2023 at 11:39:54PM +0200, Alexander Bluhm wrote:
> Hi,
> 
> When sending IP packets to userland with divert-packet rules, the
> checksum may be wrong.  Locally generated packets diverted by pf
> out rules may have no checksum due to to hardware offloading.
> IDS/IPS systems may complain about that.
> 
> Calculate the checksum in that case.

makes sense to me.
> 
> ok?

OK sashan
> 
> bluhm



Re: pf.conf bug

2023-02-06 Thread Alexandr Nedvedicky
Hello,
[ cc'ing also tech@ ]

On Mon, Feb 06, 2023 at 06:44:38PM +0300, r...@bh0.amt.ru wrote:
> >Synopsis:pf.conf bug
> >Category:system
> >Environment:
>   System  : OpenBSD 7.2
>   Details : OpenBSD 7.2 (GENERIC.MP) #6: Sat Jan 21 01:03:04 MST 2023
>
> r...@syspatch-72-amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
> 
>   Architecture: OpenBSD.amd64
>   Machine : amd64
> >Description:
>   The pfctl utility incorrectly configures the PF device from
>   a pf.conf file. The output below shows that instead of loading
>   the rule
>   "block drop in quick inet6 proto ipv6-icmp all icmp6-type 255"
>   the pfctl loads the rule
>   "block drop in quick inet6 proto ipv6-icmp all".
> >How-To-Repeat:
>   bgps# cat ./pf.conf.test
>   block in quick inet6 proto icmp6 all icmp6-type {0, 127, 255}
>   pass quick inet6 proto icmp6 all no state
>   pass all
>   bgps# pfctl -f ./pf.conf.test
>   bgps# pfctl -s rules
>   block drop in quick inet6 proto ipv6-icmp all icmp6-type 0
>   block drop in quick inet6 proto ipv6-icmp all icmp6-type 127
>   block drop in quick inet6 proto ipv6-icmp all
>   pass quick inet6 proto ipv6-icmp all no state
>   pass all flags S/SA

patch below fixes the issue. We need to change the type of
icmp*type and icmp*code from u_int8_t to u_int16_t to make the
code correct. The thing is that the legit values are <0, 256>
at the moment. The '0' has a special meaning, matching 'any'
icmp type/code. The snippet below comes from parse.y and shows
how the logic works:
3198 icmptype: STRING{
3199 const struct icmptypeent*p;
3200
3201 if ((p = geticmptypebyname($1, AF_INET)) == NULL) {
3202 yyerror("unknown icmp-type %s", $1);
3203 free($1);
3204 YYERROR;
3205 }
3206 $$ = p->type + 1;
3207 free($1);
3208 }
3209 | NUMBER  {
3210 if ($1 < 0 || $1 > 255) {
3211 yyerror("illegal icmp-type %lld", $1);
3212 YYERROR;
3213 }
3214 $$ = $1 + 1;
3215 }
3216 ;
3217
3218 icmp6type   : STRING{

if we want to allow the firewall administrator to specify a match
on icmptype 255 then extending the type from uint8_t to uint16_t
is the right change.

another option is to change the logic here to allow matching icmptype
in the range <0, 254>.

either way is fine with me, however my preference is leaning
towards the type change.

note the diff also drops the pad[2] member to compensate for the change
of uint8_t to uint16_t for the type and code members. I'm not sure
about this bit, hence I'd like to bring it up here.

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sbin/pfctl/parse.y b/sbin/pfctl/parse.y
index 2c5a49772ff..898bded8c24 100644
--- a/sbin/pfctl/parse.y
+++ b/sbin/pfctl/parse.y
@@ -134,8 +134,8 @@ struct node_gid {
 };
 
 struct node_icmp {
-   u_int8_t code;
-   u_int8_t type;
+   u_int16_tcode;  /* aux. value 256 is legit */
+   u_int16_ttype;  /* aux. value 256 is legit */
u_int8_t proto;
struct node_icmp*next;
struct node_icmp*tail;
diff --git a/sys/net/pfvar.h b/sys/net/pfvar.h
index 1453bc35c04..1efb1b5221c 100644
--- a/sys/net/pfvar.h
+++ b/sys/net/pfvar.h
@@ -572,8 +572,8 @@ struct pf_rule {
u_int8_t keep_state;
sa_family_t  af;
u_int8_t proto;
-   u_int8_t type;
-   u_int8_t code;
+   u_int16_ttype;  /* aux. value 256 is legit */
+   u_int16_tcode;  /* aux. value 256 is legit */
u_int8_t flags;
u_int8_t flagset;
u_int8_t min_ttl;
@@ -592,7 +592,6 @@ struct pf_rule {
u_int8_t set_prio[2];
sa_family_t  naf;
u_int8_t rcvifnot;
-   u_int8_t pad[2];
 
struct {
struct pf_addr  addr;



Re: pf max-src-{states,conn} without overload/flush useless?

2023-02-08 Thread Alexandr Nedvedicky
Hello,

I did test similar rules on my system
OpenBSD 7.2-current (GENERIC.MP) #978: Sun Jan 22 11:41:04 MST 2023

these are my rules:

set skip on lo

block return# block stateless traffic
pass out log# establish keep-state
pass in on iwn0 proto tcp from 192.168.2.0/24 to self port 22 \
keep state (source-track rule, max-src-conn 3)

this is my test terminal on remote host:
router$ for i in `seq 5` ; do nc 192.168.2.175 22 &  done 
[1] 32472
[2] 97453
[3] 7192
[4] 50386
[5] 57517
router$ SSH-2.0-OpenSSH_9.1
SSH-2.0-OpenSSH_9.1
SSH-2.0-OpenSSH_9.1

there are three SSH version strings which indicates
I got 3 working connection of 5 attempts.

I'm not sure why it seems to work for me (ssh) while
it is broken for you (http). I'll give it a try with
a webserver.

On Wed, Feb 08, 2023 at 04:34:55PM -0600, joshua stein wrote:



> diff --git sys/net/pf.c sys/net/pf.c
> index 8cb1326a160..89703feab12 100644
> --- sys/net/pf.c
> +++ sys/net/pf.c
> @@ -481,12 +481,10 @@ pf_src_connlimit(struct pf_state **stp)
>   if ((sn = pf_get_src_node((*stp), PF_SN_NONE)) == NULL)
>   return (0);
>  
> - sn->conn++;
> - (*stp)->src.tcp_est = 1;
>   pf_add_threshold(&sn->conn_rate);
>  
>   if ((*stp)->rule.ptr->max_src_conn &&
> - (*stp)->rule.ptr->max_src_conn < sn->conn) {
> + sn->conn >= (*stp)->rule.ptr->max_src_conn) {
>   pf_status.lcounters[LCNT_SRCCONN]++;
>   bad++;
>   }
> @@ -497,8 +495,11 @@ pf_src_connlimit(struct pf_state **stp)
>   bad++;
>   }
>  
> - if (!bad)
> + if (!bad) {
> + sn->conn++;
> + (*stp)->src.tcp_est = 1;
>   return (0);
> + }
>  
>   if ((*stp)->rule.ptr->overload_tbl) {
>   struct pfr_addr p;

it seems to me the change to pf_src_connlimit() does
not alter behavior. I think change to pf_src_connlimit()
can be dropped.

> @@ -4919,14 +4920,14 @@ pf_tcp_track_full(struct pf_pdesc *pd, struct 
> pf_state **stp, u_short *reason,
>   pf_set_protostate(*stp, psrc, TCPS_CLOSING);
>   if (th->th_flags & TH_ACK) {
>   if (dst->state == TCPS_SYN_SENT) {
> - pf_set_protostate(*stp, pdst,
> - TCPS_ESTABLISHED);
> - if (src->state == TCPS_ESTABLISHED &&
> + if (src->state == TCPS_SYN_SENT &&
>   !SLIST_EMPTY(&(*stp)->src_nodes) &&
>   pf_src_connlimit(stp)) {
>   REASON_SET(reason, PFRES_SRCLIMIT);
>   return (PF_DROP);
>   }
> + pf_set_protostate(*stp, pdst,
> + TCPS_ESTABLISHED);
>   } else if (dst->state == TCPS_CLOSING)
>   pf_set_protostate(*stp, pdst,
>   TCPS_FIN_WAIT_2);
> 

If I understand things right, then in current pf we check the conn limit
when we see the server's acknowledgment of the client's 3rd ACK. Not sure
if I sound clear here, so let me draw a little diagram:

SYN > sent by client moves state to

TCPS_SYN_SENT : TCPS_LISTEN/TCPS_CLOSED (1)

SYN | ACK < sent by server moves state to

TCPS_SYN_SENT : TCPS_SYN_SENT (2)

ACK > 3rd ack sent by client move state to:

TCPS_ESTABLISHED : TCPS_SYN_SENT (3)

ACK <  server acknowledges client's 3rd ack moves state to

TCPS_ESTABLISHED : TCPS_ESTABLISHED (4)

currently we do the conn limit check in step (4). Your change moves this
to the earlier step (3) (given I understand things right here).
It's awfully early here; I need to sleep on this.

can you give it a try with your slightly modified diff? Just drop the
changes to pf_src_connlimit() and keep those in pf_tcp_track_full(), which
I believe is the only relevant part.

thanks and
regards
sashan



Re: pf max-src-{states,conn} without overload/flush useless?

2023-02-09 Thread Alexandr Nedvedicky
Hello,

On Wed, Feb 08, 2023 at 09:42:11PM -0600, joshua stein wrote:

> $ for i in `seq 5` ; do nc 192.168.1.240 22 &  done
> [2] 68892
> [3] 6303
> [4] 63554
> [5] 87833
> [6] 49997
> $ SSH-2.0-OpenSSH_9.1
> SSH-2.0-OpenSSH_9.1
> SSH-2.0-OpenSSH_9.1
> SSH-2.0-OpenSSH_9.1
> SSH-2.0-OpenSSH_9.1
> 
> vm:~$ doas pfctl -sr
> block return all
> pass out all flags S/SA
> pass in on egress inet6 proto tcp from any to ::1 port = 22 flags S/SA keep 
> state (source-track rule, max-src-conn 3)
> pass in on egress inet proto tcp from any to 127.0.0.1 port = 22 flags S/SA 
> keep state (source-track rule, max-src-conn 3)
> pass in on egress inet proto tcp from any to 192.168.1.240 port = 22 flags 
> S/SA keep state (source-track rule, max-src-conn 3)
> 
> This is with:
> 
> OpenBSD 7.2-current (GENERIC.MP) #2014: Tue Feb  7 16:24:04 MST 2023
> dera...@arm64.openbsd.org:/usr/src/sys/arch/arm64/compile/GENERIC.MP


I gave it a try after doing a sysupgrade to:

OpenBSD 7.2-current (GENERIC.MP) #1025: Wed Feb  8 19:16:09 MST 2023

it still works for me as expected:
disk$ for i in `seq 5` ; do nc 192.168.2.175 22 & done
[1] 51566
[2] 78983
[3] 77864
[4] 37474
[5] 98599
disk$ SSH-2.0-OpenSSH_9.2
SSH-2.0-OpenSSH_9.2
SSH-2.0-OpenSSH_9.2

my connection arrives over the iwn0 interface, which is in the egress group,
so our environments are almost identical.



> 
> > > diff --git sys/net/pf.c sys/net/pf.c
> > > index 8cb1326a160..89703feab12 100644
> > > --- sys/net/pf.c
> > > +++ sys/net/pf.c
> > > @@ -481,12 +481,10 @@ pf_src_connlimit(struct pf_state **stp)
> > >   if ((sn = pf_get_src_node((*stp), PF_SN_NONE)) == NULL)
> > >   return (0);
> > >  
> > > - sn->conn++;
> > > - (*stp)->src.tcp_est = 1;
> > >   pf_add_threshold(&sn->conn_rate);
> > >  
> > >   if ((*stp)->rule.ptr->max_src_conn &&
> > > - (*stp)->rule.ptr->max_src_conn < sn->conn) {
> > > + sn->conn >= (*stp)->rule.ptr->max_src_conn) {
> > >   pf_status.lcounters[LCNT_SRCCONN]++;
> > >   bad++;
> > >   }
> > > @@ -497,8 +495,11 @@ pf_src_connlimit(struct pf_state **stp)
> > >   bad++;
> > >   }
> > >  
> > > - if (!bad)
> > > + if (!bad) {
> > > + sn->conn++;
> > > + (*stp)->src.tcp_est = 1;
> > >   return (0);
> > > + }
> > >  
> > >   if ((*stp)->rule.ptr->overload_tbl) {
> > >   struct pfr_addr p;
> > 
> > it seems to me the change to pf_src_connlimit() does
> > not alter behavior. I think change to pf_src_connlimit()
> > can be dropped.
> 
> But don't we not want to increment the source node's connection 
> count since we're not going to accept the connection (in the !bad 
> case)?  I'm not sure what kind of bookkeeping that would screw up.
> 

what we currently do is always bump the connection count
for the source node we found. Then we check the limit
(*stp)->rule.ptr->max_src_conn < sn->conn
and if the limit is exceeded we mark the state as closed and expired (timeout
PFTM_PURGE). We also report that to the caller, which should close the
connection.

your change stops counting connections as soon as the limit is
reached. So now I see there is a change in behavior. I missed
that yesterday. I'm not able to tell if we want to go that way or not.


> > currently we do conn limit check in step (4). Your change moves this
> > to earlier step (3) (given I understand things right here).
> > It's awfully early here I need sleep on this.
> 
> Yes, that was my understanding too.  We wait until the remote has 
> done enough work to be a valid connection but then block it before 
> sending the final ack.
> 
> > can you give it a try with your slightly modified diff? just drop
> > changes to pf_src_connlimit() and keep those in pf_tcp_track_full() which
> > I believe is the only relevant part.
> 
> Yes, it still works, and only allows me 3 connections with the final 
> 2 timing out as expected:
> 
> $ for i in `seq 5` ; do nc 192.168.1.240 22 &  done
> [2] 10193
> [3] 30197
> [4] 72235
> [5] 69900
> [6] 99044
> $ SSH-2.0-OpenSSH_9.1
> SSH-2.0-OpenSSH_9.1
> SSH-2.0-OpenSSH_9.1
> [5]  - exit 1 nc 192.168.1.240 22
> $
> [6]  + exit 1 nc 192.168.1.240 22
> 

the only explanation why it does not work for you is latency. The packets
which match a state run as readers. So if we call pf_set_protostate() before
we actually check the limit then it might be the cause here. I admit it's
rather speculation.

can you give the diff below a try?

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/sys/net/pf.c b/sys/net/pf.c
index 8cb1326a160..f81b0c793ce 100644
--- a/sys/net/pf.c
+++ b/sys/net/pf.c
@@ -4919,14 +4919,14 @@ pf_tcp_track_full(struct pf_pdesc *pd, struct pf_state 
**stp, u_short *reason,
pf_set_protostate(*stp, psrc, TCPS_CLOSING);
if (th->th_flags & TH_ACK) {
if (dst->state == 

Re: bpf(4) "wait" timeouts

2023-02-12 Thread Alexandr Nedvedicky
Hello,

This diff looks good to me, though I still have some
doubts about the accuracy of the comment here:

>   return (kn->kn_data > 0);
> @@ -1510,6 +1599,15 @@ bpf_catchpacket(struct bpf_d *d, u_char 
>   ++d->bd_dcount;
>   return;
>   }
> +
> + /*
> +  * there's a small chance bpf_wait_cb is running
> +  * concurrently with this and two wakeups will be
> +  * generated.
> +  */
> + if (timeout_del(&d->bd_wait_tmo))
> + bpf_put(d);
> +
>   ROTATE_BUFFERS(d);
>   do_wakeup = 1;
>   curlen = 0;
> @@ -1530,12 +1628,27 @@ bpf_catchpacket(struct bpf_d *d, u_char 

I'm still failing to spot the race the comment is talking
about. The code in bpf_wait_cb() grabs the `d->bd_mtx` mutex, the
same mutex which is held by the bpf_catchpacket() caller.

However, my doubts about the comment should not prevent you from
committing code which solves the issue and looks good.

OK sashan



fix NULL pointer dereference in pfsync_bulk_update()

2023-02-12 Thread Alexandr Nedvedicky
Hello,

this bug has been found and reported by Hrvoje@ [1].
I took my chance and asked Hrvoje to test a small diff [2].
I would like to ask for an OK to commit this fix, which makes
Hrvoje's test box happy. The diff below is the same as the one found
at bugs@. The thing is that the pfsync_bulk_update() function
must first check whether there is anything to update, otherwise
we may die due to a NULL pointer dereference.

thanks and
regards
sashan

[1] https://marc.info/?l=openbsd-bugs=167578573111413=2

[2] https://marc.info/?l=openbsd-bugs=167584283809140=2

8<---8<---8<--8<
diff --git a/sys/net/if_pfsync.c b/sys/net/if_pfsync.c
index e2c86971336..1fa58f6fab9 100644
--- a/sys/net/if_pfsync.c
+++ b/sys/net/if_pfsync.c
@@ -2464,6 +2464,11 @@ pfsync_bulk_update(void *arg)
st = sc->sc_bulk_next;
sc->sc_bulk_next = NULL;
 
+   if (st == NULL) {
+   rw_exit_read(&pf_state_list.pfs_rwl);
+   goto out;
+   }
+
for (;;) {
if (st->sync_state == PFSYNC_S_NONE &&
st->timeout < PFTM_MAX &&



Re: fix NULL pointer dereference in pfsync_bulk_update()

2023-02-14 Thread Alexandr Nedvedicky
Hello,

On Tue, Feb 14, 2023 at 01:23:16AM +0100, Alexander Bluhm wrote:
> On Mon, Feb 13, 2023 at 08:39:39AM +0100, Alexandr Nedvedicky wrote:
> > this bug has been found and reported by Hrvoje@ [1].
> > I took my chance and asked Hrvoje to test a small diff [2].
> > I would like to ask for OK to commit this fix which makes
> > Hrvoje's test box happy. Diff below is same to one found
> > at bugs@. The thing is that pfsync_bulk_update() function
> > must check first if there is anything to update, otherwise
> > we may die due to NULL pointer dereference.
> 
> Your check makes sense, OK bluhm@
> 
> But who is protecting write access to sc->sc_bulk_next ?
> 
> I think it is exclusive netlock.  This works as pfsync_input() is
> a protocol input function which is still not running in parallel.

yes, NET_LOCK() is still an exclusive lock, so it should provide
sufficient protection.

> 
> rw_enter_read(&pf_state_list.pfs_rwl) does not protect sc->sc_bulk_next
> it is a read lock.  mtx_enter(&pf_state_list.pfs_mtx) is not taken
> for all writers to sc->sc_bulk_next.
> 
> Do you have plans to relax this code from exclusive to shared
> netlock?

dlg@ has new code for pfsync. I've seen his code back in December(?)
last year. Support for bulk transfers was the missing piece.

I think the option we have here in pfsync(4) is to introduce a new rw-lock
private to if_pfsync and remove NET_LOCK() here completely.

sashan



Re: DIOCGETRULE is slow for large rulesets (complete diff)

2023-04-28 Thread Alexandr Nedvedicky
Hello,

On Fri, Apr 28, 2023 at 11:18:18AM +, Klemens Nanni wrote:
> On Fri, Apr 28, 2023 at 12:55:37PM +0200, Alexandr Nedvedicky wrote:
> > Hello,
> > [ sending off-list ]
> > 
> > up-to-date complete diff.
> 
> Builds, boots and works faster:
> 
>   $ time jot -w 'pass proto tcp to port ' $((20 * 1000)) 1024 | pfctl 
> -vnf- -onone
>   ...
>   0m02.62s real 0m00.55s user 0m02.06s system
> 
> Before it took about 10s real time.
> 
> OK kn, nits inline.
> 

will commit the whole diff.


> > @@ -968,6 +953,13 @@ pfctl_show_rules(int dev, char *path, int opts, enum 
> > pfctl_show format,
> > case PFCTL_SHOW_NOTHING:
> > break;
> > }
> > +   errno = 0;
> > +   }
> > +
> > +   if (errno != 0 && errno != ENOENT) {
> 
> I'd drop ' != 0', it is an int and almost always checked as boolean like this.
> 

I'll keep the code as-is, just to keep both checks
in a consistent shape. I don't like the appearance of

if (errno && errno != ENOENT)


I'll address all other nits pointed out by Klemens.

thanks and
regards
sashan



Re: huge pfsync rewrite

2023-07-03 Thread Alexandr Nedvedicky
Hello,

I went through the recent diff one more time. I could not spot
anything wrong. Also my home router was happy with it for
quite some time. Unfortunately I'm not using pfsync.
However I'm sure hrvoje@ did his best to try to make it
crash, and with no luck: the diff survived.

Having said earlier that it would be more risky if dlg@ were to slice
this chunk into smaller diffs, it is best to commit the
whole change.


OK sashan



Re: pf(4) should mention DIOCXEND

2023-07-05 Thread Alexandr Nedvedicky
Hello,


On Wed, Jul 05, 2023 at 11:10:11AM +0200, Alexandr Nedvedicky wrote:

> 
> thanks for your help to put my update to pf(4) to shape.
> updated diff is below.
> 

the diff in my earlier email was wrong. This one is the right one.

sorry for the extra noise.

regards
sashan

8<---8<---8<--8<
diff --git a/share/man/man4/pf.4 b/share/man/man4/pf.4
index 92eeb45f657..7346c7e3194 100644
--- a/share/man/man4/pf.4
+++ b/share/man/man4/pf.4
@@ -48,12 +48,25 @@ and retrieve statistics.
 The most commonly used functions are covered by
 .Xr pfctl 8 .
 .Pp
-Manipulations like loading a ruleset that involve more than a single
+Operations loading or reading a ruleset that involve more than a single
 .Xr ioctl 2
 call require a so-called
-.Em ticket ,
-which prevents the occurrence of
-multiple concurrent manipulations.
+.Sy ticket ,
+which allows
+.Xr pf 4
+to deal with concurrent operations.
+For certain
+.Xr ioctl 2
+commands (currently
+.Dv DIOCGETRULES )
+the number of tickets a program can get is limited.
+The program must explicitly release the ticket using the
+.Dv DIOCXEND
+command to avoid hitting the limit.
+All tickets which are not freed by
+.Dv DIOCXEND
+are released when the program closes
+.Pa /dev/pf .
 .Pp
 Fields of
 .Xr ioctl 2
@@ -132,6 +145,9 @@ for subsequent
 calls and the number
 .Va nr
 of rules in the active ruleset.
+The ticket should be released by the
+.Dv DIOCXEND
+command.
 .It Dv DIOCGETRULE Fa "struct pfioc_rule *pr"
 Get a
 .Va rule
@@ -792,6 +808,10 @@ inactive rulesets since the last
 .Dv DIOCXBEGIN .
 .Dv DIOCXROLLBACK
 will silently ignore rulesets for which the ticket is invalid.
+.It Dv DIOCXEND Fa "u_int32_t *ticket"
+Release ticket obtained by the
+.Dv DIOCGETRULES
+command.
 .It Dv DIOCSETHOSTID Fa "u_int32_t *hostid"
 Set the host ID, which is used by
 .Xr pfsync 4



Re: pf(4) should mention DIOCXEND

2023-07-05 Thread Alexandr Nedvedicky
Hello,

diff below includes suggestions from jmc@ and kn@
I'll commit it later today.

thanks and
regards
sashan

8<---8<---8<--8<
diff --git a/share/man/man4/pf.4 b/share/man/man4/pf.4
index 92eeb45f657..e9d4231fe97 100644
--- a/share/man/man4/pf.4
+++ b/share/man/man4/pf.4
@@ -48,12 +48,25 @@ and retrieve statistics.
 The most commonly used functions are covered by
 .Xr pfctl 8 .
 .Pp
-Manipulations like loading a ruleset that involve more than a single
+Operations loading or reading a ruleset that involve more than a single
 .Xr ioctl 2
 call require a so-called
-.Em ticket ,
-which prevents the occurrence of
-multiple concurrent manipulations.
+.Sy ticket ,
+which allows
+.Nm
+to deal with concurrent operations.
+For certain
+.Xr ioctl 2
+commands (currently
+.Dv DIOCGETRULES )
+the number of tickets a program can get is limited.
+The programs must explicitly release their tickets using the
+.Dv DIOCXEND
+command to avoid hitting the limit.
+All tickets which are not freed by
+.Dv DIOCXEND
+are released when the program closes
+.Pa /dev/pf .
 .Pp
 Fields of
 .Xr ioctl 2
@@ -132,6 +145,9 @@ for subsequent
 calls and the number
 .Va nr
 of rules in the active ruleset.
+The ticket should be released by the
+.Dv DIOCXEND
+command.
 .It Dv DIOCGETRULE Fa "struct pfioc_rule *pr"
 Get a
 .Va rule
@@ -792,6 +808,10 @@ inactive rulesets since the last
 .Dv DIOCXBEGIN .
 .Dv DIOCXROLLBACK
 will silently ignore rulesets for which the ticket is invalid.
+.It Dv DIOCXEND Fa "u_int32_t *ticket"
+Release the ticket obtained by the
+.Dv DIOCGETRULES
+command.
 .It Dv DIOCSETHOSTID Fa "u_int32_t *hostid"
 Set the host ID, which is used by
 .Xr pfsync 4



Re: pf(4) should mention DIOCXEND

2023-07-05 Thread Alexandr Nedvedicky
Hello Jason,

thank you for taking a look. More comments in-line.
On Tue, Jul 04, 2023 at 09:03:29PM +0100, Jason McIntyre wrote:

> > @@ -48,12 +48,25 @@ and retrieve statistics.
> >  The most commonly used functions are covered by
> >  .Xr pfctl 8 .
> >  .Pp
> > -Manipulations like loading a ruleset that involve more than a single
> > +Operations like loading or reading a ruleset that involve more than a 
> > single
> 
> you probably don;t need to add "or reading", since you already indicate
> that it is just an example ("like"), not an exhaustive list. or is there
> a specific reason to list reading a ruleset?

I opted for: 'Operations loading or reading a ruleset...'
The thing is I really do want 'reading' to appear there.
Previously we did not need to do anything after reading
of the rules was done. It's different now, because in some
cases (snmpd et al.) we must call DIOCXEND on the ticket.
> 
> >  .Xr ioctl 2
> >  call require a so-called
> >  .Em ticket ,
> 
> should probably be Sy rather than Em, but don;t sweat it if such a
> change would make the rest of the manual inconsistent.

looks like we can go for 'Sy' here, because this was the
only occurrence of '.Em ticket'. Later we use '.Va ticket',
but that is in a context where ioctl(2) operations are described.
> 
> > -which prevents the occurrence of
> > -multiple concurrent manipulations.
> > +which allows
> > +.Xr pf 4
> > +to deal with concurrent operations.
> > +For certain
> > +.Xr ioctl 2
> > +commands (currently
> > +.Dv DIOCGETRULES )
> > +the number of tickets application can obtain is limited.
> 
> i'm not sure what this means. tickets per application? "tickets
> application" does not read correctly.

Maybe if I say:
'the number of tickets each program can obtain is limited.'
What I want to say is that there is a limit on the number of
tickets each program can obtain via DIOCGETRULES. To avoid hitting
the limit, the program must use DIOCXEND to return the ticket to
the system when the multi-ioctl operation is done.
> 
> > +The application must explicitly release the ticket using
> 
> s/using/using the/

will use 'using the'

> > +.Dv DIOCXEND
> > +command to avoid hitting the limit.
> > +All tickets which are not freed by
> > +.Dv DIOCXEND
> > +are released when application closes
> 
> s/application/the application/

opting for 'the program' to be consistent with the new wording.

> 
> > +.Pa /dev/pf .
> >  .Pp
> >  Fields of
> >  .Xr ioctl 2
> > @@ -132,6 +145,9 @@ for subsequent
> >  calls and the number
> >  .Va nr
> >  of rules in the active ruleset.
> > +The ticket should be released by
> 
> s/by/by the/
> 

opting for 'released by the'

> or maybe just "released by DIOCXEND".
> 
> > +.Dv DIOCXEND
> > +command.
> >  .It Dv DIOCGETRULE Fa "struct pfioc_rule *pr"
> >  Get a
> >  .Va rule
> > @@ -792,6 +808,10 @@ inactive rulesets since the last
> >  .Dv DIOCXBEGIN .
> >  .Dv DIOCXROLLBACK
> >  will silently ignore rulesets for which the ticket is invalid.
> > +.It Dv DIOCXEND Fa "u_int32_t *ticket"
> > +Release ticket obtained by
> > +.Dv DIOCGETRULES
> > +command.
> 
> again, either "by the XXX command" or "by XXX".

going for 'by the XXX command'

thanks for your help putting my update to pf(4) into shape.
The updated diff is below.

regards
sashan

8<---8<---8<--8<
diff --git a/lib/libcrypto/ecdsa/ecs_ossl.c b/lib/libcrypto/ecdsa/ecs_ossl.c
index 0ca2651f255..4bc77a49204 100644
--- a/lib/libcrypto/ecdsa/ecs_ossl.c
+++ b/lib/libcrypto/ecdsa/ecs_ossl.c
@@ -1,4 +1,4 @@
-/* $OpenBSD: ecs_ossl.c,v 1.71 2023/07/04 15:09:31 tb Exp $ */
+/* $OpenBSD: ecs_ossl.c,v 1.68 2023/07/04 10:53:42 tb Exp $ */
 /*
  * Written by Nils Larsch for the OpenSSL project
  */
@@ -122,11 +122,6 @@ ossl_ecdsa_sign(int type, const unsigned char *digest, int digest_len,
return ret;
 }
 
-/*
- * FIPS 186-5, section 6.4.1, steps 3-8 and 11: Generate k, calculate r and
- * kinv, and clear it. If r == 0, try again with a new random k.
- */
-
 int
 ossl_ecdsa_sign_setup(EC_KEY *key, BN_CTX *in_ctx, BIGNUM **out_kinv,
 BIGNUM **out_r)
@@ -198,9 +193,7 @@ ossl_ecdsa_sign_setup(EC_KEY *key, BN_CTX *in_ctx, BIGNUM **out_kinv,
!BN_set_bit(x, order_bits))
goto err;
 
-   /* Step 11: repeat until r != 0. */
do {
-   /* Step 3: generate random k. */
if (!bn_rand_interval(k, BN_value_one(), order)) {
ECDSAerror(ECDSA_R_RANDOM_NUMBER_GENERATION_FAILED);
goto err;
@@ -227,25 +220,22 @@ ossl_ecdsa_sign_setup(EC_KEY *key, BN_CTX *in_ctx, BIGNUM **out_kinv,
 
BN_set_flags(k, BN_FLG_CONSTTIME);
 
-   /* Step 5: P = k * G. */
+   /* Compute r, the x-coordinate of G * k. */
if (!EC_POINT_mul(group, point, k, NULL, NULL, ctx)) {
ECDSAerror(ERR_R_EC_LIB);
goto err;

pf(4) should mention DIOCXEND

2023-07-04 Thread Alexandr Nedvedicky
Hello,

diff below updates pf(4) manpage to reflect changes [1] which
were committed earlier today.

does the update to pf(4) read OK?

thanks and
regards
sashan

[1] https://marc.info/?l=openbsd-cvs=168848058603797=2
https://marc.info/?l=openbsd-cvs=168847042626997=2

8<---8<---8<--8<

diff --git a/share/man/man4/pf.4 b/share/man/man4/pf.4
index 92eeb45f657..305c536b137 100644
--- a/share/man/man4/pf.4
+++ b/share/man/man4/pf.4
@@ -48,12 +48,25 @@ and retrieve statistics.
 The most commonly used functions are covered by
 .Xr pfctl 8 .
 .Pp
-Manipulations like loading a ruleset that involve more than a single
+Operations like loading or reading a ruleset that involve more than a single
 .Xr ioctl 2
 call require a so-called
 .Em ticket ,
-which prevents the occurrence of
-multiple concurrent manipulations.
+which allows
+.Xr pf 4
+to deal with concurrent operations.
+For certain
+.Xr ioctl 2
+commands (currently
+.Dv DIOCGETRULES )
+the number of tickets application can obtain is limited.
+The application must explicitly release the ticket using
+.Dv DIOCXEND
+command to avoid hitting the limit.
+All tickets which are not freed by
+.Dv DIOCXEND
+are released when application closes
+.Pa /dev/pf .
 .Pp
 Fields of
 .Xr ioctl 2
@@ -132,6 +145,9 @@ for subsequent
 calls and the number
 .Va nr
 of rules in the active ruleset.
+The ticket should be released by
+.Dv DIOCXEND
+command.
 .It Dv DIOCGETRULE Fa "struct pfioc_rule *pr"
 Get a
 .Va rule
@@ -792,6 +808,10 @@ inactive rulesets since the last
 .Dv DIOCXBEGIN .
 .Dv DIOCXROLLBACK
 will silently ignore rulesets for which the ticket is invalid.
+.It Dv DIOCXEND Fa "u_int32_t *ticket"
+Release ticket obtained by
+.Dv DIOCGETRULES
+command.
 .It Dv DIOCSETHOSTID Fa "u_int32_t *hostid"
 Set the host ID, which is used by
 .Xr pfsync 4



Re: pfioctl: drop net lock from SIOC{S,G}LIMIT

2023-05-25 Thread Alexandr Nedvedicky
Hello,

On Thu, May 25, 2023 at 07:14:54AM +, Klemens Nanni wrote:
> On Thu, May 25, 2023 at 03:28:45AM +, Klemens Nanni wrote:
> > On Thu, May 25, 2023 at 03:20:04AM +, Klemens Nanni wrote:
> > > pfsync_in_bus() looks like the only place where the static array
> > > pf_pool_limits[] is accessed without the pf lock, so grab it there.
> > > 
> > > Limits themselves are protected by the pf lock and pool(9)s are never
> > > destroyed and have builtint per-pool locks, so the net lock is not
> > > needed.
> > > 
> > > (pf_pool_limits[] access in DIOCXCOMMIT remains under pf *and net* lock
> > >  until the rest in there gets pulled out of the net lock.)
> > > 
> > > Feedback? OK?
> > 
> > Correct diff without typo and with missing locking comment.
> 
> Diffing pfvar.h instead of pfvar_priv.h helps to get the comment hunk...

looks good to me.

OK sashan



pf(4) may cause relayd(8) to abort

2023-07-31 Thread Alexandr Nedvedicky
Hello,

the issue was reported by Gianni Kapetanakis a month ago [1]. It took
several emails to figure out that relayd(8) exits after hosts get disabled
by 'relayctl host dis ...'

The thing is that relayd(8) relies on pf(4) to create persistent
tables (PFR_TFLAG_PERSIST), as relayd requests that:

 47 void
 48 init_tables(struct relayd *env)
 49 {
 ...
 62 TAILQ_FOREACH(rdr, env->sc_rdrs, entry) {
 63 if (strlcpy(tables[i].pfrt_anchor, RELAYD_ANCHOR "/",
 64 sizeof(tables[i].pfrt_anchor)) >= PF_ANCHOR_NAME_SIZE)
 65 goto toolong;
 66 if (strlcat(tables[i].pfrt_anchor, rdr->conf.name,
 67 sizeof(tables[i].pfrt_anchor)) >= PF_ANCHOR_NAME_SIZE)
 68 goto toolong;
 69 if (strlcpy(tables[i].pfrt_name, rdr->conf.name,
 70 sizeof(tables[i].pfrt_name)) >=
 71 sizeof(tables[i].pfrt_name))
 72 goto toolong;
 73 tables[i].pfrt_flags |= PFR_TFLAG_PERSIST;
 74 i++;
 75 }

Unfortunately that's not the case, as further investigation revealed [2].

the issue can easily be reproduced with pfctl(8), which also creates
persistent tables on behalf of the command line:

pfctl -t foo -T add ...

The command above always asks pf(4) to create a persistent table;
however, pf(4) does not honor the persistent flag when the table
already exists. One can verify that using commands as follows:

## create 'referenced' table <foo> only (table exists but has no active flag)
# echo 'pass in from <foo> to any' | pfctl -f -
# pfctl -sT -vg
r-- foo
# create instance of table <foo> using command line:
# pfctl -t foo -T add 192.168.1.0/24
1/1 addresses added.
# pfctl -sT -vg
--a-r-- foo
## create instance of table <bar>, note the table will get 'p' flag
# pfctl -t bar -T add 192.168.10.0/24
1 table created.
1/1 addresses added.
# pfctl -sT -vg
-pa bar
--a-r-- foo

one-liner change to sys/net/pf_table.c fixes that it also works for Gianni
Kapetanakis. I'm also adding tests to regress/sys/net/pf_table/Makefile
to cover it.

On a system running -current the test fails with an error as follows:

pfctl -a regress/ttest -t instance -T add 192.168.1.0/24
1/1 addresses added.
pfctl -a regress/ttest -sT -vg | diff table-persist.out -
1c1
< -pa-r--   instance   regress/ttest
---
> --a-r--   instance   regress/ttest
*** Error 1 in . (Makefile:96 'flags')
FAILED

The failure is expected on a system without the patch. With the
patch applied, all tests pass.

OK to commit?

thanks and
regards
sashan


[1] https://marc.info/?t=16881127045=1=2

[2] https://marc.info/?l=openbsd-bugs=168868165801905=2

8<---8<---8<--8<
diff --git a/sys/net/pf_table.c b/sys/net/pf_table.c
index 6f23a6f795d..c862c804f84 100644
--- a/sys/net/pf_table.c
+++ b/sys/net/pf_table.c
@@ -1565,8 +1565,10 @@ pfr_add_tables(struct pfr_table *tbl, int size, int *nadd, int flags)
xadd++;
} else if (!(flags & PFR_FLAG_DUMMY) &&
!(p->pfrkt_flags & PFR_TFLAG_ACTIVE)) {
-   p->pfrkt_nflags = (p->pfrkt_flags &
-   ~PFR_TFLAG_USRMASK) | PFR_TFLAG_ACTIVE;
+   p->pfrkt_nflags =
+   (p->pfrkt_flags & ~PFR_TFLAG_USRMASK) |
+   (n->pfrkt_flags & PFR_TFLAG_USRMASK) |
+   PFR_TFLAG_ACTIVE;
SLIST_INSERT_HEAD(, p, pfrkt_workq);
}
}
diff --git a/regress/sys/net/pf_table/Makefile b/regress/sys/net/pf_table/Makefile
index a71f0190c73..8911e8a1d35 100644
--- a/regress/sys/net/pf_table/Makefile
+++ b/regress/sys/net/pf_table/Makefile
@@ -1,15 +1,26 @@
 #  $OpenBSD: Makefile,v 1.3 2017/07/07 23:15:27 bluhm Exp $
 
-REGRESS_TARGETS=   hit miss cleanup
-CLEANFILES=stamp-*
+REGRESS_TARGETS=   hit miss cleanup flags
+CLEANFILES=stamp-* \
+   pf-reftab.conf  \
+   pf-instance.conf\
+   table-ref.conf  \
+   table-pgone.out \
+   table-persist.out   \
+   table-ref.out   \
+   table-refgone.out
+
 
 stamp-setup:
+   ${SUDO} pfctl -a regress/ttest -Fa
${SUDO} pfctl -qt __regress_tbl -T add -f ${.CURDIR}/table.in
date >$@
 
 cleanup:
rm -f stamp-setup
${SUDO} pfctl -qt __regress_tbl -T kill
+   ${SUDO} pfctl -q -a regress/ttest -Fr
+   ${SUDO} pfctl -q -a regress/ttest -qt instance -T kill
 
 hit: stamp-setup
for i in `cat ${.CURDIR}/table.hit`; do \
@@ -27,6 +38,77 @@ miss: stamp-setup
done; \
exit 0
 
-.PHONY: hit 
