Re: the very first step towards MULTIPROCESSOR friendly PF
Hello, please do not use the patch I've sent last Friday. It's wrong. It might work, but it's a dead end in evolution. The thing is I misunderstood (read: ignored) all the constraints behind packets and interrupts in OpenBSD kernel. see mikeb's email below. please do not wast your cycles with the stuff from last Friday. regards sasha cut here to get mike's email... 8<---8<-8< Hi, We will certainly discuss it next week, but I think this is a wrong direction. rw-locks don't mix with interrupts at all, grabbing locks for every packet should also be avoided if possible. You have mentioned deferred rule gc that you've implemented. perhaps this is more important to do right now. Get rid of as much memory operations during packet processing as possible. This is something that can be committed early. Cheers, Mike
Re: the very first step towards MULTIPROCESSOR friendly PF
Hello, after reading emails from Philip Guenther and Mark Kettenis, doing some RTFM on locking in OpenBSD kernel I have a new patch. I call it as a smp-step-0. Patch introduces a KERNEL_LOCK() to PF. Many dances with KERNEL_LOCK() happens in pf_test(). From future work point of view there are distinct parts as follows in pf_test(): - packet sanitization with reassembly. We grab and drop KERNEL_LOCK() pf_noralize_ip*() functions around pf_reassemble*(). Fragment cache will get its own rw-lock later. - as soon as packet is sanitized (or should say normalized), we do state check. State check will get its own rw-lock (I call it smp-step-1) - if no state is found PF proceeds to rule check, which will get its rw-lock too (smp-step-2). Basically KERNEL_LOCK() is placed everywhere, where individual rw-locks will be introduced later. As Philip Guenther has pointed out earlier in email, the user-threads must grab KERNEL_LOCK() before they will jump to rw-locks. The game is easy for PF in ioctl() subsystem, since everything there will be user thread. For packets (pf_test()) function the game is not that clear, since not all packets will be running in user-thread context. My assumption is the packets bound to loopback and local outbound packets are running in user-threads. All other packets (inbound and forwarded) are running in interrupt-threads. For this reason I'd like to introduce the extra code to pf_test(), which decides whether packet requires KERNEL_LOCK(), because it is running as a user thread. I'd like to get the patch in, before proceeding to next SMP step. thanks and regards sasha 8<---8<---8<--8< Index: net/pf.c === RCS file: /cvs/src/sys/net/pf.c,v retrieving revision 1.937 diff -u -p -r1.937 pf.c --- net/pf.c1 Sep 2015 19:12:25 - 1.937 +++ net/pf.c4 Sep 2015 12:52:55 - @@ -6316,6 +6316,7 @@ pf_test(sa_family_t af, int fwdir, struc union pf_headers pdhdrs; int dir = (fwdir == PF_FWD) ? PF_OUT : fwdir; u_int32_tqid, pqid = 0; + int have_lock = 0; if (!pf_status.running) return (PF_PASS); @@ -6349,6 +6350,26 @@ pf_test(sa_family_t af, int fwdir, struc return (PF_PASS); } + /* +* Unlike kernel-threads, the user-threads require to have +* KENREL_LOCK() before they will touch rw-locks. +* +* The assumption here is fairly simple: +* - all inbound packets run as interrupt (kernel-thread), +*unless they are bound to loopback +* +* - outbound packets run as kernel-thread, if they +*are marked by PF_TAG_KERNEL flag. +* +* in all other cases we must assume packet runs in user-thread +* context and we should get KERNEL_LOCK(). +*/ + if ((dir == PF_IN) && (ifp->if_type != IFT_LOOP)) { + (*m0)->m_pkthdr.pf.flags |= PF_TAG_KERNEL; + } else if (((*m0)->m_pkthdr.pf.flags & PF_TAG_KERNEL) == 0) { + PF_LOCK(have_lock); + } + action = pf_setup_pdesc(&pd, &pdhdrs, af, dir, kif, *m0, &reason); if (action != PF_PASS) { #if NPFLOG > 0 @@ -6399,6 +6420,7 @@ pf_test(sa_family_t af, int fwdir, struc * handle fragments that aren't reassembled by * normalization */ + PF_LOCK(have_lock); action = pf_test_rule(&pd, &r, &s, &a, &ruleset); if (action != PF_PASS) REASON_SET(&reason, PFRES_FRAG); @@ -6413,6 +6435,7 @@ pf_test(sa_family_t af, int fwdir, struc "dropping IPv6 packet with ICMPv4 payload"); goto done; } + PF_LOCK(have_lock); action = pf_test_state_icmp(&pd, &s, &reason); if (action == PF_PASS || action == PF_AFRT) { #if NPFSYNC > 0 @@ -6437,6 +6460,7 @@ pf_test(sa_family_t af, int fwdir, struc "dropping IPv4 packet with ICMPv6 payload"); goto done; } + PF_LOCK(have_lock); action = pf_test_state_icmp(&pd, &s, &reason); if (action == PF_PASS || action == PF_AFRT) { #if NPFSYNC > 0 @@ -6461,6 +6485,7 @@ pf_test(sa_family_t af, int fwdir, struc if (action == PF_DROP) goto done; } + PF_LOCK(have_lock); action = pf_test_state(&pd, &s, &reason); if (action == PF_PASS || action == PF_AFRT) { #if NPFSYNC > 0 @@ -6658,6 +6683,8 @@ done: else if (!s->if_index_out && dir == PF_OUT)
Re: the very first step towards MULTIPROCESSOR friendly PF
On 27/08/15(Thu) 14:19, Alexandr Nedvedicky wrote: > > > > > Assuming the locking in MULTIPROCESSOR goes like: > > >interrupt grabs splsoftnet -> ip_input -> PF grabs KERNEL_LOCK() > > > We need to take care of ioctl() call path and purge thread. Those need to > > > get synchronize with packets using KERNEL_LOCK(). They should not to mess > > > with > > > splsoftnet() any more in MULTIPROCESSOR version. > > > > As long as we are using (soft-)interrupts to process incoming packets > > they should. > > > > If you don't raise the SPL level to softnet while CPU0 is executing some > > code in ioctl() path and it takes an interrupt at this point it might end > > up executing ip_input() *before* returning to the ioctl() (process) > > context. This scenario might corrupt the states you're trying to protect. > > > > just to verify I got everything right here. I assume I've got MULTIPROCESSOR > kernel with my broken patch, which basically trades or splfotnet() locks for > KERNEL_LOCKS()... > > We have ioctl() operation, which attempts to insert a rule into a global > ruleset, ioctl() grabs KERNEL_LOCK() only and gets to work. (yes my broken > patch does not grab splsoftnet()). > > while ioctl() is working there is interrupt, which delivers packet from NIC. > since no one holds splsoftnet() on CPU packet is free to go and enters > firewall (pf_test()). And that's very bad, interrupt tries to grab > KERNEL_LOCK(), which is still held by ioctl() packet has just interrupted. One > of the two bad things will happen: > > a) deadlock, we will be waiting for KERNEL_LOCK(), which is still held > by interrupted ioctl() operation. > > b) since KERNEL_LOCK() is recursive, the pf_test() will be free to > go possibly finding a list of rules in inconsistent state. In this > case no deadlock happens, but PF plays game with pointers... > > as Kettenis pointed out the KERNEL_LOCK() is recursive, I have an itchy > feeling > we are taking b) option. Is that right? Yep :)
Re: the very first step towards MULTIPROCESSOR friendly PF
> > Assuming the locking in MULTIPROCESSOR goes like: > > interrupt grabs splsoftnet -> ip_input -> PF grabs KERNEL_LOCK() > > We need to take care of ioctl() call path and purge thread. Those need to > > get synchronize with packets using KERNEL_LOCK(). They should not to mess > > with > > splsoftnet() any more in MULTIPROCESSOR version. > > As long as we are using (soft-)interrupts to process incoming packets > they should. > > If you don't raise the SPL level to softnet while CPU0 is executing some > code in ioctl() path and it takes an interrupt at this point it might end > up executing ip_input() *before* returning to the ioctl() (process) > context. This scenario might corrupt the states you're trying to protect. > just to verify I got everything right here. I assume I've got MULTIPROCESSOR kernel with my broken patch, which basically trades or splfotnet() locks for KERNEL_LOCKS()... We have ioctl() operation, which attempts to insert a rule into a global ruleset, ioctl() grabs KERNEL_LOCK() only and gets to work. (yes my broken patch does not grab splsoftnet()). while ioctl() is working there is interrupt, which delivers packet from NIC. since no one holds splsoftnet() on CPU packet is free to go and enters firewall (pf_test()). And that's very bad, interrupt tries to grab KERNEL_LOCK(), which is still held by ioctl() packet has just interrupted. One of the two bad things will happen: a) deadlock, we will be waiting for KERNEL_LOCK(), which is still held by interrupted ioctl() operation. b) since KERNEL_LOCK() is recursive, the pf_test() will be free to go possibly finding a list of rules in inconsistent state. In this case no deadlock happens, but PF plays game with pointers... as Kettenis pointed out the KERNEL_LOCK() is recursive, I have an itchy feeling we are taking b) option. Is that right? > > Does that idea sound right? If not what piece I'm still missing in puzzle? > > I think you're assuming that the "softnet" code path is running in a > thread which is not (yet) the case. yes, my context is not fully switched from Solaris to OpenBSD yet... thanks and regards sasha
Re: the very first step towards MULTIPROCESSOR friendly PF
On 27/08/15(Thu) 11:10, Alexandr Nedvedicky wrote: > On Wed, Aug 26, 2015 at 06:12:10PM +0200, Mark Kettenis wrote: > > > Date: Wed, 26 Aug 2015 17:30:14 +0200 > > > From: Alexandr Nedvedicky > > > > > > Hello, > > > > > > I'm not sure I got everything right in Calgary. So this patch should > > > roughly illustrates how I think we should start moving forward to > > > make PF MULTIPROCESSOR friendly. It's quite likely my proposal/way > > > is completely off, I'll be happy if you put me back to ground. > > > > > > The brief summary of what patch is trying to achieve is as follows: > > > > > > patch trades all splsoftnet() for KERNEL_LOCK() when it gets compiled > > > with MULTIPROCESSOR option on. > > > > > > if MULTIPROCESSOR option is off, the compiler produces PF, which uses > > > splsoftnet. > > > > > > To achieve this the patch introduces macros PF_LOCK()/PF_UNLOCK(), > > > which expand to KERNEL_LOCK()/KERNEL_UNLOCK(), when MULTIPROCESSOR is > > > on. > > > On the other hand if MULTIPROCESSOR is off the PF_*LOCK() macros become > > > splsoftnet()/splx() > > > > I don't think this will work. Even on MULTIPROCESSOR kernels you'll > > need to raise the spl to prevent soft interrupts from running on the > > same CPU. KERNEL_LOCK() will not prevent this from happening as it is > > a recursive lock. This is why OpenBSD's mutexes (spinning locks) > > raise the spl. > > > > So I think you'll have to define PF_LOCK()/PF_UNLOCK() to do the spl > > stuff even for MULTIPROCESSOR kernels. > > Thanks for putting my feet to ground. I'm trying to make some sense from that. > So if I understand splsotfnet() right in MULTIPROCESSOR kernel, it prevents > CPU from handling another network interrupt until the first one is > handled. So that splsoftnet should be raised by network driver, before > packet reaches PF (pf_test() function). pf_test() is not always called in interrupt context. When it is called from a process context you want to make sure the CPU wont try to execute a softnet interrupt that might corrupt the states it is currently messing with. > the pf_test() must protect consistency of firewall data, which are shared > between CPUs (?can we say kernel threads?). So for the first phase, we opt for > KERNEL_LOCK() being grabbed by pf_test() as soon as it gets called. It makes > PF > to still process single packet at time. All other CPUs will have to wait on > KERNEL_LOCK(), holding their splsoftnets(). Do I still follow the concept, or > am I wrong again? You're correct. We cannot say kernel threads right now because incoming packets are *still* processed in (soft-)interrupt context. > Assuming the locking in MULTIPROCESSOR goes like: >interrupt grabs splsoftnet -> ip_input -> PF grabs KERNEL_LOCK() > We need to take care of ioctl() call path and purge thread. Those need to > get synchronize with packets using KERNEL_LOCK(). They should not to mess with > splsoftnet() any more in MULTIPROCESSOR version. As long as we are using (soft-)interrupts to process incoming packets they should. If you don't raise the SPL level to softnet while CPU0 is executing some code in ioctl() path and it takes an interrupt at this point it might end up executing ip_input() *before* returning to the ioctl() (process) context. This scenario might corrupt the states you're trying to protect. > Does that idea sound right? If not what piece I'm still missing in puzzle? I think you're assuming that the "softnet" code path is running in a thread which is not (yet) the case. Martin
Re: the very first step towards MULTIPROCESSOR friendly PF
On Wed, Aug 26, 2015 at 06:12:10PM +0200, Mark Kettenis wrote: > > Date: Wed, 26 Aug 2015 17:30:14 +0200 > > From: Alexandr Nedvedicky > > > > Hello, > > > > I'm not sure I got everything right in Calgary. So this patch should > > roughly illustrates how I think we should start moving forward to > > make PF MULTIPROCESSOR friendly. It's quite likely my proposal/way > > is completely off, I'll be happy if you put me back to ground. > > > > The brief summary of what patch is trying to achieve is as follows: > > > > patch trades all splsoftnet() for KERNEL_LOCK() when it gets compiled > > with MULTIPROCESSOR option on. > > > > if MULTIPROCESSOR option is off, the compiler produces PF, which uses > > splsoftnet. > > > > To achieve this the patch introduces macros PF_LOCK()/PF_UNLOCK(), > > which expand to KERNEL_LOCK()/KERNEL_UNLOCK(), when MULTIPROCESSOR is > > on. > > On the other hand if MULTIPROCESSOR is off the PF_*LOCK() macros become > > splsoftnet()/splx() > > I don't think this will work. Even on MULTIPROCESSOR kernels you'll > need to raise the spl to prevent soft interrupts from running on the > same CPU. KERNEL_LOCK() will not prevent this from happening as it is > a recursive lock. This is why OpenBSD's mutexes (spinning locks) > raise the spl. > > So I think you'll have to define PF_LOCK()/PF_UNLOCK() to do the spl > stuff even for MULTIPROCESSOR kernels. Thanks for putting my feet to ground. I'm trying to make some sense from that. So if I understand splsotfnet() right in MULTIPROCESSOR kernel, it prevents CPU from handling another network interrupt until the first one is handled. So that splsoftnet should be raised by network driver, before packet reaches PF (pf_test() function). the pf_test() must protect consistency of firewall data, which are shared between CPUs (?can we say kernel threads?). So for the first phase, we opt for KERNEL_LOCK() being grabbed by pf_test() as soon as it gets called. It makes PF to still process single packet at time. All other CPUs will have to wait on KERNEL_LOCK(), holding their splsoftnets(). Do I still follow the concept, or am I wrong again? Assuming the locking in MULTIPROCESSOR goes like: interrupt grabs splsoftnet -> ip_input -> PF grabs KERNEL_LOCK() We need to take care of ioctl() call path and purge thread. Those need to get synchronize with packets using KERNEL_LOCK(). They should not to mess with splsoftnet() any more in MULTIPROCESSOR version. Does that idea sound right? If not what piece I'm still missing in puzzle? thanks and regards sasha
Re: the very first step towards MULTIPROCESSOR friendly PF
> Date: Wed, 26 Aug 2015 17:30:14 +0200 > From: Alexandr Nedvedicky > > Hello, > > I'm not sure I got everything right in Calgary. So this patch should > roughly illustrates how I think we should start moving forward to > make PF MULTIPROCESSOR friendly. It's quite likely my proposal/way > is completely off, I'll be happy if you put me back to ground. > > The brief summary of what patch is trying to achieve is as follows: > > patch trades all splsoftnet() for KERNEL_LOCK() when it gets compiled > with MULTIPROCESSOR option on. > > if MULTIPROCESSOR option is off, the compiler produces PF, which uses > splsoftnet. > > To achieve this the patch introduces macros PF_LOCK()/PF_UNLOCK(), > which expand to KERNEL_LOCK()/KERNEL_UNLOCK(), when MULTIPROCESSOR is > on. > On the other hand if MULTIPROCESSOR is off the PF_*LOCK() macros become > splsoftnet()/splx() I don't think this will work. Even on MULTIPROCESSOR kernels you'll need to raise the spl to prevent soft interrupts from running on the same CPU. KERNEL_LOCK() will not prevent this from happening as it is a recursive lock. This is why OpenBSD's mutexes (spinning locks) raise the spl. So I think you'll have to define PF_LOCK()/PF_UNLOCK() to do the spl stuff even for MULTIPROCESSOR kernels.