On Wed, Apr 21, 2021 at 09:36:11PM +0200, Alexander Bluhm wrote:
> Hi,
>
> For a while we have been running the network stack without the kernel
> lock, but with a network lock.  The latter is an exclusive sleeping
> rwlock.
>
> It is possible to run the forwarding path in parallel on multiple
> cores.  I use ix(4) interfaces which provide one input queue for
> each CPU.  For that we have to start multiple softnet tasks and
> replace the exclusive lock with a shared lock.  This works for IP
> and IPv6 input and forwarding, but not for higher protocols.
>
> So I implemented a queue between IP and the higher layers.  We had
> that before when we were using the netlock for IP and the kernel
> lock for TCP.  Now we have a shared lock for IP and an exclusive
> lock for TCP.  By using a queue, we can upgrade the lock once for
> multiple packets.
>
> As you can see here, forwarding performance doubles from 4.5x10^9
> to 9x10^9.  The left column is current, the right column is with my
> diff.  The other dots at 2x10^9 are with socket splicing, which is
> not affected.
> http://bluhm.genua.de/perform/results/2021-04-21T10%3A50%3A37Z/gnuplot/forward.png
>
> Here are all numbers with various network tests.
> http://bluhm.genua.de/perform/results/2021-04-21T10%3A50%3A37Z/perform.html
> TCP performance gets less deterministic due to the additional queue.
>
> The kernel stack flame graph looks like this.  The machine uses 4 CPUs.
> http://bluhm.genua.de/files/kstack-multiqueue-forward.svg
>
> Note the kernel lock around nd6_resolve().  I had to put it there
> as I have seen an MP-related crash there.  This can be fixed
> independently of this diff.
>
> We need more MP pressure to find such bugs and races.  I think now
> is a good time to give this diff broader testing and commit it.
> You need interfaces with multiple queues to see a difference.
>
> ok?
>
> bluhm
Hi. Did you test your diff with ipsec(4) enabled? I'm asking because
we have this in net/pfkeyv2.c:

1108	pfkeyv2_send(struct socket *so, void *message, int len)
1109	{
....
2013			ipsec_in_use++;
2014			/*
2015			 * XXXSMP IPsec data structures are not ready to be
2016			 * accessed by multiple Network threads in parallel,
2017			 * so force all packets to be processed by the first
2018			 * one.
2019			 */
2020			extern int nettaskqs;
2021			nettaskqs = 1;