Re: [PATCH] softirq: let ksoftirqd do its job
On Fri, Sep 23, 2016 at 06:51:04PM +0200, Jesper Dangaard Brouer wrote:
> This is your git tree, right:
> https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/
>
> Doesn't look like you pushed it yet, or do I need to look at a specific branch?

I mainly work from a local quilt queue which I feed to mingo. I occasionally push out to get build-bot coverage or to have people look at bits I poked together.

That said, I'll try and do a push later tonight. Do note, however, that the git tree is a complete wipe and rebuild; don't expect any kind of continuity from it.
Re: [PATCH] softirq: let ksoftirqd do its job
On Fri, 23 Sep 2016 13:53:33 +0200 Peter Zijlstra wrote:
> On Fri, Sep 23, 2016 at 01:35:59PM +0200, Daniel Borkmann wrote:
> > On 09/02/2016 08:39 AM, David Miller wrote:
> > >
> > > I'm just kind of assuming this won't go through my tree, but I can take it if that's what everyone agrees to.
> >
> > Was this actually picked up somewhere in the meantime?
>
> I can queue it for tip. In fact, I've just done so to avoid losing it.
> If anybody else wants it, holler.

Good that you are picking this up! It is a very important fix, at least for networking.

This is your git tree, right:
https://git.kernel.org/cgit/linux/kernel/git/peterz/queue.git/

Doesn't look like you pushed it yet, or do I need to look at a specific branch?

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH] softirq: let ksoftirqd do its job
On Fri, Sep 23, 2016 at 01:35:59PM +0200, Daniel Borkmann wrote:
> On 09/02/2016 08:39 AM, David Miller wrote:
> >
> > I'm just kind of assuming this won't go through my tree, but I can take it if that's what everyone agrees to.
>
> Was this actually picked up somewhere in the meantime?

I can queue it for tip. In fact, I've just done so to avoid losing it. If anybody else wants it, holler.
Re: [PATCH] softirq: let ksoftirqd do its job
On 09/02/2016 08:39 AM, David Miller wrote:
> From: Eric Dumazet
> Date: Wed, 31 Aug 2016 10:42:29 -0700
>
>> From: Eric Dumazet
>>
>> A while back, Paolo and Hannes sent an RFC patch adding threaded NAPI poll loop support: (https://patchwork.ozlabs.org/patch/620657/)
>>
>> The problem seems to be that softirqs are very aggressive and are often handled by the current process, even if we are under stress and ksoftirqd was scheduled precisely so that innocent threads would have more chance to make progress.
>>
>> This patch makes sure that if ksoftirqd is running, we let it perform the softirq work.
>>
>> Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/
>>
>> Tested:
>>
>>  - NIC receiving traffic handled by CPU 0
>>  - UDP receiver running on CPU 0, using a single UDP socket.
>>  - Incoming flood of UDP packets targeting the UDP socket.
>>
>> Before the patch, the UDP receiver could almost never get CPU cycles and could only receive ~2,000 packets per second.
>>
>> After the patch, CPU cycles are split 50/50 between the user application and ksoftirqd/0, and we can effectively read ~900,000 packets per second, a huge improvement in a DoS situation. (Note that more packets are now dropped by the NIC itself, since the BH handlers get fewer CPU cycles to drain the RX ring buffer.)
>>
>> Since the load runs in well-identified thread context, an admin can more easily tune process scheduling parameters if needed.
>>
>> Reported-by: Paolo Abeni
>> Reported-by: Hannes Frederic Sowa
>> Signed-off-by: Eric Dumazet
>
> I'm just kind of assuming this won't go through my tree, but I can take it if that's what everyone agrees to.

Was this actually picked up somewhere in the meantime?
Re: [PATCH] softirq: let ksoftirqd do its job
On Thu, 1 Sep 2016 17:28:02 +0200 Peter Zijlstra wrote:
> On Thu, Sep 01, 2016 at 03:30:42PM +0200, Jesper Dangaard Brouer wrote:
> > Still... enabled!
> > Hmmm.. any more ideas how to disable this???
>
> I think you ought to be able to assign yourself to the root cgroup, something like:
>
>   echo $$ > /cgroup/tasks
>
> or wherever the cpu-cgroup controller is mounted.
>
> But it's been a fair while since I touched any of that; it's not a CONFIG I have enabled much.

I could not figure out how to disable autogroups, so I ended up compiling the kernel without CONFIG_SCHED_AUTOGROUP.

  PID PR S %CPU   TIME+ COMMAND
    3 20 R 20.7 0:53.05 ksoftirqd/0
 9299 20 R 16.3 0:03.62 udp_sink
 9296 20 S 16.0 0:03.59 udp_sink
 9297 20 R 16.0 0:03.58 udp_sink
 9298 20 R 16.0 0:03.57 udp_sink
 9295 20 R 15.3 0:03.43 udp_sink

top now shows a more correct CPU distribution, so we can conclude that the artifact I saw was indeed caused by autogroup.

I can also confirm that my netperf UDP_STREAM tests now work again, but I need around 32 parallel netperf instances to counter the effectiveness of the ksoftirqd process, while I only need 5 udp_sink programs.
Re: [PATCH] softirq: let ksoftirqd do its job
From: Eric Dumazet
Date: Wed, 31 Aug 2016 10:42:29 -0700

> From: Eric Dumazet
>
> A while back, Paolo and Hannes sent an RFC patch adding threaded NAPI poll loop support: (https://patchwork.ozlabs.org/patch/620657/)
>
> The problem seems to be that softirqs are very aggressive and are often handled by the current process, even if we are under stress and ksoftirqd was scheduled precisely so that innocent threads would have more chance to make progress.
>
> This patch makes sure that if ksoftirqd is running, we let it perform the softirq work.
>
> Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/
>
> Tested:
>
>  - NIC receiving traffic handled by CPU 0
>  - UDP receiver running on CPU 0, using a single UDP socket.
>  - Incoming flood of UDP packets targeting the UDP socket.
>
> Before the patch, the UDP receiver could almost never get CPU cycles and could only receive ~2,000 packets per second.
>
> After the patch, CPU cycles are split 50/50 between the user application and ksoftirqd/0, and we can effectively read ~900,000 packets per second, a huge improvement in a DoS situation. (Note that more packets are now dropped by the NIC itself, since the BH handlers get fewer CPU cycles to drain the RX ring buffer.)
>
> Since the load runs in well-identified thread context, an admin can more easily tune process scheduling parameters if needed.
>
> Reported-by: Paolo Abeni
> Reported-by: Hannes Frederic Sowa
> Signed-off-by: Eric Dumazet

I'm just kind of assuming this won't go through my tree, but I can take it if that's what everyone agrees to.
Re: [PATCH] softirq: let ksoftirqd do its job
On Thu, Sep 01, 2016 at 03:30:42PM +0200, Jesper Dangaard Brouer wrote:
> Still... enabled!
> Hmmm.. any more ideas how to disable this???

I think you ought to be able to assign yourself to the root cgroup, something like:

  echo $$ > /cgroup/tasks

or wherever the cpu-cgroup controller is mounted.

But it's been a fair while since I touched any of that; it's not a CONFIG I have enabled much.
Re: [PATCH] softirq: let ksoftirqd do its job
On 01.09.2016 14:57, Eric Dumazet wrote:
> On Thu, 2016-09-01 at 14:38 +0200, Jesper Dangaard Brouer wrote:
>
>> Correction, on the server-under-test, I'm actually running RHEL7.2
>>
>>> How do I verify/check if I have enabled a cpu-cgroup?
>>
>> Hannes says I can look in "/proc/self/cgroup"
>>
>>  $ cat /proc/self/cgroup
>>  7:net_cls:/
>>  6:blkio:/
>>  5:devices:/
>>  4:perf_event:/
>>  3:cpu,cpuacct:/
>>  2:cpuset:/
>>  1:name=systemd:/user.slice/user-1000.slice/session-c1.scope
>>
>> And that "/" indicates I've not enabled cgroups, right?
>
> In my experience, I found that times displayed by top are often off for softirq processing.
>
> Before applying my patch, top showed a very small amount of CPU time for udp_rcv and ksoftirqd/0, while obviously CPU 0 was completely busy.
>
> Make sure to try the latest Linus tree, as I did yesterday, because apparently things are better than a few weeks back.
>
> BTW, even 'perf top' sometimes has problems showing me cycles spent in softirq. I need to make sure the CPU processing NIC interrupts also spends cycles in some user space program to get meaningful results.

I think that ksoftirqd time is actually accounted to system:

excerpt from irqtime_account_process_tick in kernel/sched/cputime.c

        if (this_cpu_ksoftirqd() == p) {
                /*
                 * ksoftirqd time do not get accounted in cpu_softirq_time.
                 * So, we have to handle it separately here.
                 * Also, p->stime needs to be updated for ksoftirqd.
                 */
                __account_system_time(p, cputime, scaled, CPUTIME_SOFTIRQ);
        } else if (user_tick) {
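To make Hannes' reading of the excerpt concrete, here is a simplified userspace model of that branch order: a tick taken while ksoftirqd is the current task is charged both to the task's own system time (which is what top shows as ksoftirqd/N's %CPU) and to a CPU-wide softirq bucket. The structures and function names are hypothetical, and the real function distinguishes more cases (for example idle and guest time) than this sketch does.

```c
#include <stdbool.h>

/* Simplified CPU-time buckets; the kernel keeps more of them. */
enum bucket { BUCKET_USER, BUCKET_SYSTEM, BUCKET_SOFTIRQ, NR_BUCKETS };

/* Hypothetical per-task counters (not struct task_struct). */
struct task_model {
	unsigned long utime;   /* ticks charged as user time */
	unsigned long stime;   /* ticks charged as system time */
};

/* Hypothetical CPU-wide totals, in the spirit of the cpustat array. */
struct cpustat_model {
	unsigned long ticks[NR_BUCKETS];
};

/*
 * Charge one tick to task p. The ksoftirqd check comes first, as in the
 * excerpt; everything that is neither ksoftirqd nor a user tick is
 * treated as plain system time in this model.
 */
void account_tick_model(struct cpustat_model *cs, struct task_model *p,
			const struct task_model *ksoftirqd, bool user_tick)
{
	if (p == ksoftirqd) {
		p->stime++;                    /* visible in ksoftirqd's %CPU */
		cs->ticks[BUCKET_SOFTIRQ]++;   /* and in the CPU softirq total */
	} else if (user_tick) {
		p->utime++;
		cs->ticks[BUCKET_USER]++;
	} else {
		p->stime++;
		cs->ticks[BUCKET_SYSTEM]++;
	}
}
```

In this model, softirq work done inline while a user thread is current would fall into that thread's user or system time instead, which is one plausible reason per-task numbers in top can look "off" for softirq processing.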
Re: [PATCH] softirq: let ksoftirqd do its job
On Thu, 1 Sep 2016 14:48:39 +0200 Peter Zijlstra wrote:
> On Thu, Sep 01, 2016 at 02:38:59PM +0200, Jesper Dangaard Brouer wrote:
> > On Thu, 1 Sep 2016 14:29:25 +0200
> > Jesper Dangaard Brouer wrote:
> >
> > > On Thu, 1 Sep 2016 13:53:56 +0200
> > > Peter Zijlstra wrote:
> > >
> > > > On Thu, Sep 01, 2016 at 01:02:31PM +0200, Jesper Dangaard Brouer wrote:
> > > > >
> > > > >   PID S %CPU    TIME+ COMMAND
> > > > >     3 R 50.0 29:02.23 ksoftirqd/0
> > > > > 10881 R 10.7  1:01.61 udp_sink
> > > > > 10837 R 10.0  1:05.20 udp_sink
> > > > > 10852 S 10.0  1:01.78 udp_sink
> > > > > 10862 R 10.0  1:05.19 udp_sink
> > > > > 10844 S  9.7  1:01.91 udp_sink
> > > > >
> > > > > This is strange, why is ksoftirqd/0 getting 50% of the CPU time???
> > > >
> > > > Do you run your udp_sink thingy in a cpu-cgroup?
> > >
> > > That was also Paolo's feedback (IRC). I'm not aware of it, but it might be some distribution (Fedora 22) default thing.
> >
> > Correction, on the server-under-test, I'm actually running RHEL7.2
> >
> > > How do I verify/check if I have enabled a cpu-cgroup?
> >
> > Hannes says I can look in "/proc/self/cgroup"
> >
> >  $ cat /proc/self/cgroup
> >  7:net_cls:/
> >  6:blkio:/
> >  5:devices:/
> >  4:perf_event:/
> >  3:cpu,cpuacct:/
> >  2:cpuset:/
> >  1:name=systemd:/user.slice/user-1000.slice/session-c1.scope
> >
> > And that "/" indicates I've not enabled cgroups, right?
>
> Mostly so. I think RHEL/Fedora has SCHED_AUTOGROUP enabled, and you can find that through:
>
>   cat /proc/self/autogroup

 $ cat /proc/self/autogroup
 /autogroup-88 nice 0

> And disable with the noautogroup boot param, or:
>
>   echo 0 > /proc/sys/kernel/sched_autogroup_enabled

Looks like it is enabled on my system:

 $ grep -H . /proc/sys/kernel/sched_autogroup_enabled
 /proc/sys/kernel/sched_autogroup_enabled:1

> although this latter will leave the current state intact while avoiding creation of any further autogroups iirc.

 $ sudo sh -c 'echo 0 > /proc/sys/kernel/sched_autogroup_enabled'
 $ grep -H . /proc/sys/kernel/sched_autogroup_enabled
 /proc/sys/kernel/sched_autogroup_enabled:0
 $ sudo systemctl restart sshd

Starting a new SSH login:

 $ cat /proc/self/autogroup
 /autogroup-153 nice 0

Hmmm, still enabled...

 $ sudo systemctl stop sshd
 $ sudo systemctl start sshd
 $ grep -H . /proc/sys/kernel/sched_autogroup_enabled
 /proc/sys/kernel/sched_autogroup_enabled:0
 $ cat /proc/self/autogroup
 /autogroup-158 nice 0

Still... enabled!
Hmmm.. any more ideas how to disable this???
Re: [PATCH] softirq: let ksoftirqd do its job
On Thu, 2016-09-01 at 15:00 +0200, Hannes Frederic Sowa wrote:
> On 01.09.2016 14:57, Eric Dumazet wrote:
> > On Thu, 2016-09-01 at 14:38 +0200, Jesper Dangaard Brouer wrote:
> >
> >> Correction, on the server-under-test, I'm actually running RHEL7.2
> >>
> >>> How do I verify/check if I have enabled a cpu-cgroup?
> >>
> >> Hannes says I can look in "/proc/self/cgroup"
> >>
> >>  $ cat /proc/self/cgroup
> >>  7:net_cls:/
> >>  6:blkio:/
> >>  5:devices:/
> >>  4:perf_event:/
> >>  3:cpu,cpuacct:/
> >>  2:cpuset:/
> >>  1:name=systemd:/user.slice/user-1000.slice/session-c1.scope
> >>
> >> And that "/" indicates I've not enabled cgroups, right?
> >
> > In my experience, I found that times displayed by top are often off for softirq processing.
> >
> > Before applying my patch, top showed a very small amount of CPU time for udp_rcv and ksoftirqd/0, while obviously CPU 0 was completely busy.
> >
> > Make sure to try the latest Linus tree, as I did yesterday, because apparently things are better than a few weeks back.
> >
> > BTW, even 'perf top' sometimes has problems showing me cycles spent in softirq. I need to make sure the CPU processing NIC interrupts also spends cycles in some user space program to get meaningful results.
>
> I think that ksoftirqd time is actually accounted to system:
>
> excerpt from irqtime_account_process_tick in kernel/sched/cputime.c
>
>         if (this_cpu_ksoftirqd() == p) {
>                 /*
>                  * ksoftirqd time do not get accounted in cpu_softirq_time.
>                  * So, we have to handle it separately here.
>                  * Also, p->stime needs to be updated for ksoftirqd.
>                  */
>                 __account_system_time(p, cputime, scaled, CPUTIME_SOFTIRQ);
>         } else if (user_tick) {
>

Tell me more about kernel/sched/cputime.c stability over recent Linux versions ;)

git log --oneline v4.2.. kernel/sched/cputime.c
03cbc732639ddcad15218c4b2046d255851ff1e3 sched/cputime: Resync steal time when guest & host lose sync
173be9a14f7b2e901cf77c18b1aafd4d672e9d9e sched/cputime: Fix NO_HZ_FULL getrusage() monotonicity regression
26f2c75cd2cf10a6120ef02ca9a94db77cc9c8e0 sched/cputime: Fix omitted ticks passed in parameter
f9bcf1e0e0145323ba2cf72ecad5264ff3883eb1 sched/cputime: Fix steal time accounting
08fd8c17686c6b09fa410a26d516548dd80ff147 Merge tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
553bf6bbfd8a540c70aee28eb50e24caff456a03 sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()
0cfdf9a198b0d4f5ad6c87d894db7830b796b2cc sched/cputime: Clean up the old vtime gen irqtime accounting completely
b58c35840521bb02b150e1d0d34ca9197f8b7145 sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
57430218317e5b280a80582a139b26029c25de6c sched/cputime: Count actually elapsed irq & softirq time
ecb23dc6f2eff0ce64dd60351a81f376f13b12cc xen: add steal_clock support on x86
807e5b80687c06715d62df51a5473b231e3e8b15 sched/cputime: Add steal time support to full dynticks CPU time accounting
f9c904b7613b8b4c85b10cd6b33ad41b2843fa9d sched/cputime: Fix steal_account_process_tick() to always return jiffies
ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
c9bed1cf51011c815d88288b774865d013ca78a8 Merge tag 'for-linus-4.5-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
1fe7c4ef88bd32e039f5f4126537c3f20c340414 missing include asm/paravirt.h in cputime.c
b7ce2277f087fd052e7e1bbf432f7fecbee82bb6 sched/cputime: Convert vtime_seqlock to seqcount
e592539466380279a9e6e6fdfe4545aa54f22593 sched/cputime: Introduce vtime accounting check for readers
55dbdcfa05533f44c9416070b8a9f6432b22314a sched/cputime: Rename vtime_accounting_enabled() to vtime_accounting_cpu_enabled()
cab245d68c38afff1a4c4d018ab7e1d316982f5d sched/cputime: Correctly handle task guest time on housekeepers
7098c1eac75dc03fdbb7249171a6e68ce6044a5a sched/cputime: Clarify vtime symbols and document them
7877a0ba5ec63c7b0111b06c773f1696fa17b35a sched/cputime: Remove extra cost in task_cputime()
2541117b0cf79977fa11a0d6e17d61010677bd7b sched/cputime: Fix invalid gtime in proc
9eec50b8bbe1535c440a1ee88c1958f78fc55957 kvm/x86: Hyper-V HV_X64_MSR_VP_RUNTIME support
9d7fb04276481c59610983362d8e023d262b58ca sched/cputime: Guarantee stime + utime == rtime
Re: [PATCH] softirq: let ksoftirqd do its job
On Thu, 2016-09-01 at 12:38 +0200, Jesper Dangaard Brouer wrote:

> I see a max queue of 47 MBytes, and worse, an average standing queue of 25 MBytes, which is really bad for the latency seen by the application. And having this much outstanding memory is also bad for CPU cache effects, and stresses the memory allocator.
> I'm actually using this huge queue "misconfig" to stress the page allocator and my page_pool implementation into worst-case situations ;-)

Since commit 95766fff6b9a78d11f ("[UDP]: Add memory accounting."), it is dangerous to have a big SO_RCVBUF value, since it adds unexpected recvmsg() latencies:

1) The user thread locks the socket.
2) It gets one skb from the receive queue.
3) An incoming flood of UDP packets is processed by softirq.
4) The socket is found 'owned by the user'.
5) Packets are parked in the 'socket backlog' up to the SO_RCVBUF limit.
6) The user thread releases the socket.
7) It finds many skbs in the backlog and has to process them _all_ and re-inject them into the socket receive queue.
8) Return to user space.

Time spent in 7) can be on the order of millions of CPU cycles...

At least starting from 5413d1babe8f10d ("net: do not block BH while processing socket backlog"), we no longer block BH while doing 7), and we have cond_resched() points.
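Steps 1) to 8) above can be illustrated with a small userspace model: while the user thread owns the socket, softirq processing parks packets in a backlog bounded by SO_RCVBUF, and on release the owner has to move every parked skb before returning. All names here are hypothetical, and charging backlog skbs against the same rcvbuf budget is a simplifying assumption of the model, not a statement about the exact kernel accounting.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a UDP socket; not the kernel's struct sock. */
struct sock_model {
	bool owned_by_user;   /* set while recvmsg() holds the socket lock */
	size_t rcvbuf;        /* SO_RCVBUF limit, in bytes */
	size_t rmem;          /* bytes charged (receive queue + backlog) */
	size_t backlog_len;   /* skbs parked while the socket was owned */
	size_t queue_len;     /* skbs in the receive queue */
	size_t drops;         /* skbs dropped because rcvbuf was exceeded */
};

/* Softirq side, steps 3) to 5): deliver one skb of 'truesize' bytes. */
void softirq_deliver(struct sock_model *sk, size_t truesize)
{
	if (sk->rmem + truesize > sk->rcvbuf) {
		sk->drops++;
		return;
	}
	sk->rmem += truesize;
	if (sk->owned_by_user)
		sk->backlog_len++;    /* step 5): park it in the socket backlog */
	else
		sk->queue_len++;
}

/*
 * User side, steps 6) and 7): release the socket. Every parked skb is
 * re-injected before returning, so the work done here grows with the
 * backlog, i.e. with SO_RCVBUF. Returns the number of skbs processed.
 */
size_t release_sock_model(struct sock_model *sk)
{
	size_t processed = sk->backlog_len;

	sk->queue_len += sk->backlog_len;
	sk->backlog_len = 0;
	sk->owned_by_user = false;
	return processed;
}
```

With a 10-skb budget and 15 skbs arriving while the socket is owned, the model parks 10, drops 5, and the release has to process all 10 before recvmsg() can return; a larger SO_RCVBUF only makes that single release longer.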
Re: [PATCH] softirq: let ksoftirqd do its job
On Thu, 2016-09-01 at 14:38 +0200, Jesper Dangaard Brouer wrote:
> Correction, on the server-under-test, I'm actually running RHEL7.2
>
> > How do I verify/check if I have enabled a cpu-cgroup?
>
> Hannes says I can look in "/proc/self/cgroup"
>
>  $ cat /proc/self/cgroup
>  7:net_cls:/
>  6:blkio:/
>  5:devices:/
>  4:perf_event:/
>  3:cpu,cpuacct:/
>  2:cpuset:/
>  1:name=systemd:/user.slice/user-1000.slice/session-c1.scope
>
> And that "/" indicates I've not enabled cgroups, right?
>

In my experience, I found that times displayed by top are often off for softirq processing.

Before applying my patch, top showed a very small amount of CPU time for udp_rcv and ksoftirqd/0, while obviously CPU 0 was completely busy.

Make sure to try the latest Linus tree, as I did yesterday, because apparently things are better than a few weeks back.

BTW, even 'perf top' sometimes has problems showing me cycles spent in softirq. I need to make sure the CPU processing NIC interrupts also spends cycles in some user space program to get meaningful results.
Re: [PATCH] softirq: let ksoftirqd do its job
On Thu, 2016-09-01 at 14:05 +0200, Hannes Frederic Sowa wrote:
> Would it make sense to include the used socket backlog in the UDP socket lookup compute_score calculation? I just want to throw out the idea; I could actually imagine it also causing bad side effects.

Hopefully we can get rid of the backlog for UDP, by no longer having to lock the socket in the RX path, and by performing memory charging in a better way.

The backlog for TCP is problematic for high-speed flows, and for UDP it is problematic in flood situations, as a single recvmsg() might have to process thousands of skbs before returning to user space.

What you suggest is going to be difficult:

1) Packets of a 5-tuple (e.g. a QUIC flow) won't all land in the same silo, and will cause reordering or application issues.

2) SO_ATTACH_REUSEPORT_CBPF won't have access to the socket(s) backlog to perform the choice.

Thanks.
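Eric's point 1) can be illustrated with a toy comparison of two ways to pick a socket in a reuseport group. Both functions are hypothetical and are not the kernel's selection code; the only assumption carried over is that a hash of the 5-tuple is stable for a flow, while a load-based choice depends on queue state that changes from packet to packet.

```c
#include <stddef.h>
#include <stdint.h>

/* Hash-based choice: a given flow hash always maps to the same socket. */
size_t pick_by_hash(uint32_t flow_hash, size_t nsocks)
{
	return flow_hash % nsocks;
}

/*
 * Load-based choice (the idea under discussion): pick the socket with the
 * shortest backlog. Ties go to the lowest index.
 */
size_t pick_by_backlog(const size_t *backlog, size_t nsocks)
{
	size_t best = 0;

	for (size_t i = 1; i < nsocks; i++)
		if (backlog[i] < backlog[best])
			best = i;
	return best;
}
```

With backlogs {3, 1, 1}, two consecutive packets of the same flow go to socket 1 and then socket 2 under the load-based rule, so a single flow is split across receivers, while the hash-based rule keeps it on one socket.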
Re: [PATCH] softirq: let ksoftirqd do its job
On Thu, Sep 01, 2016 at 02:38:59PM +0200, Jesper Dangaard Brouer wrote:
> On Thu, 1 Sep 2016 14:29:25 +0200
> Jesper Dangaard Brouer wrote:
>
> > On Thu, 1 Sep 2016 13:53:56 +0200
> > Peter Zijlstra wrote:
> >
> > > On Thu, Sep 01, 2016 at 01:02:31PM +0200, Jesper Dangaard Brouer wrote:
> > > >   PID S %CPU    TIME+ COMMAND
> > > >     3 R 50.0 29:02.23 ksoftirqd/0
> > > > 10881 R 10.7  1:01.61 udp_sink
> > > > 10837 R 10.0  1:05.20 udp_sink
> > > > 10852 S 10.0  1:01.78 udp_sink
> > > > 10862 R 10.0  1:05.19 udp_sink
> > > > 10844 S  9.7  1:01.91 udp_sink
> > > >
> > > > This is strange, why is ksoftirqd/0 getting 50% of the CPU time???
> > >
> > > Do you run your udp_sink thingy in a cpu-cgroup?
> >
> > That was also Paolo's feedback (IRC). I'm not aware of it, but it might be some distribution (Fedora 22) default thing.
>
> Correction, on the server-under-test, I'm actually running RHEL7.2
>
> > How do I verify/check if I have enabled a cpu-cgroup?
>
> Hannes says I can look in "/proc/self/cgroup"
>
>  $ cat /proc/self/cgroup
>  7:net_cls:/
>  6:blkio:/
>  5:devices:/
>  4:perf_event:/
>  3:cpu,cpuacct:/
>  2:cpuset:/
>  1:name=systemd:/user.slice/user-1000.slice/session-c1.scope
>
> And that "/" indicates I've not enabled cgroups, right?

Mostly so. I think RHEL/Fedora has SCHED_AUTOGROUP enabled, and you can find that through:

  cat /proc/self/autogroup

And disable with the noautogroup boot param, or:

  echo 0 > /proc/sys/kernel/sched_autogroup_enabled

although this latter will leave the current state intact while avoiding creation of any further autogroups iirc.
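The checks Peter describes boil down to reading two proc files. Below is a small C sketch of those reads; the function names are hypothetical, the line format is the one seen in this thread ("/autogroup-88 nice 0"), and the assumption is that /proc/self/autogroup only exists on kernels built with CONFIG_SCHED_AUTOGROUP. In Jesper's test in this thread, new sessions still reported fresh autogroup ids after the sysctl was set to 0, so it is worth reading both values rather than relying on either one alone.

```c
#include <stdio.h>

/* Parse a /proc/<pid>/autogroup line such as "/autogroup-88 nice 0". */
int parse_autogroup(const char *line, long *id, int *nice)
{
	return sscanf(line, "/autogroup-%ld nice %d", id, nice) == 2;
}

/*
 * Read the calling process's autogroup. Returns 1 on success, 0 if the
 * file cannot be opened or holds no parsable line.
 */
int read_self_autogroup(long *id, int *nice)
{
	char buf[128];
	FILE *f = fopen("/proc/self/autogroup", "r");
	int ok = 0;

	if (!f)
		return 0;
	if (fgets(buf, sizeof(buf), f))
		ok = parse_autogroup(buf, id, nice);
	fclose(f);
	return ok;
}

/* Read an integer sysctl, e.g. /proc/sys/kernel/sched_autogroup_enabled. */
int read_sysctl_int(const char *path, int *val)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return 0;
	ok = fscanf(f, "%d", val) == 1;
	fclose(f);
	return ok;
}
```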
Re: [PATCH] softirq: let ksoftirqd do its job
On Thu, 1 Sep 2016 14:29:25 +0200 Jesper Dangaard Brouer wrote:
> On Thu, 1 Sep 2016 13:53:56 +0200
> Peter Zijlstra wrote:
>
> > On Thu, Sep 01, 2016 at 01:02:31PM +0200, Jesper Dangaard Brouer wrote:
> > >   PID S %CPU    TIME+ COMMAND
> > >     3 R 50.0 29:02.23 ksoftirqd/0
> > > 10881 R 10.7  1:01.61 udp_sink
> > > 10837 R 10.0  1:05.20 udp_sink
> > > 10852 S 10.0  1:01.78 udp_sink
> > > 10862 R 10.0  1:05.19 udp_sink
> > > 10844 S  9.7  1:01.91 udp_sink
> > >
> > > This is strange, why is ksoftirqd/0 getting 50% of the CPU time???
> >
> > Do you run your udp_sink thingy in a cpu-cgroup?
>
> That was also Paolo's feedback (IRC). I'm not aware of it, but it might be some distribution (Fedora 22) default thing.

Correction, on the server-under-test, I'm actually running RHEL7.2

> How do I verify/check if I have enabled a cpu-cgroup?

Hannes says I can look in "/proc/self/cgroup"

 $ cat /proc/self/cgroup
 7:net_cls:/
 6:blkio:/
 5:devices:/
 4:perf_event:/
 3:cpu,cpuacct:/
 2:cpuset:/
 1:name=systemd:/user.slice/user-1000.slice/session-c1.scope

And that "/" indicates I've not enabled cgroups, right?
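Each /proc/self/cgroup line has the form "hierarchy-ID:controller-list:cgroup-path", so the question "am I in a cpu cgroup?" amounts to finding the line whose controller list contains "cpu" and checking whether its path is "/". A minimal C sketch of that lookup follows; the function names are hypothetical. Note, as Peter points out in this thread, that autogroups are a separate mechanism and do not appear in this file.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if the comma-separated list 'ctrls' (length n) contains 'want'. */
static int has_controller(const char *ctrls, size_t n, const char *want)
{
	size_t wlen = strlen(want);
	const char *p = ctrls, *end = ctrls + n;

	while (p < end) {
		const char *comma = memchr(p, ',', (size_t)(end - p));
		size_t len = comma ? (size_t)(comma - p) : (size_t)(end - p);

		if (len == wlen && memcmp(p, want, wlen) == 0)
			return 1;   /* exact match, so "cpuacct" != "cpu" */
		p += len + 1;
	}
	return 0;
}

/*
 * Parse one /proc/<pid>/cgroup line. If its controller list contains
 * 'want', copy the cgroup path (without newline) into out and return 1;
 * otherwise return 0.
 */
int cgroup_path_for(const char *line, const char *want, char *out, size_t outlen)
{
	const char *c1 = strchr(line, ':');
	const char *c2 = c1 ? strchr(c1 + 1, ':') : NULL;
	size_t plen;

	if (!c2 || !has_controller(c1 + 1, (size_t)(c2 - c1 - 1), want))
		return 0;
	plen = strcspn(c2 + 1, "\n");
	if (plen >= outlen)
		return 0;
	memcpy(out, c2 + 1, plen);
	out[plen] = '\0';
	return 1;
}
```

For the listing above, the "3:cpu,cpuacct:/" line yields the path "/", i.e. the root of the cpu hierarchy.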
Re: [PATCH] softirq: let ksoftirqd do its job
On Thu, 1 Sep 2016 13:53:56 +0200 Peter Zijlstra wrote:
> On Thu, Sep 01, 2016 at 01:02:31PM +0200, Jesper Dangaard Brouer wrote:
> >   PID S %CPU    TIME+ COMMAND
> >     3 R 50.0 29:02.23 ksoftirqd/0
> > 10881 R 10.7  1:01.61 udp_sink
> > 10837 R 10.0  1:05.20 udp_sink
> > 10852 S 10.0  1:01.78 udp_sink
> > 10862 R 10.0  1:05.19 udp_sink
> > 10844 S  9.7  1:01.91 udp_sink
> >
> > This is strange, why is ksoftirqd/0 getting 50% of the CPU time???
>
> Do you run your udp_sink thingy in a cpu-cgroup?

That was also Paolo's feedback (IRC). I'm not aware of it, but it might be some distribution (Fedora 22) default thing.

How do I verify/check if I have enabled a cpu-cgroup?
Re: [PATCH] softirq: let ksoftirqd do its job
On 31.08.2016 22:42, Eric Dumazet wrote:
> On Wed, 2016-08-31 at 21:40 +0200, Jesper Dangaard Brouer wrote:
>
>> I can confirm the improvement of approx 900Kpps (no wonder people have been complaining about DoS against UDP/DNS servers).
>>
>> BUT during my extensive testing of this patch, I also think that we have not gotten to the bottom of this. I was expecting to see a higher (collective) PPS number as I add more UDP servers, but I don't.
>>
>> Running many UDP netperfs with the command:
>>  super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N
>
> Are you sure the sender can send fast enough?
>
>>
>> With 'top' I can see ksoftirqd is still getting a higher %CPU time:
>>
>>    PID %CPU   TIME+ COMMAND
>>      3 36.5 2:28.98 ksoftirqd/0
>>  10724  9.6 0:01.05 netserver
>>  10722  9.3 0:01.05 netserver
>>  10723  9.3 0:01.05 netserver
>>  10725  9.3 0:01.05 netserver
>
> Looks much better on my machine, with "udprcv -n 4" (using 4 threads, and 4 sockets using SO_REUSEPORT)

Would it make sense to include the used socket backlog in the UDP socket lookup compute_score calculation? I just want to throw out the idea; I could actually imagine it also causing bad side effects.
Re: [PATCH] softirq: let ksoftirqd do its job
On 31.08.2016 19:42, Eric Dumazet wrote:
> From: Eric Dumazet
>
> A while back, Paolo and Hannes sent an RFC patch adding threaded NAPI poll loop support: (https://patchwork.ozlabs.org/patch/620657/)
>
> The problem seems to be that softirqs are very aggressive and are often handled by the current process, even if we are under stress and ksoftirqd was scheduled precisely so that innocent threads would have more chance to make progress.
>
> This patch makes sure that if ksoftirqd is running, we let it perform the softirq work.
>
> Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/
>
> Tested:
>
>  - NIC receiving traffic handled by CPU 0
>  - UDP receiver running on CPU 0, using a single UDP socket.
>  - Incoming flood of UDP packets targeting the UDP socket.
>
> Before the patch, the UDP receiver could almost never get CPU cycles and could only receive ~2,000 packets per second.
>
> After the patch, CPU cycles are split 50/50 between the user application and ksoftirqd/0, and we can effectively read ~900,000 packets per second, a huge improvement in a DoS situation. (Note that more packets are now dropped by the NIC itself, since the BH handlers get fewer CPU cycles to drain the RX ring buffer.)
>
> Since the load runs in well-identified thread context, an admin can more easily tune process scheduling parameters if needed.
>
> Reported-by: Paolo Abeni
> Reported-by: Hannes Frederic Sowa
> Signed-off-by: Eric Dumazet
> Cc: David Miller
> Cc: Jesper Dangaard Brouer
> Cc: Peter Zijlstra
> Cc: Rik van Riel

Acked-by: Hannes Frederic Sowa

Thanks,
Hannes
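The decision the patch description makes ("if ksoftirqd is running, we let it perform the softirq work") can be modeled in userspace as follows, assuming a CPU's softirq state reduces to a pending mask plus a running flag for its ksoftirqd thread. This is not the kernel code from the patch; every name here is hypothetical and only the shape of the decision is modeled.

```c
#include <stdbool.h>

/* Hypothetical state of this CPU's ksoftirqd thread in the model. */
enum kthread_state { KSOFTIRQD_SLEEPING, KSOFTIRQD_RUNNING };

struct cpu_model {
	unsigned int pending;          /* bitmask of raised softirqs */
	enum kthread_state ksoftirqd;  /* state of this CPU's ksoftirqd */
	unsigned long inline_runs;     /* batches processed on irq exit */
	unsigned long deferred;        /* times work was left to ksoftirqd */
};

/* True if the interrupted context should process pending softirqs itself. */
bool should_run_inline(const struct cpu_model *cpu)
{
	return cpu->pending != 0 && cpu->ksoftirqd != KSOFTIRQD_RUNNING;
}

/* Model of the irq-exit decision described in the patch. */
void irq_exit_model(struct cpu_model *cpu)
{
	if (!cpu->pending)
		return;
	if (should_run_inline(cpu)) {
		cpu->pending = 0;      /* handled in the interrupted context */
		cpu->inline_runs++;
	} else {
		cpu->deferred++;       /* stays pending; ksoftirqd will run it */
	}
}

/* Model of ksoftirqd getting the CPU from the scheduler. */
void ksoftirqd_run_model(struct cpu_model *cpu)
{
	cpu->pending = 0;
	cpu->ksoftirqd = KSOFTIRQD_SLEEPING;   /* nothing left: sleep again */
}
```

The point of deferring is that ksoftirqd is an ordinary schedulable thread, so once it is runnable the scheduler, not the interrupt path, decides how CPU time is split between it and the UDP receiver.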
Re: [PATCH] softirq: let ksoftirqd do its job
On Thu, Sep 01, 2016 at 01:02:31PM +0200, Jesper Dangaard Brouer wrote:
>   PID S %CPU    TIME+ COMMAND
>     3 R 50.0 29:02.23 ksoftirqd/0
> 10881 R 10.7  1:01.61 udp_sink
> 10837 R 10.0  1:05.20 udp_sink
> 10852 S 10.0  1:01.78 udp_sink
> 10862 R 10.0  1:05.19 udp_sink
> 10844 S  9.7  1:01.91 udp_sink
>
> This is strange, why is ksoftirqd/0 getting 50% of the CPU time???

Do you run your udp_sink thingy in a cpu-cgroup?
Re: [PATCH] softirq: let ksoftirqd do its job
On 01.09.2016 13:02, Jesper Dangaard Brouer wrote:
> On Wed, 31 Aug 2016 23:51:16 +0200
> Jesper Dangaard Brouer wrote:
>
>> On Wed, 31 Aug 2016 13:42:30 -0700
>> Eric Dumazet wrote:
>>
>>> On Wed, 2016-08-31 at 21:40 +0200, Jesper Dangaard Brouer wrote:
>>>
>>>> I can confirm the improvement of approx 900Kpps (no wonder people have been complaining about DoS against UDP/DNS servers).
>>>>
>>>> BUT during my extensive testing of this patch, I also think that we have not gotten to the bottom of this. I was expecting to see a higher (collective) PPS number as I add more UDP servers, but I don't.
>>>>
>>>> Running many UDP netperfs with the command:
>>>>  super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N
>>>
>>> Are you sure the sender can send fast enough?
>>
>> Yes, as I can see drops (overrun UDP limit UdpRcvbufErrors). Switching to pktgen and udp_sink to be sure.
>>
>>>> With 'top' I can see ksoftirqd is still getting a higher %CPU time:
>>>>
>>>>    PID %CPU   TIME+ COMMAND
>>>>      3 36.5 2:28.98 ksoftirqd/0
>>>>  10724  9.6 0:01.05 netserver
>>>>  10722  9.3 0:01.05 netserver
>>>>  10723  9.3 0:01.05 netserver
>>>>  10725  9.3 0:01.05 netserver
>>>
>>> Looks much better on my machine, with "udprcv -n 4" (using 4 threads, and 4 sockets using SO_REUSEPORT)
>>>
>>>  10755 root  20  0  34948  4  0 S 79.7  0.0  0:33.66 udprcv
>>>      3 root  20  0      0  0  0 R 19.9  0.0  0:25.49 ksoftirqd/0
>>>
>>> Pressing 'H' in top gives:
>>>
>>>      3 root  20  0      0  0  0 R 19.9  0.0  0:47.84 ksoftirqd/0
>>>  10756 root  20  0  34948  4  0 R 19.9  0.0  0:30.76 udprcv
>>>  10757 root  20  0  34948  4  0 R 19.9  0.0  0:30.76 udprcv
>>>  10758 root  20  0  34948  4  0 S 19.9  0.0  0:30.76 udprcv
>>>  10759 root  20  0  34948  4  0 S 19.9  0.0  0:30.76 udprcv
>>
>> Yes, I'm seeing the same when running 5 instances of my own udp_sink[1]:
>>  sudo taskset -c 0 ./udp_sink --port 10003 --recvmsg --reuse-port --count $((10**10))
>>
>>   PID S %CPU   TIME+ COMMAND
>>     3 R 21.6 2:21.33 ksoftirqd/0
>>  3838 R 15.9 0:02.18 udp_sink
>>  3856 R 15.6 0:02.16 udp_sink
>>  3862 R 15.6 0:02.16 udp_sink
>>  3844 R 15.3 0:02.15 udp_sink
>>  3850 S 15.3 0:02.15 udp_sink
>>
>> This is the expected result: adding more userspace receivers scales up. I needed 5 udp_sinks before I stopped seeing drops; either this says the job performed by ksoftirqd is 5 times faster, or the collective queue size of the programs was large enough to absorb the scheduling jitter.
>
> I need some help from scheduler people explaining this!
>
> In the above run of udp_sink (which had the expected behavior), I ran udp_sink in 5 different xterm/shells. Below, I'm running all 5 udp_sink programs from the same bash shell (just backgrounding them).
>
>   PID S %CPU    TIME+ COMMAND
>     3 R 50.0 29:02.23 ksoftirqd/0
> 10881 R 10.7  1:01.61 udp_sink
> 10837 R 10.0  1:05.20 udp_sink
> 10852 S 10.0  1:01.78 udp_sink
> 10862 R 10.0  1:05.19 udp_sink
> 10844 S  9.7  1:01.91 udp_sink

Could you enable schedstats (sysctl schedstats) and show /proc/ksoftirq*/sched?

Thanks,
Hannes
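The /proc/ksoftirq*/sched in Hannes' request is presumably shorthand: the per-task scheduler statistics live in /proc/<pid>/sched, so one first needs the PID of ksoftirqd/0. A small C sketch that finds a task's PID by scanning /proc/<pid>/comm is below; the function name is hypothetical. Which extra fields appear in the sched file once schedstats is enabled depends on the kernel version.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the PID of the first task whose /proc/<pid>/comm equals 'comm'
 * (for example "ksoftirqd/0"), or -1 if none is found.
 */
long find_pid_by_comm(const char *comm)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;
	long found = -1;

	if (!proc)
		return -1;
	while (found < 0 && (de = readdir(proc)) != NULL) {
		char path[300], buf[64];
		FILE *f;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;           /* only numeric entries are tasks */
		snprintf(path, sizeof(path), "/proc/%s/comm", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;           /* task may have exited meanwhile */
		if (fgets(buf, sizeof(buf), f)) {
			buf[strcspn(buf, "\n")] = '\0';
			if (strcmp(buf, comm) == 0)
				found = strtol(de->d_name, NULL, 10);
		}
		fclose(f);
	}
	closedir(proc);
	return found;
}
```

With the PID in hand, the statistics Hannes asks for would be read from /proc/<that pid>/sched.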
Re: [PATCH] softirq: let ksoftirqd do its job
On Wed, 31 Aug 2016 23:51:16 +0200 Jesper Dangaard Brouer wrote:
> On Wed, 31 Aug 2016 13:42:30 -0700
> Eric Dumazet wrote:
>
> > On Wed, 2016-08-31 at 21:40 +0200, Jesper Dangaard Brouer wrote:
> >
> > > I can confirm the improvement of approx 900Kpps (no wonder people have been complaining about DoS against UDP/DNS servers).
> > >
> > > BUT during my extensive testing of this patch, I also think that we have not gotten to the bottom of this. I was expecting to see a higher (collective) PPS number as I add more UDP servers, but I don't.
> > >
> > > Running many UDP netperfs with the command:
> > >  super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N
> >
> > Are you sure the sender can send fast enough?
>
> Yes, as I can see drops (overrun UDP limit UdpRcvbufErrors). Switching to pktgen and udp_sink to be sure.
>
> > >
> > > With 'top' I can see ksoftirqd is still getting a higher %CPU time:
> > >
> > >    PID %CPU   TIME+ COMMAND
> > >      3 36.5 2:28.98 ksoftirqd/0
> > >  10724  9.6 0:01.05 netserver
> > >  10722  9.3 0:01.05 netserver
> > >  10723  9.3 0:01.05 netserver
> > >  10725  9.3 0:01.05 netserver
> >
> > Looks much better on my machine, with "udprcv -n 4" (using 4 threads, and 4 sockets using SO_REUSEPORT)
> >
> >  10755 root  20  0  34948  4  0 S 79.7  0.0  0:33.66 udprcv
> >      3 root  20  0      0  0  0 R 19.9  0.0  0:25.49 ksoftirqd/0
> >
> > Pressing 'H' in top gives:
> >
> >      3 root  20  0      0  0  0 R 19.9  0.0  0:47.84 ksoftirqd/0
> >  10756 root  20  0  34948  4  0 R 19.9  0.0  0:30.76 udprcv
> >  10757 root  20  0  34948  4  0 R 19.9  0.0  0:30.76 udprcv
> >  10758 root  20  0  34948  4  0 S 19.9  0.0  0:30.76 udprcv
> >  10759 root  20  0  34948  4  0 S 19.9  0.0  0:30.76 udprcv
>
> Yes, I'm seeing the same when running 5 instances of my own udp_sink[1]:
>  sudo taskset -c 0 ./udp_sink --port 10003 --recvmsg --reuse-port --count $((10**10))
>
>   PID S %CPU   TIME+ COMMAND
>     3 R 21.6 2:21.33 ksoftirqd/0
>  3838 R 15.9 0:02.18 udp_sink
>  3856 R 15.6 0:02.16 udp_sink
>  3862 R 15.6 0:02.16 udp_sink
>  3844 R 15.3 0:02.15 udp_sink
>  3850 S 15.3 0:02.15 udp_sink
>
> This is the expected result: adding more userspace receivers scales up. I needed 5 udp_sinks before I stopped seeing drops; either this says the job performed by ksoftirqd is 5 times faster, or the collective queue size of the programs was large enough to absorb the scheduling jitter.

I need some help from scheduler people explaining this!

In the above run of udp_sink (which had the expected behavior), I ran udp_sink in 5 different xterm/shells. Below, I'm running all 5 udp_sink programs from the same bash shell (just backgrounding them).

  PID S %CPU    TIME+ COMMAND
    3 R 50.0 29:02.23 ksoftirqd/0
10881 R 10.7  1:01.61 udp_sink
10837 R 10.0  1:05.20 udp_sink
10852 S 10.0  1:01.78 udp_sink
10862 R 10.0  1:05.19 udp_sink
10844 S  9.7  1:01.91 udp_sink

This is strange, why is ksoftirqd/0 getting 50% of the CPU time???

And I'm no longer getting the full throughput delivered into userspace (as I did before with 5 receivers).

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    1234368            0.0
 IpInDelivers                    1234368            0.0
 UdpInDatagrams                  1133971            0.0
 UdpInErrors                     80332              0.0
 UdpRcvbufErrors                 80332              0.0
 IpExtInOctets                   56792704           0.0
 IpExtInNoECTPkts                1234624            0.0
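Because the nstat invocation above prints deltas over one second, its counters can be read directly as rates. The helper below turns them into two fractions: how much of the incoming IP traffic the receivers actually consumed (UdpInDatagrams), and how much was dropped on full socket buffers (UdpRcvbufErrors). The struct and function names are hypothetical; the counter names are the ones shown in the output.

```c
/* Per-interval nstat deltas used here (names follow the nstat output). */
struct udp_rx_sample {
	unsigned long ip_in_receives;     /* IpInReceives */
	unsigned long udp_in_datagrams;   /* UdpInDatagrams */
	unsigned long udp_rcvbuf_errors;  /* UdpRcvbufErrors */
};

/* Fraction of received IP packets that the UDP receivers consumed. */
double delivered_fraction(const struct udp_rx_sample *s)
{
	return s->ip_in_receives ?
		(double)s->udp_in_datagrams / (double)s->ip_in_receives : 0.0;
}

/* Fraction of received IP packets dropped because a socket buffer was full. */
double rcvbuf_drop_fraction(const struct udp_rx_sample *s)
{
	return s->ip_in_receives ?
		(double)s->udp_rcvbuf_errors / (double)s->ip_in_receives : 0.0;
}
```

For the sample above, about 91.9% of the packets reached the receivers and about 6.5% were dropped at the socket buffer.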
Re: [PATCH] softirq: let ksoftirqd do its job
On Wed, 31 Aug 2016 16:29:56 -0700 Rick Jones wrote:
> On 08/31/2016 04:11 PM, Eric Dumazet wrote:
> > On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote:
> >> With regard to drops, are both of you sure you're using the same socket
> >> buffer sizes?
> >
> > Does it really matter ?
>
> At least at points in the past I have seen different drop counts at the
> SO_RCVBUF based on using (sometimes much) larger sizes.  The hypothesis
> I was operating under at the time was that this dealt with those
> situations where the netserver was held-off from running for "a little
> while" from time to time.  It didn't change things for a sustained
> overload situation though.

Yes, Rick, your hypothesis corresponds to my measurements.  The
userspace program is held off from running for "a little while" from
time to time.  I've measured this with perf sched record/latency.  It is
sort of a natural scheduler characteristic.

The userspace UDP socket program consumes/needs more cycles to perform
its job than the kernel's ksoftirqd does.  Thus the UDP program uses up
its scheduler time slice, and periodically ksoftirqd gets scheduled
multiple times in a row, because the UDP program no longer has any
credits left.

WARNING: Do not increase the socket queue size to paper over this
issue.  It is the WRONG solution, and it will give horrible latency
issues.

With the above warning in mind, I can tell you that, yes, you are also
right that increasing the socket buffer size can be used to
mitigate/hide the packet drops.  You can even increase the socket size
so much that the drop problem "goes away".  The queue simply needs to be
deep enough to absorb the worst/maximum time the UDP program was
scheduled out.  The hidden effect that makes this work (without
contradicting queueing theory) is that it also slows down/costs more
cycles for ksoftirqd/NAPI, as it costs more to enqueue a packet than to
drop it on a full queue.

You can measure the sched "Maximum delay" using:
 sudo perf sched record -C 0 sleep 10
 sudo perf sched latency

On my setup I measured a "Maximum delay" of approx 9 ms.  Given that I
see an incoming packet rate of 2.4Mpps (880Kpps reach the UDP program),
and knowing that the network stack accounts skb->truesize (approx 2048
bytes on this driver) per packet, I can calculate that I need approx
45MBytes of buffer:

 (2.4*10^6 pkt/s) * (9/1000 s) * 2048 bytes = 44.2 MBytes

The PPS measurement comes from:

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    2335926            0.0
 IpInDelivers                    2335925            0.0
 UdpInDatagrams                  880086             0.0
 UdpInErrors                     1455850            0.0
 UdpRcvbufErrors                 1455850            0.0
 IpExtInOctets                   107453056          0.0

Changing the queue size to 50MBytes:
 sysctl -w net/core/rmem_max=$((50*1024*1024)) ;\
 sysctl -w net.core.rmem_default=$((50*1024*1024))

The new result looks "nice", with no drops and 1.42Mpps delivered to
the UDP program, but in reality it is not nice for latency...

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    1425013            0.0
 IpInDelivers                    1425017            0.0
 UdpInDatagrams                  1432139            0.0
 IpExtInOctets                   65539328           0.0
 IpExtInNoECTPkts                1424771            0.0

Tracking of queue size, max, min and average:
 while (true); do netstat -uan | grep '0.0.0.0:9'; sleep 0.3; done | awk 'BEGIN {max=0;min=0x;sum=0;n=0} \
  {if ($2 > max) max=$2; if ($2 < min) min=$2; n++; sum+=$2; printf "%s Recv-Q: %d max: %d min: %d ave: %.3f\n",$1,$2,max,min,sum/n;}';

Result:
 udp Recv-Q: 23624832 max: 47058176 min: 4352 ave: 25092687.698

I see a max queue of 47MBytes and, worse, an average standing queue of
25MBytes, which is really bad for the latency seen by the application.
Having this much outstanding memory is also bad for CPU cache effects,
and it stresses the memory allocator.  I'm actually using this huge
queue "misconfig" to stress the page allocator and my page_pool
implementation into worst-case situations ;-)
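The sizing rule used above (packet rate times worst-case scheduling delay times per-packet truesize) and the Recv-Q tracking loop can be sketched in C. This is an illustration using the numbers quoted in this message, not a recommendation to size buffers this way (the WARNING above still applies). The helper and struct names are hypothetical; the running-statistics helper seeds `min` from the first sample instead of a constant, which is an assumption about what the awk initializer was meant to achieve.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Queue depth (bytes) needed to ride out the longest scheduling delay
 * without drops: packets arriving while the receiver is scheduled out,
 * multiplied by the per-packet memory charge (skb->truesize).
 */
static double required_rcvbuf_bytes(double pps, double max_delay_ms,
                                    double truesize)
{
    return pps * (max_delay_ms / 1000.0) * truesize;
}

/* Running max/min/average of Recv-Q samples (hypothetical helper). */
struct recvq_stats {
    long long max, min;
    double sum;
    long n;
};

static void recvq_add(struct recvq_stats *s, long long sample)
{
    if (s->n == 0 || sample > s->max)
        s->max = sample;
    /* Seed min from the first sample rather than a magic constant. */
    if (s->n == 0 || sample < s->min)
        s->min = sample;
    s->sum += (double)sample;
    s->n++;
}

static double recvq_avg(const struct recvq_stats *s)
{
    return s->n ? s->sum / (double)s->n : 0.0;
}
```

With the figures from this message, `required_rcvbuf_bytes(2.4e6, 9.0, 2048.0)` evaluates to about 44.2e6 bytes, matching the hand calculation.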
Re: [PATCH] softirq: let ksoftirqd do its job
On 08/31/2016 04:11 PM, Eric Dumazet wrote:
> On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote:
>> With regard to drops, are both of you sure you're using the same socket
>> buffer sizes?
>
> Does it really matter ?

At least at points in the past I have seen different drop counts at the
SO_RCVBUF based on using (sometimes much) larger sizes.  The hypothesis
I was operating under at the time was that this dealt with those
situations where the netserver was held-off from running for "a little
while" from time to time.  It didn't change things for a sustained
overload situation though.

>> In the meantime, is anything interesting happening with TCP_RR or
>> TCP_STREAM?
>
> TCP_RR is driven by the network latency, we do not drop packets in the
> socket itself.

I've been of the opinion it (single stream) is driven by path length.
Sometimes by NIC latency.  But then I'm almost always measuring in the
LAN rather than across the WAN.

happy benchmarking,

rick
Re: [PATCH] softirq: let ksoftirqd do its job
On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote:
> With regard to drops, are both of you sure you're using the same socket
> buffer sizes?

Does it really matter ?

I used the standard /proc/sys/net/core/rmem_default, but under flood
the receive queue is almost always full, even if you make it bigger.

By varying its size, you only make batches bigger, and the number of
context switches should be lower if only two threads are competing for
the cpu.

The exact 'optimal' size would depend on various factors, depending on
application and platform constraints.

>
> In the meantime, is anything interesting happening with TCP_RR or
> TCP_STREAM?

TCP_RR is driven by the network latency, we do not drop packets in the
socket itself.

TCP_STREAM is normally paced by the ability of the receiver to send ACK
packets.  TCP has this auto-regulating mode, unless the sender violates
the RFC(s).

If your question is: what happens if thousands of threads on the host
want the cpu, and ksoftirqd does not get enough cycles by virtue of
being a normal thread?

Then you are back to typical provisioning problems, and normally people
play with priorities and containers/cgroups, and/or various techniques
like RPS/RFS.

(You can change ksoftirqd priority if you like)
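Rick's question about matching socket buffer sizes can be answered per socket by reading the value back. Here is a minimal sketch that assumes the Linux semantics documented in socket(7): a SO_RCVBUF request is capped at net.core.rmem_max, and the kernel doubles the stored value to allow for bookkeeping overhead, so getsockopt() reports what the socket really got rather than what was asked for. The function name is hypothetical.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a UDP socket, request a receive buffer of `requested` bytes and
 * store the size the kernel actually granted in *granted.
 * Returns the socket fd (caller closes it) or -1 on error.
 */
static int udp_socket_with_rcvbuf(int requested, int *granted)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    socklen_t len = sizeof(*granted);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested,
                   sizeof(requested)) < 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, granted, &len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Comparing `granted` on both test machines is one way to confirm that two setups run with the same effective queue limit, independent of what rmem_default was set to.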
Re: [PATCH] softirq: let ksoftirqd do its job
With regard to drops, are both of you sure you're using the same socket
buffer sizes?

In the meantime, is anything interesting happening with TCP_RR or
TCP_STREAM?

happy benchmarking,

rick jones
Re: [PATCH] softirq: let ksoftirqd do its job
On Wed, 2016-08-31 at 23:51 +0200, Jesper Dangaard Brouer wrote:
>
> The result from this run was handling 1,517,248 pps, without any
> drops, all processes pinned to the same CPU.
>
>  $ nstat > /dev/null && sleep 1 && nstat
>  #kernel
>  IpInReceives                    1517225            0.0
>  IpInDelivers                    1517224            0.0
>  UdpInDatagrams                  1517248            0.0
>  IpExtInOctets                   69793408           0.0
>  IpExtInNoECTPkts                1517246            0.0
>
> I'm acking this patch:
>
> Acked-by: Jesper Dangaard Brouer
>

Thanks a lot for bringing the issue back to me again, and for all your
tests !
Re: [PATCH] softirq: let ksoftirqd do its job
On Wed, 31 Aug 2016 13:42:30 -0700 Eric Dumazet wrote:
> On Wed, 2016-08-31 at 21:40 +0200, Jesper Dangaard Brouer wrote:
>
> > I can confirm the improvement of approx 900Kpps (no wonder people have
> > been complaining about DoS against UDP/DNS servers).
> >
> > BUT during my extensive testing of this patch, I also think that we
> > have not gotten to the bottom of this. I was expecting to see a higher
> > (collective) PPS number as I add more UDP servers, but I don't.
> >
> > Running many UDP netperf's with command:
> >  super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N
>
> Are you sure sender can send fast enough ?

Yes, as I can see drops (overrun UDP limit UdpRcvbufErrors).  Switching
to pktgen and udp_sink to be sure.

> >
> > With 'top' I can see ksoftirqd is still getting a higher %CPU time:
> >
> >    PID %CPU     TIME+  COMMAND
> >      3 36.5   2:28.98  ksoftirqd/0
> >  10724  9.6   0:01.05  netserver
> >  10722  9.3   0:01.05  netserver
> >  10723  9.3   0:01.05  netserver
> >  10725  9.3   0:01.05  netserver
>
> Looks much better on my machine, with "udprcv -n 4" (using 4 threads,
> and 4 sockets using SO_REUSEPORT)
>
> 10755 root  20  0  34948  4  0 S 79.7 0.0 0:33.66 udprcv
>     3 root  20  0      0  0  0 R 19.9 0.0 0:25.49 ksoftirqd/0
>
> Pressing 'H' in top gives :
>
>     3 root  20  0      0  0  0 R 19.9 0.0 0:47.84 ksoftirqd/0
> 10756 root  20  0  34948  4  0 R 19.9 0.0 0:30.76 udprcv
> 10757 root  20  0  34948  4  0 R 19.9 0.0 0:30.76 udprcv
> 10758 root  20  0  34948  4  0 S 19.9 0.0 0:30.76 udprcv
> 10759 root  20  0  34948  4  0 S 19.9 0.0 0:30.76 udprcv

Yes, I'm seeing the same when running 5 instances of my own udp_sink[1]:
 sudo taskset -c 0 ./udp_sink --port 10003 --recvmsg --reuse-port --count $((10**10))

   PID S %CPU     TIME+  COMMAND
     3 R 21.6   2:21.33  ksoftirqd/0
  3838 R 15.9   0:02.18  udp_sink
  3856 R 15.6   0:02.16  udp_sink
  3862 R 15.6   0:02.16  udp_sink
  3844 R 15.3   0:02.15  udp_sink
  3850 S 15.3   0:02.15  udp_sink

This is the expected result, that adding more userspace receivers
scales up.  I needed 5 udp_sinks before I stopped seeing drops; either
this says the job performed by ksoftirqd is 5 times faster, or the
collective queue size of the programs was large enough to absorb the
scheduling jitter.

The result from this run was handling 1,517,248 pps, without any
drops, all processes pinned to the same CPU.

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    1517225            0.0
 IpInDelivers                    1517224            0.0
 UdpInDatagrams                  1517248            0.0
 IpExtInOctets                   69793408           0.0
 IpExtInNoECTPkts                1517246            0.0

I'm acking this patch:

Acked-by: Jesper Dangaard Brouer

>
> Patch was on top of commit 071e31e254e0e0c438eecba3dba1d6e2d0da36c2

Mine on top of commit 84fd1b191a9468

> >
> > > Since the load runs in well identified threads context, an admin can
> > > more easily tune process scheduling parameters if needed.
> >
> > With this patch applied, I found that changing the UDP server process'
> > scheduler policy to SCHED_RR or SCHED_FIFO gave me a performance boost
> > from 900Kpps to 1.7Mpps, and not a single UDP packet dropped (even with
> > a single UDP stream, also tested with more).
> >
> > Command used:
> >  sudo chrt --rr -p 20 $(pgrep netserver)
>
> Sure, this is what I mentioned in my changelog : Once we properly
> schedule and rely on ksoftirqd, tuning is available.
>
> >
> > The scheduling picture also changes a lot:
> >
> >    PID %CPU     TIME+  COMMAND
> >  10783 24.3   0:21.53  netserver
> >  10784 24.3   0:21.53  netserver
> >  10785 24.3   0:21.52  netserver
> >  10786 24.3   0:21.50  netserver
> >      3  2.7   3:12.18  ksoftirqd/0
> >

[1] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
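The `taskset -c 0` pinning used for the udp_sink runs can also be done from inside a receiver with sched_setaffinity(). Below is a minimal sketch for Linux/glibc (it needs _GNU_SOURCE for the CPU_* macros). Both helper names are hypothetical, and `first_allowed_cpu()` exists only because CPU 0 is not guaranteed to be in the allowed set on every machine or container.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Pin the calling thread to a single CPU, like `taskset -c <cpu>`.
 * Returns 0 on success, -1 with errno set (e.g. EINVAL if the CPU is
 * not permitted for this thread).
 */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Lowest CPU in the calling thread's current affinity mask, or -1. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) < 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

To reproduce the test setup, every receiver and the NIC's RX interrupt would be placed on the same CPU, so that ksoftirqd/N and the receivers compete for one core.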
Re: [PATCH] softirq: let ksoftirqd do its job
On Wed, 2016-08-31 at 21:40 +0200, Jesper Dangaard Brouer wrote:

> I can confirm the improvement of approx 900Kpps (no wonder people have
> been complaining about DoS against UDP/DNS servers).
>
> BUT during my extensive testing of this patch, I also think that we
> have not gotten to the bottom of this. I was expecting to see a higher
> (collective) PPS number as I add more UDP servers, but I don't.
>
> Running many UDP netperf's with command:
>  super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N

Are you sure sender can send fast enough ?

>
> With 'top' I can see ksoftirqd is still getting a higher %CPU time:
>
>    PID %CPU     TIME+  COMMAND
>      3 36.5   2:28.98  ksoftirqd/0
>  10724  9.6   0:01.05  netserver
>  10722  9.3   0:01.05  netserver
>  10723  9.3   0:01.05  netserver
>  10725  9.3   0:01.05  netserver

Looks much better on my machine, with "udprcv -n 4" (using 4 threads,
and 4 sockets using SO_REUSEPORT)

10755 root  20  0  34948  4  0 S 79.7 0.0 0:33.66 udprcv
    3 root  20  0      0  0  0 R 19.9 0.0 0:25.49 ksoftirqd/0

Pressing 'H' in top gives :

    3 root  20  0      0  0  0 R 19.9 0.0 0:47.84 ksoftirqd/0
10756 root  20  0  34948  4  0 R 19.9 0.0 0:30.76 udprcv
10757 root  20  0  34948  4  0 R 19.9 0.0 0:30.76 udprcv
10758 root  20  0  34948  4  0 S 19.9 0.0 0:30.76 udprcv
10759 root  20  0  34948  4  0 S 19.9 0.0 0:30.76 udprcv

Patch was on top of commit 071e31e254e0e0c438eecba3dba1d6e2d0da36c2

> >
> > Since the load runs in well identified threads context, an admin can
> > more easily tune process scheduling parameters if needed.
>
> With this patch applied, I found that changing the UDP server process'
> scheduler policy to SCHED_RR or SCHED_FIFO gave me a performance boost
> from 900Kpps to 1.7Mpps, and not a single UDP packet dropped (even with
> a single UDP stream, also tested with more).
>
> Command used:
>  sudo chrt --rr -p 20 $(pgrep netserver)

Sure, this is what I mentioned in my changelog : Once we properly
schedule and rely on ksoftirqd, tuning is available.

>
> The scheduling picture also changes a lot:
>
>    PID %CPU     TIME+  COMMAND
>  10783 24.3   0:21.53  netserver
>  10784 24.3   0:21.53  netserver
>  10785 24.3   0:21.52  netserver
>  10786 24.3   0:21.50  netserver
>      3  2.7   3:12.18  ksoftirqd/0
>
Re: [PATCH] softirq: let ksoftirqd do its job
On Wed, 31 Aug 2016 10:42:29 -0700 Eric Dumazet wrote:

> From: Eric Dumazet
>
> A while back, Paolo and Hannes sent an RFC patch adding threadable
> napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/)
>
> The problem seems to be that softirqs are very aggressive and are often
> handled by the current process, even if we are under stress and
> ksoftirqd was scheduled precisely so that innocent threads would have
> more chance to make progress.
>
> This patch makes sure that if ksoftirqd is running, we let it
> perform the softirq work.
>
> Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/
>
> Tested:
>
>  - NIC receiving traffic handled by CPU 0
>  - UDP receiver running on CPU 0, using a single UDP socket.
>  - Incoming flood of UDP packets targeting the UDP socket.
>
> Before the patch, the UDP receiver could almost never get cpu cycles and
> could only receive ~2,000 packets per second.
>
> After the patch, cpu cycles are split 50/50 between user application and
> ksoftirqd/0, and we can effectively read ~900,000 packets per second,
> a huge improvement in DoS situation. (Note that more packets are now
> dropped by the NIC itself, since the BH handlers get fewer cpu cycles to
> drain RX ring buffer)

I can confirm the improvement of approx 900Kpps (no wonder people have
been complaining about DoS against UDP/DNS servers).

BUT during my extensive testing of this patch, I also think that we
have not gotten to the bottom of this.  I was expecting to see a higher
(collective) PPS number as I add more UDP servers, but I don't.

Running many UDP netperf's with command:
 super_netperf 4 -H 198.18.50.3 -l 120 -t UDP_STREAM -T 0,0 -- -m 1472 -n -N

With 'top' I can see ksoftirqd is still getting a higher %CPU time:

   PID %CPU     TIME+  COMMAND
     3 36.5   2:28.98  ksoftirqd/0
 10724  9.6   0:01.05  netserver
 10722  9.3   0:01.05  netserver
 10723  9.3   0:01.05  netserver
 10725  9.3   0:01.05  netserver

> Since the load runs in well identified threads context, an admin can
> more easily tune process scheduling parameters if needed.

With this patch applied, I found that changing the UDP server process'
scheduler policy to SCHED_RR or SCHED_FIFO gave me a performance boost
from 900Kpps to 1.7Mpps, and not a single UDP packet dropped (even with
a single UDP stream, also tested with more).

Command used:
 sudo chrt --rr -p 20 $(pgrep netserver)

The scheduling picture also changes a lot:

   PID %CPU     TIME+  COMMAND
 10783 24.3   0:21.53  netserver
 10784 24.3   0:21.53  netserver
 10785 24.3   0:21.52  netserver
 10786 24.3   0:21.50  netserver
     3  2.7   3:12.18  ksoftirqd/0

> Reported-by: Paolo Abeni
> Reported-by: Hannes Frederic Sowa
> Signed-off-by: Eric Dumazet
> Cc: David Miller
Cc: Jesper Dangaard Brouer
> Cc: Peter Zijlstra
> Cc: Rik van Riel
> ---
>  kernel/softirq.c | 16 +++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 17caf4b63342..8ed90e3a88d6 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -78,6 +78,17 @@ static void wakeup_softirqd(void)
>  }
>  
>  /*
> + * If ksoftirqd is scheduled, we do not want to process pending softirqs
> + * right now. Let ksoftirqd handle this at its own rate, to get fairness.
> + */
> +static bool ksoftirqd_running(void)
> +{
> +	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
> +
> +	return tsk && (tsk->state == TASK_RUNNING);
> +}
> +
> +/*
>   * preempt_count and SOFTIRQ_OFFSET usage:
>   * - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
>   *   softirq processing.
> @@ -313,7 +324,7 @@ asmlinkage __visible void do_softirq(void)
>  
>  	pending = local_softirq_pending();
>  
> -	if (pending)
> +	if (pending && !ksoftirqd_running())
>  		do_softirq_own_stack();
>  
>  	local_irq_restore(flags);
> @@ -340,6 +351,9 @@ void irq_enter(void)
>  
>  static inline void invoke_softirq(void)
>  {
> +	if (ksoftirqd_running())
> +		return;
> +
>  	if (!force_irqthreads) {
>  #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
>  		/*
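`chrt --rr -p 20 <pid>` corresponds to sched_setscheduler(2) with SCHED_RR. Here is a minimal sketch, and the wrapper name is hypothetical. sched_setscheduler() applies to the single thread whose ID is passed (0 means the caller), so a multi-threaded receiver would need one call per thread. Success also requires CAP_SYS_NICE or a sufficient RLIMIT_RTPRIO.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>

/*
 * Switch an existing thread/process to SCHED_RR at priority `prio`,
 * the programmatic counterpart of `chrt --rr -p <prio> <pid>`.
 * Returns 0 on success, -1 with errno set otherwise.
 */
static int set_sched_rr(pid_t pid, int prio)
{
    struct sched_param param = { .sched_priority = prio };

    /* Reject priorities outside the valid SCHED_RR range up front. */
    if (prio < sched_get_priority_min(SCHED_RR) ||
        prio > sched_get_priority_max(SCHED_RR)) {
        errno = EINVAL;
        return -1;
    }
    return sched_setscheduler(pid, SCHED_RR, &param);
}
```

For the netserver case, one call per PID returned by `pgrep netserver` would be equivalent to the command used above, since each netserver child is a separate single-threaded process.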
[PATCH] softirq: let ksoftirqd do its job
From: Eric Dumazet

A while back, Paolo and Hannes sent an RFC patch adding threadable
napi poll loop support : (https://patchwork.ozlabs.org/patch/620657/)

The problem seems to be that softirqs are very aggressive and are often
handled by the current process, even if we are under stress and
ksoftirqd was scheduled precisely so that innocent threads would have
more chance to make progress.

This patch makes sure that if ksoftirqd is running, we let it
perform the softirq work.

Jonathan Corbet summarized the issue in https://lwn.net/Articles/687617/

Tested:

 - NIC receiving traffic handled by CPU 0
 - UDP receiver running on CPU 0, using a single UDP socket.
 - Incoming flood of UDP packets targeting the UDP socket.

Before the patch, the UDP receiver could almost never get cpu cycles and
could only receive ~2,000 packets per second.

After the patch, cpu cycles are split 50/50 between user application and
ksoftirqd/0, and we can effectively read ~900,000 packets per second,
a huge improvement in DoS situation. (Note that more packets are now
dropped by the NIC itself, since the BH handlers get fewer cpu cycles to
drain RX ring buffer)

Since the load runs in well identified threads context, an admin can
more easily tune process scheduling parameters if needed.

Reported-by: Paolo Abeni
Reported-by: Hannes Frederic Sowa
Signed-off-by: Eric Dumazet
Cc: David Miller
Cc: Peter Zijlstra
Cc: Rik van Riel
---
 kernel/softirq.c | 16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 17caf4b63342..8ed90e3a88d6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -78,6 +78,17 @@ static void wakeup_softirqd(void)
 }
 
 /*
+ * If ksoftirqd is scheduled, we do not want to process pending softirqs
+ * right now. Let ksoftirqd handle this at its own rate, to get fairness.
+ */
+static bool ksoftirqd_running(void)
+{
+	struct task_struct *tsk = __this_cpu_read(ksoftirqd);
+
+	return tsk && (tsk->state == TASK_RUNNING);
+}
+
+/*
  * preempt_count and SOFTIRQ_OFFSET usage:
  * - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
  *   softirq processing.
@@ -313,7 +324,7 @@ asmlinkage __visible void do_softirq(void)
 
 	pending = local_softirq_pending();
 
-	if (pending)
+	if (pending && !ksoftirqd_running())
 		do_softirq_own_stack();
 
 	local_irq_restore(flags);
@@ -340,6 +351,9 @@ void irq_enter(void)
 
 static inline void invoke_softirq(void)
 {
+	if (ksoftirqd_running())
+		return;
+
 	if (!force_irqthreads) {
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
 		/*
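The decision the patch adds can be modeled outside the kernel to make its cases explicit. This is a userspace model, not kernel code: `struct cpu_model` and the `_MODEL` state names are hypothetical stand-ins for the per-CPU `ksoftirqd` task pointer and `tsk->state`. Note that TASK_RUNNING in the kernel means runnable (on a run queue), not necessarily executing, so a ksoftirqd that has been woken but not yet scheduled is already enough to defer the work.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's task states. */
enum task_state { TASK_RUNNING_MODEL, TASK_SLEEPING_MODEL };

struct cpu_model {
    bool has_ksoftirqd;              /* per-CPU ksoftirqd task exists */
    enum task_state ksoftirqd_state; /* models tsk->state */
    unsigned int pending;            /* models local_softirq_pending() */
};

/* Mirrors ksoftirqd_running(): the task exists and is runnable. */
static bool model_ksoftirqd_running(const struct cpu_model *c)
{
    return c->has_ksoftirqd && c->ksoftirqd_state == TASK_RUNNING_MODEL;
}

/*
 * Mirrors the patched do_softirq() test: handle pending softirqs inline
 * only when ksoftirqd is not already runnable; otherwise leave the work
 * to ksoftirqd, which the scheduler treats like any other thread.
 * invoke_softirq() applies the same ksoftirqd_running() gate on irq exit.
 */
static bool model_handle_inline(const struct cpu_model *c)
{
    return c->pending && !model_ksoftirqd_running(c);
}
```

Under a sustained flood ksoftirqd stays runnable, so in this model the inline path is skipped and the softirq work is accounted to ksoftirqd/N, which matches the 50/50 CPU split the changelog reports.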