Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
On Tue, Apr 20 2021 at 17:15, Lorenzo Colitti wrote:
> On Fri, Apr 16, 2021 at 1:47 AM Thomas Gleixner wrote:
>> Enable tracing and enable the following tracepoints:
>> [...]
>
> Sorry for the delay. I had to learn a bit about how to use the tracing
> infrastructure. I don't know if I can post here, but to my untrained
> eye, one big difference between the old (fast) code and the new (slow)
> code is that the new code calls tick_program_event() much more. It
> looks like that makes most of the difference.
>
> With the old code, hrtimer_start_range_ns almost never calls
> tick_program_event at all, but the new code seems to call it twice on
> every timer update.

Yes, I figured out why that happens by now, but the old behaviour was
just incorrect. So now there are clearly two issues:

  1) hrtimer is, contrary to timer_list, not really suited for high
     frequency start/cancel/start/... cycles of a timer. It's optimized
     for the start-and-expire-precisely case.

  2) The cost for reprogramming depends on the underlying hardware. With
     halfway sane timer hardware it's minimal and, as I measured in a
     micro benchmark, in the 1% range.

     Now when your system ends up having one of the real timer
     trainwrecks which can be found in drivers/clocksource/, which are
     even worse than the x86 HPET horror, then it's going to be painful
     and a performance issue.

     I assume that's an ARM64 system. ARM64 CPUs have an architected per
     CPU timer where the reprogramming is pretty fast as it's next to
     the CPU, but who knows what your system is using.

In the meantime I looked into __hrtimer_start_range_ns() to see whether
that double reprogram can be avoided without creating a total trainwreck
and imposing a penalty on all sensible use cases.

The completely untested patch below should do the trick, and it's not
ugly enough that I hate it with a passion. Even if it makes your problem
go away, #1 is still beyond suboptimal and #2 is something you really
want to look into.
Thanks,

        tglx
---
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1030,12 +1030,13 @@ static void __remove_hrtimer(struct hrti
  * remove hrtimer, called with base lock held
  */
 static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, bool restart)
+remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+	       bool restart, bool keep_local)
 {
 	u8 state = timer->state;
 
 	if (state & HRTIMER_STATE_ENQUEUED) {
-		int reprogram;
+		bool reprogram;
 
 		/*
 		 * Remove the timer and force reprogramming when high
@@ -1048,8 +1049,16 @@ remove_hrtimer(struct hrtimer *timer, st
 		debug_deactivate(timer);
 		reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
 
+		/*
+		 * If the timer is not restarted then reprogramming is
+		 * required if the timer is local. If it is local and about
+		 * to be restarted, avoid programming it twice (on removal
+		 * and a moment later when it's requeued).
+		 */
 		if (!restart)
 			state = HRTIMER_STATE_INACTIVE;
+		else
+			reprogram &= !keep_local;
 
 		__remove_hrtimer(timer, base, state, reprogram);
 		return 1;
@@ -1103,9 +1112,31 @@ static int __hrtimer_start_range_ns(stru
 				    struct hrtimer_clock_base *base)
 {
 	struct hrtimer_clock_base *new_base;
+	bool force_local, first;
 
-	/* Remove an active timer from the queue: */
-	remove_hrtimer(timer, base, true);
+	/*
+	 * If the timer is on the local cpu base and is the first expiring
+	 * timer then this might end up reprogramming the hardware twice
+	 * (on removal and on enqueue). To avoid that, prevent the
+	 * reprogram on removal: keep the timer local to the current CPU
+	 * and enforce reprogramming after it is queued no matter whether
+	 * it is the new first expiring timer again or not.
+	 */
+	force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
+	force_local &= base->cpu_base->next_timer == timer;
+
+	/*
+	 * Remove an active timer from the queue. In case it is not queued
+	 * on the current CPU, make sure that remove_hrtimer() updates the
+	 * remote data correctly.
+	 *
+	 * If it's on the current CPU and the first expiring timer, then
+	 * skip reprogramming, keep the timer local and enforce
+	 * reprogramming later if it was the first expiring timer. This
+	 * avoids programming the underlying clock event twice (once at
+	 * removal and once after enqueue).
+	 */
+	remove_hrtimer(timer, base, true, force_local);
 
 	if (mode & HRTIMER_MODE_REL)
 		tim = ktime_add_safe(tim,
Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
On Fri, Apr 16, 2021 at 1:47 AM Thomas Gleixner wrote:
> Enable tracing and enable the following tracepoints:
> [...]

Sorry for the delay. I had to learn a bit about how to use the tracing
infrastructure. I don't know if I can post here, but to my untrained
eye, one big difference between the old (fast) code and the new (slow)
code is that the new code calls tick_program_event() much more. It
looks like that makes most of the difference.

With the old code, hrtimer_start_range_ns almost never calls
tick_program_event at all, but the new code seems to call it twice on
every timer update.
Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
On Mon, Apr 19 2021 at 20:12, Maciej Żenczykowski wrote:
> On Thu, Apr 15, 2021 at 9:47 AM Thomas Gleixner wrote:
>> Run the test on kernels with and without that commit and collect trace
>> data for both.
>>
>> That should give me a pretty clear picture of what's going on.
>
> Lorenzo is trying to get the traces you asked for, or rather he's
> trying to get confirmation he can post them.
>
> Our initial observation of these results seems to suggest that
> updating the timer (hrtimer_start, which seems to also call
> hrtimer_cancel) takes twice as long as it used to.

Which contradicts my measurements. The change in complexity is marginal
and the overhead in cycles/instructions is close to noise. It's
measurable in a microbenchmark, but it's in the < 1% range, which is far
away from the 60% you are seeing.

> My gut feeling is that softirq_activated is usually false, and the old
> code in such a case calls just __hrtimer_get_next_event(,
> HRTIMER_ACTIVE_ALL). While the new code will first call
> __hrtimer_get_next_event(, HRTIMER_ACTIVE_SOFT) and then
> __hrtimer_get_next_event(, HRTIMER_ACTIVE_HARD)
>
> Perhaps __hrtimer_get_next_event() should return both soft and hard
> event times in one function call?
> Or perhaps hrtimer_start should not call hrtimer_cancel?

Perhaps we do a proper analysis first :)

Thanks,

        tglx
Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
On Thu, Apr 15, 2021 at 9:47 AM Thomas Gleixner wrote:
> On Wed, Apr 14 2021 at 11:49, Lorenzo Colitti wrote:
> > On Wed, Apr 14, 2021 at 2:14 AM Greg KH wrote:
> >> To give context, the commit is now 46eb1701c046 ("hrtimer: Update
> >> softirq_expires_next correctly after __hrtimer_get_next_event()") and is
> >> attached below.
> >>
> >> The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
> >> HRTIMER_MODE_REL_SOFT for I think every packet it gets. If that should
> >> not be happening, we can easily fix it but I guess no one has actually
> >> had fast USB devices that noticed this until now :)
> >
> > AIUI the purpose of this timer is to support packet aggregation. USB
> > transfers are relatively expensive/high latency. 6 Gbps is 500k
> > 1500-byte packets per second, or one every 2us. So f_ncm buffers as
> > many packets as will fit into 16k (usually, 10 1500-byte packets), and
> > only initiates a USB transfer when those packets have arrived. That
> > ends up doing only one transfer every 20us. It sets a 300us timer to
> > ensure that if the 10 packets haven't arrived, it still sends out
> > whatever it has when the timer fires. The timer is set/updated on
> > every packet buffered by ncm.
> >
> > Is this somehow much more expensive in 5.10.24 than it was before?
> > Even if this driver is somehow "holding it wrong", might there not be
> > other workloads that have a similar problem? What about regressions on
> > those workloads?
>
> Let's put the question of whether this hrtimer usage is sensible or not
> aside for now.
>
> I stared at the change for a while and did some experiments to recreate
> the problem, but that didn't get me anywhere.
>
> Could you please do the following?
>
> Enable tracing and enable the following tracepoints:
>
>    timers/hrtimer_cancel
>    timers/hrtimer_start
>    timers/hrtimer_expire_entry
>    irq/softirq_raise
>    irq/softirq_enter
>    irq/softirq_exit
>
> and function tracing filtered on ncm_wrap_ntb() and
> package_for_tx() only (to reduce the noise).
>
> Run the test on kernels with and without that commit and collect trace
> data for both.
>
> That should give me a pretty clear picture of what's going on.

Lorenzo is trying to get the traces you asked for, or rather he's
trying to get confirmation he can post them.

Our initial observation of these results seems to suggest that
updating the timer (hrtimer_start, which seems to also call
hrtimer_cancel) takes twice as long as it used to.

My gut feeling is that softirq_activated is usually false, and the old
code in such a case calls just __hrtimer_get_next_event(,
HRTIMER_ACTIVE_ALL), while the new code will first call
__hrtimer_get_next_event(, HRTIMER_ACTIVE_SOFT) and then
__hrtimer_get_next_event(, HRTIMER_ACTIVE_HARD).

Perhaps __hrtimer_get_next_event() should return both soft and hard
event times in one function call?
Or perhaps hrtimer_start should not call hrtimer_cancel?

We've also been looking at the f_ncm driver itself, and trying to
reduce the number of hrtimer_cancel/start calls. It's pretty easy to
reduce this by a factor of 10x (we're not yet 100% certain this is
race free), which does drastically improve performance. However, it
still only takes the regression from 60% down to 20%.

- Maciej
Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
On Wed, Apr 14 2021 at 11:49, Lorenzo Colitti wrote:
> On Wed, Apr 14, 2021 at 2:14 AM Greg KH wrote:
>> To give context, the commit is now 46eb1701c046 ("hrtimer: Update
>> softirq_expires_next correctly after __hrtimer_get_next_event()") and is
>> attached below.
>>
>> The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
>> HRTIMER_MODE_REL_SOFT for I think every packet it gets. If that should
>> not be happening, we can easily fix it but I guess no one has actually
>> had fast USB devices that noticed this until now :)
>
> AIUI the purpose of this timer is to support packet aggregation. USB
> transfers are relatively expensive/high latency. 6 Gbps is 500k
> 1500-byte packets per second, or one every 2us. So f_ncm buffers as
> many packets as will fit into 16k (usually, 10 1500-byte packets), and
> only initiates a USB transfer when those packets have arrived. That
> ends up doing only one transfer every 20us. It sets a 300us timer to
> ensure that if the 10 packets haven't arrived, it still sends out
> whatever it has when the timer fires. The timer is set/updated on
> every packet buffered by ncm.
>
> Is this somehow much more expensive in 5.10.24 than it was before?
> Even if this driver is somehow "holding it wrong", might there not be
> other workloads that have a similar problem? What about regressions on
> those workloads?

Let's put the question of whether this hrtimer usage is sensible or not
aside for now.

I stared at the change for a while and did some experiments to recreate
the problem, but that didn't get me anywhere.

Could you please do the following?

Enable tracing and enable the following tracepoints:

   timers/hrtimer_cancel
   timers/hrtimer_start
   timers/hrtimer_expire_entry
   irq/softirq_raise
   irq/softirq_enter
   irq/softirq_exit

and function tracing filtered on ncm_wrap_ntb() and
package_for_tx() only (to reduce the noise).

Run the test on kernels with and without that commit and collect trace
data for both.

That should give me a pretty clear picture of what's going on.

Thanks,

        tglx
Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
On Wed, Apr 14, 2021 at 2:14 AM Greg KH wrote:
> To give context, the commit is now 46eb1701c046 ("hrtimer: Update
> softirq_expires_next correctly after __hrtimer_get_next_event()") and is
> attached below.
>
> The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
> HRTIMER_MODE_REL_SOFT for I think every packet it gets. If that should
> not be happening, we can easily fix it but I guess no one has actually
> had fast USB devices that noticed this until now :)

AIUI the purpose of this timer is to support packet aggregation. USB
transfers are relatively expensive/high latency. 6 Gbps is 500k
1500-byte packets per second, or one every 2us. So f_ncm buffers as
many packets as will fit into 16k (usually, 10 1500-byte packets), and
only initiates a USB transfer when those packets have arrived. That
ends up doing only one transfer every 20us. It sets a 300us timer to
ensure that if the 10 packets haven't arrived, it still sends out
whatever it has when the timer fires. The timer is set/updated on
every packet buffered by ncm.

Is this somehow much more expensive in 5.10.24 than it was before?
Even if this driver is somehow "holding it wrong", might there not be
other workloads that have a similar problem? What about regressions on
those workloads?
Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
On Tue, Apr 13, 2021 at 09:55:08AM -0700, Maciej Żenczykowski wrote:
> This patch (or at least the version of it that showed up in 5.10.24
> LTS when combined with Android Common Kernel and some arm phone
> platform drivers) causes a roughly 60% regression (from ~5.3-6 gbps
> down to ~2.2gbps) on running pktgen when egressing via ncm gadget on a
> SuperSpeed+ 10gbps USB3 connection.
>
> The regression is not there in 5.10.23, and is present in 5.10.24 and
> 5.10.26. Reverting just this one patch is confirmed to restore
> performance (both on 5.10.24 and 5.10.26).
>
> We don't know the cause, as we know nothing about hrtimers... but we
> lightly suspect that the ncm->task_timer in f_ncm.c is perhaps not
> firing as often as it should...
>
> Unfortunately I have no idea how to replicate this on commonly
> available hardware (or with a pure stable or 5.11/5.12 Linus tree)
> since it requires a gadget-capable usb3.1 10gbps controller (which I
> only have access to in combination with a 5.10-based arm+dwc3 soc).
>
> (the regression is visible with just usb3.0, but it's smaller due to
> usb3.0 topping out at just under 4gbps, though it still drops to
> 2.2gbps -- and this still doesn't help since usb3 gadget-capable
> controllers are nearly as rare)

To give context, the commit is now 46eb1701c046 ("hrtimer: Update
softirq_expires_next correctly after __hrtimer_get_next_event()") and is
attached below.

The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
HRTIMER_MODE_REL_SOFT for I think every packet it gets.
If that should not be happening, we can easily fix it but I guess no one
has actually had fast USB devices that noticed this until now :)

thanks,

greg k-h

--

commit 46eb1701c046cc18c032fa68f3c8ccbf24483ee4
Author: Anna-Maria Behnsen
Date:   Tue Feb 23 17:02:40 2021 +0100

    hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

    hrtimer_force_reprogram() and hrtimer_interrupt() invoke
    __hrtimer_get_next_event() to find the earliest expiry time of hrtimer
    bases. __hrtimer_get_next_event() does not update
    cpu_base::[softirq_]expires_next to preserve reprogramming logic. That
    needs to be done at the callsites.

    hrtimer_force_reprogram() updates cpu_base::softirq_expires_next only
    when the first expiring timer is a softirq timer and the soft interrupt
    is not activated. That's wrong because cpu_base::softirq_expires_next
    is left stale when the first expiring timer of all bases is a timer
    which expires in hard interrupt context. hrtimer_interrupt() never
    updates cpu_base::softirq_expires_next, which is wrong too.

    That becomes a problem when clock_settime() sets CLOCK_REALTIME forward
    and the first soft expiring timer is in the CLOCK_REALTIME_SOFT base.
    Setting CLOCK_REALTIME forward moves the clock MONOTONIC based expiry
    time of that timer before the stale cpu_base::softirq_expires_next.

    cpu_base::softirq_expires_next is cached to make the check for raising
    the soft interrupt fast. In the above case the soft interrupt won't be
    raised until clock monotonic reaches the stale
    cpu_base::softirq_expires_next value. That's incorrect, but what's
    worse is that if the softirq timer becomes the first expiring timer of
    all clock bases after the hard expiry timer has been handled, the
    reprogramming of the clockevent from hrtimer_interrupt() will result
    in an interrupt storm. That happens because the reprogramming does not
    use cpu_base::softirq_expires_next, it uses __hrtimer_get_next_event()
    which returns the actual expiry time. Once clock MONOTONIC reaches
    cpu_base::softirq_expires_next the soft interrupt is raised and the
    storm subsides.

    Change the logic in hrtimer_force_reprogram() to evaluate the soft and
    hard bases separately, update softirq_expires_next and handle the case
    when a soft expiring timer is the first of all bases by comparing the
    expiry times and updating the required cpu base fields. Split this
    functionality into a separate function to be able to use it in
    hrtimer_interrupt() as well without copy paste.

    Fixes: 5da70160462e ("hrtimer: Implement support for softirq based hrtimers")
    Reported-by: Mikael Beckius
    Suggested-by: Thomas Gleixner
    Tested-by: Mikael Beckius
    Signed-off-by: Anna-Maria Behnsen
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20210223160240.27518-1-anna-ma...@linutronix.de

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 743c852e10f2..788b9d137de4 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -546,8 +546,11 @@ static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
 }
 
 /*
- * Recomputes cpu_base::*next_timer and returns the earliest expires_next but
- * does not set cpu_base::*expires_next, that is done by hrtimer_reprogram.
Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
This patch (or at least the version of it that showed up in 5.10.24
LTS when combined with Android Common Kernel and some arm phone
platform drivers) causes a roughly 60% regression (from ~5.3-6 gbps
down to ~2.2gbps) on running pktgen when egressing via ncm gadget on a
SuperSpeed+ 10gbps USB3 connection.

The regression is not there in 5.10.23, and is present in 5.10.24 and
5.10.26. Reverting just this one patch is confirmed to restore
performance (both on 5.10.24 and 5.10.26).

We don't know the cause, as we know nothing about hrtimers... but we
lightly suspect that the ncm->task_timer in f_ncm.c is perhaps not
firing as often as it should...

Unfortunately I have no idea how to replicate this on commonly
available hardware (or with a pure stable or 5.11/5.12 Linus tree)
since it requires a gadget-capable usb3.1 10gbps controller (which I
only have access to in combination with a 5.10-based arm+dwc3 soc).

(the regression is visible with just usb3.0, but it's smaller due to
usb3.0 topping out at just under 4gbps, though it still drops to
2.2gbps -- and this still doesn't help since usb3 gadget-capable
controllers are nearly as rare)

- Maciej & Lorenzo