Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

2021-04-20 Thread Thomas Gleixner
On Tue, Apr 20 2021 at 17:15, Lorenzo Colitti wrote:
> On Fri, Apr 16, 2021 at 1:47 AM Thomas Gleixner  wrote:
>> Enable tracing and enable the following tracepoints:
>> [...]
>
> Sorry for the delay. I had to learn a bit about how to use the tracing
> infrastructure. I don't know yet if I can post the traces here, but to
> my untrained eye, one big difference between the old (fast) code and
> the new (slow) code is that the new code calls tick_program_event()
> much more. It looks like that makes most of the difference.
>
> With the old code, hrtimer_start_range_ns almost never calls
> tick_program_event at all, but the new code seems to call it twice on
> every timer update.

Yes, I've figured out by now why that happens, but the old behaviour was
simply incorrect. So now there are clearly two issues:

  1) hrtimer, contrary to timer_list, is not really suited for high
     frequency start/cancel/start/... cycles of a timer. It's optimized
     for the "start once, expire precisely" case (see the sketch after
     this list).

  2) The cost for reprogramming depends on the underlying hardware. With
     halfway sane timer hardware it's minimal, and as I measured in a
     microbenchmark it's in the 1% range. But when your system ends up
     having one of the real timer trainwrecks found in
     drivers/clocksource/, some of which are even worse than the x86
     HPET horror, then it's going to be painful and a performance issue.

     I assume that's an ARM64 system. ARM64 CPUs have an architected per
     CPU timer where reprogramming is pretty fast as it's right next to
     the CPU, but who knows what your system is using.
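
To make #1 concrete, here is a minimal sketch of the start-heavy pattern
in question -- a driver re-arming a flush timer on every buffered packet,
so hrtimer_start() runs at packet rate while the timer almost never
expires. All names here are made up for illustration; this is not the
f_ncm.c code:

#include <linux/hrtimer.h>
#include <linux/ktime.h>

#define FLUSH_TIMEOUT_NS	(300 * NSEC_PER_USEC)	/* 300us, as in f_ncm */

struct agg_dev {
	struct hrtimer	flush_timer;	/* hrtimer_init()'d elsewhere */
};

static void agg_dev_buffer_packet(struct agg_dev *dev)
{
	/* ... append the packet to the aggregation buffer ... */

	/*
	 * Re-arm on every packet. hrtimer_start() removes a queued
	 * timer before requeueing it, which is where the clockevent
	 * can end up being programmed twice per call.
	 */
	hrtimer_start(&dev->flush_timer, ns_to_ktime(FLUSH_TIMEOUT_NS),
		      HRTIMER_MODE_REL_SOFT);
}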

Now in the meantime I looked into __hrtimer_start_range_ns() to see
whether that double reprogram can be avoided without creating a total
trainwreck and imposing a penalty on all sensible use cases. The
completely untested patch below should do the trick, and it's not ugly
enough that I hate it with a passion.

Even if that makes your problem go away, #1 is still beyond suboptimal
and #2 is something you really want to look into.

Thanks,

tglx
---
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1030,12 +1030,13 @@ static void __remove_hrtimer(struct hrti
  * remove hrtimer, called with base lock held
  */
 static inline int
-remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, bool restart)
+remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base,
+	       bool restart, bool keep_local)
 {
u8 state = timer->state;
 
if (state & HRTIMER_STATE_ENQUEUED) {
-   int reprogram;
+   bool reprogram;
 
/*
 * Remove the timer and force reprogramming when high
@@ -1048,8 +1049,16 @@ remove_hrtimer(struct hrtimer *timer, st
debug_deactivate(timer);
reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
 
+   /*
+* If the timer is not restarted then reprogramming is
+* required if the timer is local. If it is local and about
+* to be restarted, avoid programming it twice (on removal
+* and a moment later when it's requeued).
+*/
if (!restart)
state = HRTIMER_STATE_INACTIVE;
+   else
+   reprogram &= !keep_local;
 
__remove_hrtimer(timer, base, state, reprogram);
return 1;
@@ -1103,9 +1112,31 @@ static int __hrtimer_start_range_ns(stru
struct hrtimer_clock_base *base)
 {
struct hrtimer_clock_base *new_base;
+   bool force_local, first;
 
-   /* Remove an active timer from the queue: */
-   remove_hrtimer(timer, base, true);
+   /*
+    * If the timer is on the local cpu base and is the first expiring
+    * timer then this might end up reprogramming the hardware twice
+    * (on removal and on enqueue). To avoid that, prevent the
+    * reprogram on removal: keep the timer local to the current CPU
+    * and enforce reprogramming after it is queued, no matter whether
+    * it is the new first expiring timer again or not.
+    */
+   force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
+   force_local &= base->cpu_base->next_timer == timer;
+
+   /*
+* Remove an active timer from the queue. In case it is not queued
+* on the current CPU, make sure that remove_hrtimer() updates the
+* remote data correctly.
+*
+* If it's on the current CPU and the first expiring timer, then
+* skip reprogramming, keep the timer local and enforce
+* reprogramming later if it was the first expiring timer.  This
+* avoids programming the underlying clock event twice (once at
+* removal and once after enqueue).
+*/
+   remove_hrtimer(timer, base, true, force_local);
 
if (mode & HRTIMER_MODE_REL)
tim = ktime_add_safe(tim, base->get_time());
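
The flow this aims for, when the re-armed timer is the local CPU's
first expiring timer, is roughly as follows -- a sketch, assuming the
unquoted remainder of the patch follows the pattern of the hunks above:

/*
 * __hrtimer_start_range_ns()
 *   remove_hrtimer(..., restart=true, keep_local=true)  -> no reprogram here
 *   enqueue_hrtimer()
 *   one forced reprogram after the requeue  -> single tick_program_event()
 */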

Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

2021-04-20 Thread Lorenzo Colitti
On Fri, Apr 16, 2021 at 1:47 AM Thomas Gleixner  wrote:
> Enable tracing and enable the following tracepoints:
> [...]

Sorry for the delay. I had to learn a bit about how to use the tracing
infrastructure. I don't know yet if I can post the traces here, but to
my untrained eye, one big difference between the old (fast) code and
the new (slow) code is that the new code calls tick_program_event()
much more. It looks like that makes most of the difference.

With the old code, hrtimer_start_range_ns almost never calls
tick_program_event at all, but the new code seems to call it twice on
every timer update.
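
For reference, my rough reading of where the two calls come from
(abridged from kernel/time/hrtimer.c; exact names vary by version, and
the SOFT-mode path is more involved, but both legs end up programming
the clockevent as well):

/*
 * hrtimer_start_range_ns()
 *   __hrtimer_start_range_ns()
 *     remove_hrtimer(timer, base, restart=true)
 *       __remove_hrtimer(..., reprogram=true)
 *         hrtimer_force_reprogram() -> tick_program_event()  [1st call]
 *     enqueue_hrtimer()
 *   hrtimer_reprogram()             -> tick_program_event()  [2nd call]
 */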


Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

2021-04-20 Thread Thomas Gleixner
On Mon, Apr 19 2021 at 20:12, Maciej Żenczykowski wrote:
> On Thu, Apr 15, 2021 at 9:47 AM Thomas Gleixner  wrote:
>> Run the test on kernels with and without that commit and collect trace
>> data for both.
>>
>> That should give me a pretty clear picture of what's going on.
>
> Lorenzo is trying to get the traces you asked for, or rather he’s
> trying to get confirmation he can post them.
>
> Our initial observation of these results seems to suggest that
> updating the timer (hrtimer_start, which seems to also call
> hrtimer_cancel) takes twice as long as it used to.

Which contradicts my measurements. The change in complexity is marginal
and the overhead in cycles/instructions is close to noise. It's
measurable in a microbenchmark, but it's in the < 1% range, which is far
away from the 60% you are seeing.

> My gut feeling is that softirq_activated is usually false, and in that
> case the old code calls just __hrtimer_get_next_event(cpu_base,
> HRTIMER_ACTIVE_ALL), while the new code first calls
> __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT) and then
> __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD).
>
> Perhaps __hrtimer_get_next_event() should return both soft and hard
> event times in one function call?
> Or perhaps hrtimer_start should not call hrtimer_cancel?

Perhaps we do a proper analysis first :)

Thanks,

tglx


Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

2021-04-19 Thread Maciej Żenczykowski
On Thu, Apr 15, 2021 at 9:47 AM Thomas Gleixner  wrote:
>
> On Wed, Apr 14 2021 at 11:49, Lorenzo Colitti wrote:
> > On Wed, Apr 14, 2021 at 2:14 AM Greg KH  wrote:
> >> To give context, the commit is now 46eb1701c046 ("hrtimer: Update
> >> softirq_expires_next correctly after __hrtimer_get_next_event()") and is
> >> attached below.
> >>
> >> The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
> >> HRTIMER_MODE_REL_SOFT for I think every packet it gets.  If that should
> >> not be happening, we can easily fix it but I guess no one has actually
> >> had fast USB devices that noticed this until now :)
> >
> > AIUI the purpose of this timer is to support packet aggregation. USB
> > transfers are relatively expensive/high latency. 6 Gbps is 500k
> > 1500-byte packets per second, or one every 2us. So f_ncm buffers as
> > many packets as will fit into 16k (usually, 10 1500-byte packets), and
> > only initiates a USB transfer when those packets have arrived. That
> > ends up doing only one transfer every 20us. It sets a 300us timer to
> > ensure that if the 10 packets haven't arrived, it still sends out
> > whatever it has when the timer fires. The timer is set/updated on
> > every packet buffered by ncm.
> >
> > Is this somehow much more expensive in 5.10.24 than it was before?
> > Even if this driver is somehow "holding it wrong", might there not be
> > other workloads that have a similar problem? What about regressions on
> > those workloads?
>
> Let's put the question of whether this hrtimer usage is sensible or not
> aside for now.
>
> I stared at the change for a while and did some experiments to recreate
> the problem, but that didn't get me anywhere.
>
> Could you please do the following?
>
> Enable tracing and enable the following tracepoints:
>
> timers/hrtimer_cancel
> timers/hrtimer_start
> timers/hrtimer_expire_entry
> irq/softirq_raise
> irq/softirq_enter
> irq/softirq_exit
>
> and function tracing filtered on ncm_wrap_ntb() and
> package_for_tx() only (to reduce the noise).
>
> Run the test on kernels with and without that commit and collect trace
> data for both.
>
> That should give me a pretty clear picture of what's going on.

Lorenzo is trying to get the traces you asked for, or rather he’s
trying to get confirmation he can post them.

Our initial observation of these results seems to suggest that
updating the timer (hrtimer_start, which seems to also call
hrtimer_cancel) takes twice as long as it used to.

My gut feeling is that softirq_activated is usually false, and in that
case the old code calls just __hrtimer_get_next_event(cpu_base,
HRTIMER_ACTIVE_ALL), while the new code first calls
__hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT) and then
__hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD).

Perhaps __hrtimer_get_next_event() should return both soft and hard
event times in one function call?
Or perhaps hrtimer_start should not call hrtimer_cancel?
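
A minimal sketch of what the first option could look like -- a
hypothetical helper, not kernel code; only the
__hrtimer_get_next_event(cpu_base, active_mask) signature is real:

static ktime_t hrtimer_next_event_soft_and_hard(struct hrtimer_cpu_base *cpu_base,
						ktime_t *soft_expires)
{
	ktime_t soft = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
	ktime_t hard = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD);

	/* Hand the soft expiry back so the caller can cache it. */
	*soft_expires = soft;

	return ktime_before(soft, hard) ? soft : hard;
}

(The two scans still happen internally; folding them into a single pass
over the bases would be the further step.)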

We've also been looking at the f_ncm driver itself, trying to reduce
the number of hrtimer_cancel/start calls. It's pretty easy to reduce
them by a factor of 10 (though we're not yet 100% certain this is race
free), which does drastically improve performance.
However, it only takes the regression from 60% down to 20%.
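
A sketch of the direction we tried (illustrative only: ncm->task_timer
and TX_TIMEOUT_NSECS are real f_ncm.c names, the buffered_pkts counter
is made up, and the real change may differ -- this is also where the
race-freedom question above comes from):

/* Arm the flush timer only when the aggregation buffer goes from
 * empty to non-empty, instead of re-arming on every packet. */
static void ncm_note_buffered_packet(struct f_ncm *ncm)
{
	if (ncm->buffered_pkts++ == 0)
		hrtimer_start(&ncm->task_timer, TX_TIMEOUT_NSECS,
			      HRTIMER_MODE_REL_SOFT);
}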

- Maciej


Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

2021-04-15 Thread Thomas Gleixner
On Wed, Apr 14 2021 at 11:49, Lorenzo Colitti wrote:
> On Wed, Apr 14, 2021 at 2:14 AM Greg KH  wrote:
>> To give context, the commit is now 46eb1701c046 ("hrtimer: Update
>> softirq_expires_next correctly after __hrtimer_get_next_event()") and is
>> attached below.
>>
>> The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
>> HRTIMER_MODE_REL_SOFT for I think every packet it gets.  If that should
>> not be happening, we can easily fix it but I guess no one has actually
>> had fast USB devices that noticed this until now :)
>
> AIUI the purpose of this timer is to support packet aggregation. USB
> transfers are relatively expensive/high latency. 6 Gbps is 500k
> 1500-byte packets per second, or one every 2us. So f_ncm buffers as
> many packets as will fit into 16k (usually, 10 1500-byte packets), and
> only initiates a USB transfer when those packets have arrived. That
> ends up doing only one transfer every 20us. It sets a 300us timer to
> ensure that if the 10 packets haven't arrived, it still sends out
> whatever it has when the timer fires. The timer is set/updated on
> every packet buffered by ncm.
>
> Is this somehow much more expensive in 5.10.24 than it was before?
> Even if this driver is somehow "holding it wrong", might there not be
> other workloads that have a similar problem? What about regressions on
> those workloads?

Let's put the question of whether this hrtimer usage is sensible or not
aside for now.

I stared at the change for a while and did some experiments to recreate
the problem, but that didn't get me anywhere.

Could you please do the following?

Enable tracing and enable the following tracepoints:

timers/hrtimer_cancel
timers/hrtimer_start
timers/hrtimer_expire_entry
irq/softirq_raise
irq/softirq_enter
irq/softirq_exit

and function tracing filtered on ncm_wrap_ntb() and
package_for_tx() only (to reduce the noise).

Run the test on kernels with and without that commit and collect trace
data for both.

That should give me a pretty clear picture of what's going on.

Thanks,

tglx


Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

2021-04-13 Thread Lorenzo Colitti
On Wed, Apr 14, 2021 at 2:14 AM Greg KH  wrote:
> To give context, the commit is now 46eb1701c046 ("hrtimer: Update
> softirq_expires_next correctly after __hrtimer_get_next_event()") and is
> attached below.
>
> The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
> HRTIMER_MODE_REL_SOFT for I think every packet it gets.  If that should
> not be happening, we can easily fix it but I guess no one has actually
> had fast USB devices that noticed this until now :)

AIUI the purpose of this timer is to support packet aggregation. USB
transfers are relatively expensive/high latency. 6 Gbps is 500k
1500-byte packets per second, or one every 2us. So f_ncm buffers as
many packets as will fit into 16k (usually, 10 1500-byte packets), and
only initiates a USB transfer when those packets have arrived. That
ends up doing only one transfer every 20us. It sets a 300us timer to
ensure that if the 10 packets haven't arrived, it still sends out
whatever it has when the timer fires. The timer is set/updated on
every packet buffered by ncm.

Is this somehow much more expensive in 5.10.24 than it was before?
Even if this driver is somehow "holding it wrong", might there not be
other workloads that have a similar problem? What about regressions on
those workloads?


Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

2021-04-13 Thread Greg KH
On Tue, Apr 13, 2021 at 09:55:08AM -0700, Maciej Żenczykowski wrote:
> This patch (or at least the version of it that showed up in 5.10.24
> LTS when combined with Android Common Kernel and some arm phone
> platform drivers) causes a roughly 60% regression (from ~5.3-6 gbps
> down to ~2.2gbps) on running pktgen when egressing via ncm gadget on a
> SuperSpeed+ 10gbps USB3 connection.
> 
> The regression is not there in 5.10.23, and is present in 5.10.24 and
> 5.10.26.  Reverting just this one patch is confirmed to restore
> performance (both on 5.10.24 and 5.10.26).
> 
> We don't know the cause, as we know nothing about hrtimers... but we
> lightly suspect that the ncm->task_timer in f_ncm.c is perhaps not
> firing as often as it should...
> 
> Unfortunately I have no idea how to replicate this on commonly
> available hardware (or with a pure stable or 5.11/5.12 Linus tree)
> since it requires a gadget capable usb3.1 10gbps controller (which I
> only have access to in combination with a 5.10-based arm+dwc3 soc).
> 
> (the regression is visible with just usb3.0, but it's smaller due to
> usb3.0 topping out at just under 4gbps, though it still drops to
> 2.2gbps -- and this still doesn't help since usb3 gadget capable
> controllers are nearly as rare)
>

To give context, the commit is now 46eb1701c046 ("hrtimer: Update
softirq_expires_next correctly after __hrtimer_get_next_event()") and is
attached below.

The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode
HRTIMER_MODE_REL_SOFT for I think every packet it gets.  If that should
not be happening, we can easily fix it but I guess no one has actually
had fast USB devices that noticed this until now :)

thanks,

greg k-h

--

commit 46eb1701c046cc18c032fa68f3c8ccbf24483ee4
Author: Anna-Maria Behnsen 
Date:   Tue Feb 23 17:02:40 2021 +0100

hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

hrtimer_force_reprogram() and hrtimer_interrupt() invoke
__hrtimer_get_next_event() to find the earliest expiry time of hrtimer
bases. __hrtimer_get_next_event() does not update
cpu_base::[softirq_]expires_next in order to preserve the reprogramming
logic. That needs to be done at the callsites.

hrtimer_force_reprogram() updates cpu_base::softirq_expires_next only when
the first expiring timer is a softirq timer and the soft interrupt is not
activated. That's wrong because cpu_base::softirq_expires_next is left
stale when the first expiring timer of all bases is a timer which expires
in hard interrupt context. hrtimer_interrupt() never updates
cpu_base::softirq_expires_next, which is wrong too.

That becomes a problem when clock_settime() sets CLOCK_REALTIME forward and
the first soft expiring timer is in the CLOCK_REALTIME_SOFT base. Setting
CLOCK_REALTIME forward moves the clock MONOTONIC based expiry time of that
timer before the stale cpu_base::softirq_expires_next.

cpu_base::softirq_expires_next is cached to make the check for raising the
soft interrupt fast. In the above case the soft interrupt won't be raised
until clock monotonic reaches the stale cpu_base::softirq_expires_next
value. That's incorrect, but what's worse is that if the softirq timer
becomes the first expiring timer of all clock bases after the hard expiry
timer has been handled, the reprogramming of the clockevent from
hrtimer_interrupt() will result in an interrupt storm. That happens because
the reprogramming does not use cpu_base::softirq_expires_next, it uses
__hrtimer_get_next_event() which returns the actual expiry time. Once clock
MONOTONIC reaches cpu_base::softirq_expires_next the soft interrupt is
raised and the storm subsides.

Change the logic in hrtimer_force_reprogram() to evaluate the soft and hard
bases separately, update softirq_expires_next and handle the case when a
soft expiring timer is the first of all bases by comparing the expiry times
and updating the required cpu base fields. Split this functionality into a
separate function to be able to use it in hrtimer_interrupt() as well
without copy-pasting.

Fixes: 5da70160462e ("hrtimer: Implement support for softirq based hrtimers")
Reported-by: Mikael Beckius 
Suggested-by: Thomas Gleixner 
Tested-by: Mikael Beckius 
Signed-off-by: Anna-Maria Behnsen 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20210223160240.27518-1-anna-ma...@linutronix.de

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 743c852e10f2..788b9d137de4 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -546,8 +546,11 @@ static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
 }
 
 /*
- * Recomputes cpu_base::*next_timer and returns the earliest expires_next but
- * does not set cpu_base::*expires_next, that is done by hrtimer_reprogram.
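
For reference, the helper that this commit splits out of
hrtimer_force_reprogram() looks roughly like this (abridged
reconstruction; see the Link above for the authoritative patch):

static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
{
	ktime_t expires_next, soft = KTIME_MAX;

	/*
	 * If the soft interrupt has already been activated, ignore the
	 * soft bases; they will be handled in the raised soft interrupt.
	 */
	if (!cpu_base->softirq_activated) {
		soft = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
		/* Keep the cached soft expiry up to date. */
		cpu_base->softirq_expires_next = soft;
	}

	expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD);

	/* If a soft timer expires first, it determines the next event. */
	if (ktime_before(soft, expires_next)) {
		cpu_base->next_timer = cpu_base->softirq_next_timer;
		expires_next = soft;
	}

	return expires_next;
}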

Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()

2021-04-13 Thread Maciej Żenczykowski
This patch (or at least the version of it that showed up in 5.10.24
LTS when combined with Android Common Kernel and some arm phone
platform drivers) causes a roughly 60% regression (from ~5.3-6 gbps
down to ~2.2gbps) on running pktgen when egressing via ncm gadget on a
SuperSpeed+ 10gbps USB3 connection.

The regression is not there in 5.10.23, and is present in 5.10.24 and
5.10.26.  Reverting just this one patch is confirmed to restore
performance (both on 5.10.24 and 5.10.26).

We don't know the cause, as we know nothing about hrtimers... but we
lightly suspect that the ncm->task_timer in f_ncm.c is perhaps not
firing as often as it should...

Unfortunately I have no idea how to replicate this on commonly
available hardware (or with a pure stable or 5.11/5.12 Linus tree)
since it requires a gadget capable usb3.1 10gbps controller (which I
only have access to in combination with a 5.10-based arm+dwc3 soc).

(the regression is visible with just usb3.0, but it's smaller due to
usb3.0 topping out at just under 4gbps, though it still drops to
2.2gbps -- and this still doesn't help since usb3 gadget capable
controllers are nearly as rare)

- Maciej & Lorenzo