Re: [patch 1/1] genirq: Disable interrupts for force threaded handlers
On Wed, Mar 17 2021 at 15:48, Sebastian Andrzej Siewior wrote:
> On 2021-03-17 15:38:52 [+0100], Thomas Gleixner wrote:
>>        thread(irq_A)
>>          irq_handler(A)
>>            spin_lock(&foo->lock);
>>
>>        interrupt(irq_B)
>>          irq_handler(B)
>>            spin_lock(&foo->lock);
>
> It will not because both threads will wake_up(thread). It is an issue if
>  - &foo->lock is shared between an hrtimer and a threaded IRQ
>  - &foo->lock is shared between a non-threaded and a threaded IRQ
>  - &foo->lock is shared between a printk() in hardirq context and a
>    threaded IRQ, as I learned today.

That's the point and it's entirely clear from the above: A is thread
context and B is hard interrupt context, and if the lock is shared then
it's busted. Otherwise we would not have this discussion at all.
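For illustration, a minimal sketch of the affected driver pattern (all names
hypothetical, not taken from a real driver): a force-threaded handler takes a
plain spin_lock() while another handler that still runs in hard interrupt
context (a device sharing the lock, an hrtimer callback, or printk() from
hardirq context) takes the same lock on the same CPU.

#include <linux/interrupt.h>
#include <linux/spinlock.h>

struct foo {				/* hypothetical shared driver state */
	spinlock_t lock;
};

static irqreturn_t irq_a_handler(int irq, void *dev_id)	/* force threaded */
{
	struct foo *foo = dev_id;

	spin_lock(&foo->lock);		/* only BH is disabled here before the fix */
	/* ... update shared device state ... */
	spin_unlock(&foo->lock);
	return IRQ_HANDLED;
}

static irqreturn_t irq_b_handler(int irq, void *dev_id)	/* hard irq context */
{
	struct foo *foo = dev_id;

	/*
	 * If this interrupt fires on the CPU which is inside the critical
	 * section above, it spins forever: the thread cannot run until the
	 * hard interrupt returns, and the hard interrupt waits for the lock.
	 */
	spin_lock(&foo->lock);
	/* ... */
	spin_unlock(&foo->lock);
	return IRQ_HANDLED;
}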
[patch 1/1] genirq: Disable interrupts for force threaded handlers
With interrupt force threading all device interrupt handlers are invoked
from kernel threads. Contrary to hard interrupt context the invocation only
disables bottom halves, but not interrupts. This was an oversight back then
because any code like this will have an issue:

     thread(irq_A)
       irq_handler(A)
         spin_lock(&foo->lock);

     interrupt(irq_B)
       irq_handler(B)
         spin_lock(&foo->lock);

This has been triggered with networking (NAPI vs. hrtimers) and console
drivers where printk() happens from an interrupt which interrupted the
force threaded handler.

Now people have noticed and started to change the spin_lock() in the
handler to spin_lock_irqsave(), which affects performance, or to add
IRQF_NO_THREAD to the interrupt request, which in turn breaks RT.

Fix the root cause and not the symptom and disable interrupts before
invoking the force threaded handler which preserves the regular semantics
and the usefulness of the interrupt force threading as a general debugging
tool.

For !RT this is not changing much, except that during the execution of the
threaded handler interrupts are delayed until the handler returns. Vs.
scheduling and softirq processing there is no difference.

For RT kernels there is no issue.

Fixes: 8d32a307e4fa ("genirq: Provide forced interrupt threading")
Reported-by: Johan Hovold
Signed-off-by: Thomas Gleixner
Cc: Eric Dumazet
Cc: Sebastian Andrzej Siewior
Cc: netdev
Cc: "David S. Miller"
Cc: Krzysztof Kozlowski
Cc: Greg Kroah-Hartman
Cc: Andy Shevchenko
Cc: Peter Zijlstra
Cc: linux-ser...@vger.kernel.org
Cc: netdev
---
 kernel/irq/manage.c |    4
 1 file changed, 4 insertions(+)

--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1142,11 +1142,15 @@ irq_forced_thread_fn(struct irq_desc *de
 	irqreturn_t ret;
 
 	local_bh_disable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_disable();
 	ret = action->thread_fn(action->irq, action->dev_id);
 	if (ret == IRQ_HANDLED)
 		atomic_inc(&desc->threads_handled);
 
 	irq_finalize_oneshot(desc, action);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_enable();
 	local_bh_enable();
 	return ret;
 }
Re: [PATCH net 2/2] net: gro: Let the timeout timer expire in softirq context with `threadirqs'
On Thu, Apr 16 2020 at 15:59, Sebastian Andrzej Siewior wrote:
> any comments from the timer department?

Yes.

> On 2019-11-27 18:37:19 [+0100], To Eric Dumazet wrote:
>> On 2019-11-27 09:11:40 [-0800], Eric Dumazet wrote:
>> > Resent in non HTML mode :/
>> don't worry, mutt handles both :)
>>
>> > Long story short, why hrtimer are not by default using threaded mode
>> > in threadirqs mode ?
>>
>> Because it is only documented to thread only interrupts. Not sure if we
>> want to change this.
>> In RT we expire most of the hrtimers in softirq context for other
>> reasons. A subset of them still expire in hardirq context.
>>
>> > Idea of having some (but not all of them) hard irq handlers' now being
>> > run from BH mode,
>> > is rather scary.
>>
>> As I explained in my previous email: All IRQ-handlers fire in
>> threaded-mode if enabled. Only the hrtimer is not affected by this
>> change.
>>
>> > Also, hrtimers got the SOFT thing only in 4.16, while the GRO patch
>> > went in linux-3.19
>> >
>> > What would be the plan for stable trees ?
>> No idea yet. We could let __napi_schedule_irqoff() behave like
>> __napi_schedule().

It's not really a timer department problem. It's an interrupt problem.

With force threaded interrupts we don't call the handler with interrupts
disabled. What sounded like a good idea long ago is actually bad.

See https://lore.kernel.org/r/87eegdzzez@nanos.tec.linutronix.de

Any leftover issues on a RT kernel are a different story, but for !RT this
is the proper fix.

I'll spin up a proper patch and tag it for stable...

Thanks,

        tglx
Re: [patch 07/14] tasklets: Prevent tasklet_unlock_spin_wait() deadlock on RT
On Tue, Mar 09 2021 at 16:21, Sebastian Andrzej Siewior wrote:
> On 2021-03-09 16:00:37 [+0100], To Thomas Gleixner wrote:
>> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
>> index 07c7329d21aa7..1c14ccd351091 100644
>> --- a/include/linux/interrupt.h
>> +++ b/include/linux/interrupt.h
>> @@ -663,15 +663,6 @@ static inline int tasklet_trylock(struct tasklet_struct *t)
>>  void tasklet_unlock(struct tasklet_struct *t);
>>  void tasklet_unlock_wait(struct tasklet_struct *t);
>>
>> -/*
>> - * Do not use in new code. Waiting for tasklets from atomic contexts is
>> - * error prone and should be avoided.
>> - */
>> -static inline void tasklet_unlock_spin_wait(struct tasklet_struct *t)
>> -{
>> -	while (test_bit(TASKLET_STATE_RUN, &t->state))
>> -		cpu_relax();
>> -}
>
> Look at that. The forward declaration for tasklet_unlock_spin_wait()
> should have remained.

Sorry for that. No idea how I managed to mess that up and fail to notice.

Brown paperbags to the rescue.

Thanks,

        tglx
[patch 06/14] tasklets: Replace spin wait in tasklet_kill()
From: Peter Zijlstra tasklet_kill() spin waits for TASKLET_STATE_SCHED to be cleared invoking yield() from inside the loop. yield() is an ill defined mechanism and the result might still be wasting CPU cycles in a tight loop which is especially painful in a guest when the CPU running the tasklet is scheduled out. tasklet_kill() is used in teardown paths and not performance critical at all. Replace the spin wait with wait_var_event(). Signed-off-by: Peter Zijlstra Signed-off-by: Thomas Gleixner --- kernel/softirq.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -532,6 +532,16 @@ void __tasklet_hi_schedule(struct taskle } EXPORT_SYMBOL(__tasklet_hi_schedule); +static inline bool tasklet_clear_sched(struct tasklet_struct *t) +{ + if (test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) { + wake_up_var(&t->state); + return true; + } + + return false; +} + static void tasklet_action_common(struct softirq_action *a, struct tasklet_head *tl_head, unsigned int softirq_nr) @@ -551,8 +561,7 @@ static void tasklet_action_common(struct if (tasklet_trylock(t)) { if (!atomic_read(&t->count)) { - if (!test_and_clear_bit(TASKLET_STATE_SCHED, - &t->state)) + if (!tasklet_clear_sched(t)) BUG(); if (t->use_callback) t->callback(t); @@ -612,13 +621,11 @@ void tasklet_kill(struct tasklet_struct if (in_interrupt()) pr_notice("Attempt to kill tasklet from interrupt\n"); - while (test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) { - do { - yield(); - } while (test_bit(TASKLET_STATE_SCHED, &t->state)); - } + while (test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) + wait_var_event(&t->state, !test_bit(TASKLET_STATE_SCHED, &t->state)); + tasklet_unlock_wait(t); - clear_bit(TASKLET_STATE_SCHED, &t->state); + tasklet_clear_sched(t); } EXPORT_SYMBOL(tasklet_kill);
[patch 14/14] tasklets: Switch tasklet_disable() to the sleep wait variant
-- NOT FOR IMMEDIATE MERGING --

Now that all users of tasklet_disable() are invoked from sleepable context,
convert it to use tasklet_unlock_wait() which might sleep.

Signed-off-by: Thomas Gleixner
---
 include/linux/interrupt.h |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -716,8 +716,7 @@ static inline void tasklet_disable_in_at
 
 static inline void tasklet_disable(struct tasklet_struct *t)
 {
 	tasklet_disable_nosync(t);
-	/* Spin wait until all atomic users are converted */
-	tasklet_unlock_spin_wait(t);
+	tasklet_unlock_wait(t);
 	smp_mb();
 }
[patch 12/14] PCI: hv: Use tasklet_disable_in_atomic()
From: Sebastian Andrzej Siewior

The hv_compose_msi_msg() callback in irq_chip::irq_compose_msi_msg is
invoked via irq_chip_compose_msi_msg(), which itself is always invoked from
atomic contexts from the guts of the interrupt core code.

There is no way to change this w/o rewriting the whole driver, so use
tasklet_disable_in_atomic() which allows making tasklet_disable() sleepable
once the remaining atomic users are addressed.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
Cc: "K. Y. Srinivasan"
Cc: Haiyang Zhang
Cc: Stephen Hemminger
Cc: Wei Liu
Cc: Lorenzo Pieralisi
Cc: Rob Herring
Cc: Bjorn Helgaas
Cc: linux-hyp...@vger.kernel.org
Cc: linux-...@vger.kernel.org
---
 drivers/pci/controller/pci-hyperv.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1458,7 +1458,7 @@ static void hv_compose_msi_msg(struct ir
 	 * Prevents hv_pci_onchannelcallback() from running concurrently
 	 * in the tasklet.
 	 */
-	tasklet_disable(&channel->callback_event);
+	tasklet_disable_in_atomic(&channel->callback_event);
 
 	/*
 	 * Since this function is called with IRQ locks held, can't
[patch 13/14] firewire: ohci: Use tasklet_disable_in_atomic() where required
From: Sebastian Andrzej Siewior

tasklet_disable() is invoked in several places. Some of them are in atomic
context which prevents a conversion of tasklet_disable() to a sleepable
function.

The atomic callchains are:

 ar_context_tasklet()
   ohci_cancel_packet()
     tasklet_disable()

 ...
   ohci_flush_iso_completions()
     tasklet_disable()

The invocation of tasklet_disable() from at_context_flush() is always in
preemptible context.

Use tasklet_disable_in_atomic() for the two invocations in
ohci_cancel_packet() and ohci_flush_iso_completions().

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
Cc: Stefan Richter
Cc: linux1394-de...@lists.sourceforge.net
---
 drivers/firewire/ohci.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/drivers/firewire/ohci.c
+++ b/drivers/firewire/ohci.c
@@ -2545,7 +2545,7 @@ static int ohci_cancel_packet(struct fw_
 	struct driver_data *driver_data = packet->driver_data;
 	int ret = -ENOENT;
 
-	tasklet_disable(&ctx->tasklet);
+	tasklet_disable_in_atomic(&ctx->tasklet);
 
 	if (packet->ack != 0)
 		goto out;
@@ -3465,7 +3465,7 @@ static int ohci_flush_iso_completions(st
 	struct iso_context *ctx = container_of(base, struct iso_context, base);
 	int ret = 0;
 
-	tasklet_disable(&ctx->context.tasklet);
+	tasklet_disable_in_atomic(&ctx->context.tasklet);
 
 	if (!test_and_set_bit_lock(0, &ctx->flushing_completions)) {
 		context_tasklet((unsigned long)&ctx->context);
[patch 11/14] atm: eni: Use tasklet_disable_in_atomic() in the send() callback
From: Sebastian Andrzej Siewior

The atmdev_ops::send callback which calls tasklet_disable() is invoked with
bottom halves disabled from net_device_ops::ndo_start_xmit(). All other
invocations of tasklet_disable() in this driver happen in preemptible
context.

Change the send() call to use tasklet_disable_in_atomic() which allows
tasklet_disable() to be made sleepable once the remaining atomic context
usage sites are cleaned up.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
Cc: Chas Williams <3ch...@gmail.com>
Cc: linux-atm-gene...@lists.sourceforge.net
Cc: netdev@vger.kernel.org
---
 drivers/atm/eni.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/atm/eni.c
+++ b/drivers/atm/eni.c
@@ -2054,7 +2054,7 @@ static int eni_send(struct atm_vcc *vcc,
 	}
 	submitted++;
 	ATM_SKB(skb)->vcc = vcc;
-	tasklet_disable(&ENI_DEV(vcc->dev)->task);
+	tasklet_disable_in_atomic(&ENI_DEV(vcc->dev)->task);
 	res = do_tx(skb);
 	tasklet_enable(&ENI_DEV(vcc->dev)->task);
 	if (res == enq_ok) return 0;
[patch 10/14] ath9k: Use tasklet_disable_in_atomic()
From: Sebastian Andrzej Siewior

All callers of ath9k_beacon_ensure_primary_slot() are preemptible /
acquire a mutex except for this callchain:

  spin_lock_bh(&sc->sc_pcu_lock);
  ath_complete_reset()
  -> ath9k_calculate_summary_state()
     -> ath9k_beacon_ensure_primary_slot()

It's unclear how that can be disentangled, so use
tasklet_disable_in_atomic() for now. This allows tasklet_disable() to
become sleepable once the remaining atomic users are cleaned up.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
Cc: ath9k-de...@qca.qualcomm.com
Cc: Kalle Valo
Cc: "David S. Miller"
Cc: Jakub Kicinski
Cc: linux-wirel...@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 drivers/net/wireless/ath/ath9k/beacon.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/wireless/ath/ath9k/beacon.c
+++ b/drivers/net/wireless/ath/ath9k/beacon.c
@@ -251,7 +251,7 @@ void ath9k_beacon_ensure_primary_slot(st
 	int first_slot = ATH_BCBUF;
 	int slot;
 
-	tasklet_disable(&sc->bcon_tasklet);
+	tasklet_disable_in_atomic(&sc->bcon_tasklet);
 
 	/* Find first taken slot. */
 	for (slot = 0; slot < ATH_BCBUF; slot++) {
[patch 08/14] net: jme: Replace link-change tasklet with work
From: Sebastian Andrzej Siewior The link change tasklet disables the tasklets for tx/rx processing while upating hw parameters and then enables the tasklets again. This update can also be pushed into a workqueue where it can be performed in preemptible context. This allows tasklet_disable() to become sleeping. Replace the linkch_task tasklet with a work. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner --- drivers/net/ethernet/jme.c | 10 +- drivers/net/ethernet/jme.h |2 +- 2 files changed, 6 insertions(+), 6 deletions(-) --- a/drivers/net/ethernet/jme.c +++ b/drivers/net/ethernet/jme.c @@ -1265,9 +1265,9 @@ jme_stop_shutdown_timer(struct jme_adapt jwrite32f(jme, JME_APMC, apmc); } -static void jme_link_change_tasklet(struct tasklet_struct *t) +static void jme_link_change_work(struct work_struct *work) { - struct jme_adapter *jme = from_tasklet(jme, t, linkch_task); + struct jme_adapter *jme = container_of(work, struct jme_adapter, linkch_task); struct net_device *netdev = jme->dev; int rc; @@ -1510,7 +1510,7 @@ jme_intr_msi(struct jme_adapter *jme, u3 * all other events are ignored */ jwrite32(jme, JME_IEVE, intrstat); - tasklet_schedule(&jme->linkch_task); + schedule_work(&jme->linkch_task); goto out_reenable; } @@ -1832,7 +1832,6 @@ jme_open(struct net_device *netdev) jme_clear_pm_disable_wol(jme); JME_NAPI_ENABLE(jme); - tasklet_setup(&jme->linkch_task, jme_link_change_tasklet); tasklet_setup(&jme->txclean_task, jme_tx_clean_tasklet); tasklet_setup(&jme->rxclean_task, jme_rx_clean_tasklet); tasklet_setup(&jme->rxempty_task, jme_rx_empty_tasklet); @@ -1920,7 +1919,7 @@ jme_close(struct net_device *netdev) JME_NAPI_DISABLE(jme); - tasklet_kill(&jme->linkch_task); + cancel_work_sync(&jme->linkch_task); tasklet_kill(&jme->txclean_task); tasklet_kill(&jme->rxclean_task); tasklet_kill(&jme->rxempty_task); @@ -3035,6 +3034,7 @@ jme_init_one(struct pci_dev *pdev, atomic_set(&jme->rx_empty, 1); tasklet_setup(&jme->pcc_task, jme_pcc_tasklet); + INIT_WORK(&jme->linkch_task, jme_link_change_work); jme->dpi.cur = PCC_P1; jme->reg_ghc = 0; --- a/drivers/net/ethernet/jme.h +++ b/drivers/net/ethernet/jme.h @@ -411,7 +411,7 @@ struct jme_adapter { struct tasklet_struct rxempty_task; struct tasklet_struct rxclean_task; struct tasklet_struct txclean_task; - struct tasklet_struct linkch_task; + struct work_struct linkch_task; struct tasklet_struct pcc_task; unsigned long flags; u32 reg_txcs;
[patch 09/14] net: sundance: Use tasklet_disable_in_atomic().
From: Sebastian Andrzej Siewior

tasklet_disable() is used in the timer callback. This might be
disentangled, but without access to the hardware that's a bit risky.

Replace it with tasklet_disable_in_atomic() so tasklet_disable() can be
changed to a sleep wait once all remaining atomic users are converted.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
Cc: Denis Kirjanov
Cc: "David S. Miller"
Cc: Jakub Kicinski
Cc: netdev@vger.kernel.org
---
 drivers/net/ethernet/dlink/sundance.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/ethernet/dlink/sundance.c
+++ b/drivers/net/ethernet/dlink/sundance.c
@@ -963,7 +963,7 @@ static void tx_timeout(struct net_device
 	unsigned long flag;
 
 	netif_stop_queue(dev);
-	tasklet_disable(&np->tx_tasklet);
+	tasklet_disable_in_atomic(&np->tx_tasklet);
 	iowrite16(0, ioaddr + IntrEnable);
 	printk(KERN_WARNING "%s: Transmit timed out, TxStatus %2.2x "
 		   "TxFrameId %2.2x,"
[patch 07/14] tasklets: Prevent tasklet_unlock_spin_wait() deadlock on RT
tasklet_unlock_spin_wait() spin waits for the TASKLET_STATE_SCHED bit in the tasklet state to be cleared. This works on !RT nicely because the corresponding execution can only happen on a different CPU. On RT softirq processing is preemptible, therefore a task preempting the softirq processing thread can spin forever. Prevent this by invoking local_bh_disable()/enable() inside the loop. In case that the softirq processing thread was preempted by the current task, current will block on the local lock which yields the CPU to the preempted softirq processing thread. If the tasklet is processed on a different CPU then the local_bh_disable()/enable() pair is just a waste of processor cycles. Signed-off-by: Thomas Gleixner Tested-by: Sebastian Andrzej Siewior --- include/linux/interrupt.h |2 +- kernel/softirq.c | 28 +++- 2 files changed, 28 insertions(+), 2 deletions(-) --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -658,7 +658,7 @@ enum TASKLET_STATE_RUN /* Tasklet is running (SMP only) */ }; -#ifdef CONFIG_SMP +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) static inline int tasklet_trylock(struct tasklet_struct *t) { return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state); --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -616,6 +616,32 @@ void tasklet_init(struct tasklet_struct } EXPORT_SYMBOL(tasklet_init); +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) +/* + * Do not use in new code. There is no real reason to invoke this from + * atomic contexts. + */ +void tasklet_unlock_spin_wait(struct tasklet_struct *t) +{ + while (test_bit(TASKLET_STATE_RUN, &(t)->state)) { + if (IS_ENABLED(CONFIG_PREEMPT_RT)) { + /* +* Prevent a live lock when current preempted soft +* interrupt processing or prevents ksoftirqd from +* running. If the tasklet runs on a different CPU +* then this has no effect other than doing the BH +* disable/enable dance for nothing. +*/ + local_bh_disable(); + local_bh_enable(); + } else { + cpu_relax(); + } + } +} +EXPORT_SYMBOL(tasklet_unlock_spin_wait); +#endif + void tasklet_kill(struct tasklet_struct *t) { if (in_interrupt()) @@ -629,7 +655,7 @@ void tasklet_kill(struct tasklet_struct } EXPORT_SYMBOL(tasklet_kill); -#ifdef CONFIG_SMP +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) void tasklet_unlock(struct tasklet_struct *t) { smp_mb__before_atomic();
[patch 05/14] tasklets: Replace spin wait in tasklet_unlock_wait()
From: Peter Zijlstra tasklet_unlock_wait() spin waits for TASKLET_STATE_RUN to be cleared. This is wasting CPU cycles in a tight loop which is especially painful in a guest when the CPU running the tasklet is scheduled out. tasklet_unlock_wait() is invoked from tasklet_kill() which is used in teardown paths and not performance critical at all. Replace the spin wait with wait_var_event(). There are no users of tasklet_unlock_wait() which are invoked from atomic contexts. The usage in tasklet_disable() has been replaced temporarily with the spin waiting variant until the atomic users are fixed up and will be converted to the sleep wait variant later. Signed-off-by: Peter Zijlstra Signed-off-by: Thomas Gleixner --- include/linux/interrupt.h | 13 ++--- kernel/softirq.c | 18 ++ 2 files changed, 20 insertions(+), 11 deletions(-) --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -664,17 +664,8 @@ static inline int tasklet_trylock(struct return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state); } -static inline void tasklet_unlock(struct tasklet_struct *t) -{ - smp_mb__before_atomic(); - clear_bit(TASKLET_STATE_RUN, &(t)->state); -} - -static inline void tasklet_unlock_wait(struct tasklet_struct *t) -{ - while (test_bit(TASKLET_STATE_RUN, &t->state)) - cpu_relax(); -} +void tasklet_unlock(struct tasklet_struct *t); +void tasklet_unlock_wait(struct tasklet_struct *t); /* * Do not use in new code. Waiting for tasklets from atomic contexts is --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -25,6 +25,7 @@ #include #include #include +#include #include @@ -621,6 +622,23 @@ void tasklet_kill(struct tasklet_struct } EXPORT_SYMBOL(tasklet_kill); +#ifdef CONFIG_SMP +void tasklet_unlock(struct tasklet_struct *t) +{ + smp_mb__before_atomic(); + clear_bit(TASKLET_STATE_RUN, &t->state); + smp_mb__after_atomic(); + wake_up_var(&t->state); +} +EXPORT_SYMBOL_GPL(tasklet_unlock); + +void tasklet_unlock_wait(struct tasklet_struct *t) +{ + wait_var_event(&t->state, !test_bit(TASKLET_STATE_RUN, &t->state)); +} +EXPORT_SYMBOL_GPL(tasklet_unlock_wait); +#endif + void __init softirq_init(void) { int cpu;
[patch 04/14] tasklets: Use spin wait in tasklet_disable() temporarily
To ease the transition use spin waiting in tasklet_disable() until all
usage sites from atomic context have been cleaned up.

Signed-off-by: Thomas Gleixner
---
 include/linux/interrupt.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -728,7 +728,8 @@ static inline void tasklet_disable_in_at
 
 static inline void tasklet_disable(struct tasklet_struct *t)
 {
 	tasklet_disable_nosync(t);
-	tasklet_unlock_wait(t);
+	/* Spin wait until all atomic users are converted */
+	tasklet_unlock_spin_wait(t);
 	smp_mb();
 }
[patch 01/14] tasklets: Replace barrier() with cpu_relax() in tasklet_unlock_wait()
A barrier() in a tight loop which waits for something to happen on a remote
CPU is a pointless exercise. Replace it with cpu_relax() which allows HT
siblings to make progress.

Signed-off-by: Thomas Gleixner
Tested-by: Sebastian Andrzej Siewior
---
 include/linux/interrupt.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -672,7 +672,8 @@ static inline void tasklet_unlock(struct
 
 static inline void tasklet_unlock_wait(struct tasklet_struct *t)
 {
-	while (test_bit(TASKLET_STATE_RUN, &(t)->state)) { barrier(); }
+	while (test_bit(TASKLET_STATE_RUN, &t->state))
+		cpu_relax();
 }
 #else
 #define tasklet_trylock(t) 1
[patch 00/14] tasklets: Replace the spin wait loops and make it RT safe
This is a follow up to the review comments of the series which makes
softirq processing PREEMPT_RT safe:

 https://lore.kernel.org/r/20201207114743.gk3...@hirez.programming.kicks-ass.net

Peter suggested to replace the spin waiting in tasklet_disable() and
tasklet_kill() with wait_event(). This also gets rid of the ill defined
sched_yield() in tasklet_kill().

Analyzing all usage sites of tasklet_disable() and tasklet_unlock_wait() we
found that most of them are safe to be converted to a sleeping wait. Only a
few instances invoke tasklet_disable() from atomic context. A few bugs
which have been found in the course of this analysis have already been
addressed separately.

The following series takes this approach:

 1) Provide a variant of tasklet_disable() which can be invoked from
    atomic contexts

 2) Convert the usage sites which cannot be easily changed to a
    sleepable wait to use this new function

 3) Replace the spin waits in tasklet_disable() and tasklet_kill() with
    sleepable variants

If this is agreed on then the merging can either be done in bulk or the
first 4 patches could be applied on top of rc2 and tagged for consumption
in the relevant subsystem trees (networking, pci, firewire). In this case
the last patch, which changes the implementation of tasklet_disable(), has
to be postponed until all other changes have reached mainline.

The series is also available from git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git tasklet-2021-03-09

Thanks,

        tglx
[patch 03/14] tasklets: Provide tasklet_disable_in_atomic()
Replacing the spin wait loops in tasklet_unlock_wait() with
wait_var_event() is not possible as a handful of tasklet_disable()
invocations are happening in atomic context. All other invocations are in
teardown paths which can sleep.

Provide tasklet_disable_in_atomic() and tasklet_unlock_spin_wait() to
convert the few atomic use cases over, which allows changing
tasklet_disable() and tasklet_unlock_wait() in a later step.

Signed-off-by: Thomas Gleixner
---
 include/linux/interrupt.h |   22 ++
 1 file changed, 22 insertions(+)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -675,10 +675,21 @@ static inline void tasklet_unlock_wait(s
 	while (test_bit(TASKLET_STATE_RUN, &t->state))
 		cpu_relax();
 }
+
+/*
+ * Do not use in new code. Waiting for tasklets from atomic contexts is
+ * error prone and should be avoided.
+ */
+static inline void tasklet_unlock_spin_wait(struct tasklet_struct *t)
+{
+	while (test_bit(TASKLET_STATE_RUN, &t->state))
+		cpu_relax();
+}
 #else
 static inline int tasklet_trylock(struct tasklet_struct *t) { return 1; }
 static inline void tasklet_unlock(struct tasklet_struct *t) { }
 static inline void tasklet_unlock_wait(struct tasklet_struct *t) { }
+static inline void tasklet_unlock_spin_wait(struct tasklet_struct *t) { }
 #endif
 
 extern void __tasklet_schedule(struct tasklet_struct *t);
@@ -703,6 +714,17 @@ static inline void tasklet_disable_nosyn
 	smp_mb__after_atomic();
 }
 
+/*
+ * Do not use in new code. Disabling tasklets from atomic contexts is
+ * error prone and should be avoided.
+ */
+static inline void tasklet_disable_in_atomic(struct tasklet_struct *t)
+{
+	tasklet_disable_nosync(t);
+	tasklet_unlock_spin_wait(t);
+	smp_mb();
+}
+
 static inline void tasklet_disable(struct tasklet_struct *t)
 {
 	tasklet_disable_nosync(t);
[patch 02/14] tasklets: Use static inlines for stub implementations
Inlines exist for a reason.

Signed-off-by: Thomas Gleixner
Tested-by: Sebastian Andrzej Siewior
---
 include/linux/interrupt.h |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -676,9 +676,9 @@ static inline void tasklet_unlock_wait(s
 		cpu_relax();
 }
 #else
-#define tasklet_trylock(t) 1
-#define tasklet_unlock_wait(t) do { } while (0)
-#define tasklet_unlock(t) do { } while (0)
+static inline int tasklet_trylock(struct tasklet_struct *t) { return 1; }
+static inline void tasklet_unlock(struct tasklet_struct *t) { }
+static inline void tasklet_unlock_wait(struct tasklet_struct *t) { }
 #endif
 
 extern void __tasklet_schedule(struct tasklet_struct *t);
Re: [patch 02/30] genirq: Move status flag checks to core
On Sun, Dec 27 2020 at 11:20, Guenter Roeck wrote:
> On Thu, Dec 10, 2020 at 08:25:38PM +0100, Thomas Gleixner wrote:
> Yes, but that means that irq_check_status_bit() may be called from modules,
> but it is not exported, resulting in build errors such as the following.
>
> arm64:allmodconfig:
>
> ERROR: modpost: "irq_check_status_bit" [drivers/perf/arm_spe_pmu.ko]
> undefined!

Duh. Yes, that lacks an export obviously.

Thanks,

        tglx
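The obvious follow-up is to add the missing export next to the definition in
kernel/irq/manage.c; a sketch against the code introduced in patch 02/30 (the
actual fix may differ in detail):

bool irq_check_status_bit(unsigned int irq, unsigned int bitmask)
{
	struct irq_desc *desc;
	bool res = false;

	rcu_read_lock();
	desc = irq_to_desc(irq);
	if (desc)
		res = !!(desc->status_use_accessors & bitmask);
	rcu_read_unlock();
	return res;
}
EXPORT_SYMBOL_GPL(irq_check_status_bit);	/* the missing export */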
Re: [stabe-rc 5.9 ] sched: core.c:7270 Illegal context switch in RCU-bh read-side critical section!
On Wed, Dec 16 2020 at 15:55, Naresh Kamboju wrote:
> On Tue, 15 Dec 2020 at 23:52, Jakub Kicinski wrote:
>> > Or you could place checks for being in a BH-disable further up in
>> > the code. Or build with CONFIG_DEBUG_INFO=y to allow more precise
>> > interpretation of this stack trace.
>
> I will try to reproduce this warning with DEBUG_INFO=y enabled kernel and
> get back to you with a better crash log.
>
>> My money would be on the option that whatever run on this workqueue
>> before forgot to re-enable BH, but we already have a check for that...
>> Naresh, do you have the full log? Is there nothing like "BUG: workqueue
>> leaked lock" above the splat?

No, because it's in the middle of the work. The workqueue bug triggers when
the work has finished.

So cleanup_net() does:

	synchronize_rcu();	<- might sleep. So up to here it should be fine.

	list_for_each_entry_continue_reverse(ops, &pernet_list, list)
		ops_exit_list(ops, &net_exit_list);

ops_exit_list() is called for each ops which then either invokes
ops->exit() or ops->exit_batch().

So one of those callbacks fails to reenable BH. Adding a check for
!local_bh_disabled() after each invocation of ops->exit() and
ops->exit_batch() should be able to identify the buggy callback.

Thanks,

        tglx
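For illustration, a debug-only sketch of the check suggested above (not an
actual patch; it assumes the current shape of ops_exit_list() in
net/core/net_namespace.c and uses softirq_count(), which is non-zero while BH
is disabled in process context, to detect the offending callback):

static void ops_exit_list(const struct pernet_operations *ops,
			  struct list_head *net_exit_list)
{
	struct net *net;

	if (ops->exit) {
		list_for_each_entry(net, net_exit_list, exit_list) {
			ops->exit(net);
			/* Warn if the callback returned with BH disabled */
			WARN_ONCE(softirq_count(), "%ps left BH disabled\n",
				  ops->exit);
		}
	}
	if (ops->exit_batch) {
		ops->exit_batch(net_exit_list);
		WARN_ONCE(softirq_count(), "%ps left BH disabled\n",
			  ops->exit_batch);
	}
}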
Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts
Andrew,

On Fri, Dec 11 2020 at 22:21, Andrew Cooper wrote:
> On 11/12/2020 21:27, Thomas Gleixner wrote:
>> It's not any different from the hardware example at least not as far as
>> I understood the code.
>
> Xen's event channels do have a couple of quirks.

Why am I not surprised?

> Binding an event channel always results in one spurious event being
> delivered. This is to cover notifications which can get lost during the
> bidirectional setup, or re-setups in certain configurations.
>
> Binding an interdomain or pirq event channel always defaults to vCPU0.
> There is no way to atomically set the affinity while binding. I believe
> the API predates SMP guest support in Xen, and noone has fixed it up
> since.

That's fine. I'm not changing that. What I'm changing is the unwanted and
unnecessary overwriting of the actual affinity mask.

We have a similar issue on real hardware where we can only target _one_ CPU
and not all CPUs in the affinity mask. So we still can preserve the (user)
requested mask and just affine it to one CPU, which is reflected in the
effective affinity mask.

This is the right thing to do for two reasons:

  1) It allows proper interrupt distribution

  2) It does not break (user) requested affinity when the effective target
     CPU goes offline and the affinity mask still contains online CPUs. If
     you overwrite it you lose track of the requested broader mask.

> As a consequence, the guest will observe the event raised on vCPU0 as
> part of setting up the event, even if it attempts to set a different
> affinity immediately afterwards. A little bit of care needs to be taken
> when binding an event channel on vCPUs other than 0, to ensure that the
> callback is safe with respect to any remaining state needing
> initialisation.

That's preserved for all non-percpu interrupts. The percpu variant of VIRQ
and IPIs did binding to vCPU != 0 already before this change.

> Beyond this, there is nothing magic I'm aware of.
>
> We have seen soft lockups before in certain scenarios, simply due to the
> quantity of events hitting vCPU0 before irqbalance gets around to
> spreading the load. This is why there is an attempt to round-robin the
> userspace event channel affinities by default, but I still don't see why
> this would need custom affinity logic itself.

Just because the previous attempt makes no sense for the reasons I outlined
in the changelog. So now with this new spreading mechanics you get the
distribution for all cases:

  1) Post setup using and respecting the default affinity mask which can
     be set as a kernel commandline parameter.

  2) Runtime (user) requested affinity change with a mask which contains
     more than one vCPU. The previous logic always chose the first one in
     the mask.

     So assume userspace affines 4 irqs to CPUs 0-3 and 4 irqs to CPUs 4-7,
     then 4 irqs end up on CPU0 and 4 on CPU4.

     The new algorithm, which is similar to what we have on x86 (minus the
     vector space limitation), picks the CPU which has the least number of
     channels affine to it at that moment.

     If e.g. all 8 CPUs have the same number of vectors before that change,
     then in the example above the first 4 are spread to CPU0-3 and the
     second 4 to CPU4-7.

Thanks,

        tglx
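A rough sketch of that selection logic, for reference (hypothetical names and
per-CPU counter; the real Xen implementation additionally serializes updates
of the counter): pick the online CPU from the allowed mask with the fewest
event channels currently bound to it.

#include <linux/cpumask.h>
#include <linux/percpu.h>

/* Hypothetical per-CPU count of event channels bound to each CPU */
static DEFINE_PER_CPU(unsigned int, channels_on_cpu);

static unsigned int select_target_cpu(const struct cpumask *mask)
{
	unsigned int cpu, best_cpu = UINT_MAX, minch = UINT_MAX;

	for_each_cpu_and(cpu, mask, cpu_online_mask) {
		if (per_cpu(channels_on_cpu, cpu) < minch) {
			minch = per_cpu(channels_on_cpu, cpu);
			best_cpu = cpu;
		}
	}

	/* Fall back to CPU0 if the mask contains no online CPUs */
	if (best_cpu == UINT_MAX)
		return 0;

	return best_cpu;
}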
Re: [patch 03/30] genirq: Move irq_set_lockdep_class() to core
On Fri, Dec 11 2020 at 22:08, Thomas Gleixner wrote:
> On Fri, Dec 11 2020 at 19:53, Andy Shevchenko wrote:
>
>> On Thu, Dec 10, 2020 at 10:14 PM Thomas Gleixner wrote:
>>>
>>> irq_set_lockdep_class() is used from modules and requires irq_to_desc() to
>>> be exported. Move it into the core code which lifts another requirement for
>>> the export.
>>
>> ...
>>
>>> +	if (IS_ENABLED(CONFIG_LOCKDEP))
>>> +		__irq_set_lockdep_class(irq, lock_class, request_class);
>
> You are right. Let me fix that.

No. I have to correct myself. You're wrong.

The inline is evaluated in the compilation units which include that header,
and because the function declaration is unconditional it is happy. Now the
optimizer stage makes the whole thing a NOOP if CONFIG_LOCKDEP=n and
thereby drops the reference to the function, which makes it not required
for linking.

So in the file where the function is implemented:

 #ifdef CONFIG_LOCKDEP
 void __irq_set_lockdep_class()
 {
 }
 #endif

the whole block is either discarded because CONFIG_LOCKDEP is not defined,
or compiled if it is defined, which makes it available for the linker.

And in the latter case the optimizer keeps the call in the inline (it
optimizes the condition away because it's always true).

So in both cases the compiler and the linker are happy and everything works
as expected.

It would fail if the header file had the following:

 #ifdef CONFIG_LOCKDEP
 void __irq_set_lockdep_class();
 #endif

because then it would complain about the missing function prototype when it
evaluates the inline.

Thanks,

        tglx
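A self-contained sketch of the pattern being described, with made-up names
(CONFIG_FOO stands in for CONFIG_LOCKDEP; this is not the actual genirq code):

/* header.h */
void __do_thing(int arg);		/* unconditional declaration */

static inline void do_thing(int arg)
{
	/*
	 * IS_ENABLED() evaluates to a compile-time constant 0 or 1, so with
	 * CONFIG_FOO=n the call below is dead code: the optimizer removes it
	 * and no reference to __do_thing() survives into the object file.
	 */
	if (IS_ENABLED(CONFIG_FOO))
		__do_thing(arg);
}

/* thing.c */
#ifdef CONFIG_FOO
void __do_thing(int arg)
{
	/* only compiled and linked when CONFIG_FOO=y */
}
#endif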
Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts
On Fri, Dec 11 2020 at 09:29, boris ostrovsky wrote:
> On 12/11/20 7:37 AM, Thomas Gleixner wrote:
>> On Fri, Dec 11 2020 at 13:10, Jürgen Groß wrote:
>>> On 11.12.20 00:20, boris.ostrov...@oracle.com wrote:
>>>> On 12/10/20 2:26 PM, Thomas Gleixner wrote:
>>>>> Change the implementation so that the channel is bound to CPU0 at the XEN
>>>>> level and leave the affinity mask alone. At startup of the interrupt
>>>>> affinity will be assigned out of the affinity mask and the XEN binding will
>>>>> be updated.
>>>>
>>>> If that's the case then I wonder whether we need this call at all and
>>>> instead bind at startup time.
>>>
>>> After some discussion with Thomas on IRC and xen-devel archaeology the
>>> result is: this will be needed especially for systems running on a
>>> single vcpu (e.g. small guests), as the .irq_set_affinity() callback
>>> won't be called in this case when starting the irq.
>
> On UP are we not then going to end up with an empty affinity mask? Or
> are we guaranteed to have it set to 1 by interrupt generic code?

An UP kernel does not ever look at the affinity mask. The
chip::irq_set_affinity() callback is not invoked, so the mask is
irrelevant.

A SMP kernel on a UP machine sets CPU0 in the mask, so all is good.

> This is actually why I brought this up in the first place --- a
> potential mismatch between the affinity mask and Xen-specific data
> (e.g. info->cpu and then protocol-specific data in event channel
> code). Even if they are re-synchronized later, at startup time (for
> SMP).

Which is not a problem either. The affinity mask is only relevant for
setting the affinity, but it's not relevant for delivery and never can be.

> I don't see anything that would cause a problem right now but I worry
> that this inconsistency may come up at some point.

As long as the affinity mask does not become part of the event channel
magic this should never matter.

Look at it from hardware:

  interrupt is affine to CPU0

    CPU0 runs:

      set_affinity(CPU0 -> CPU1)
         local_irq_disable()

         --> interrupt is raised in hardware and pending on CPU0

         irq hardware is reconfigured to be affine to CPU1

         local_irq_enable()

         --> interrupt is handled on CPU0

      the next interrupt will be raised on CPU1

So info->cpu, which is registered via the hypercall, binds the 'hardware
delivery', and whenever the new affinity is written it is rebound to some
other CPU and the next interrupt is then raised on this other CPU.

It's not any different from the hardware example, at least not as far as I
understood the code.

Thanks,

        tglx
RE: [patch 14/30] drm/i915/pmu: Replace open coded kstat_irqs() copy
On Fri, Dec 11 2020 at 14:19, David Laight wrote:
> From: Thomas Gleixner
>> You can't catch that. If this really becomes an issue you need a
>> sequence counter around it.
>
> Or just two copies of the high word.
> Provided the accesses are sequenced:
> writer:
>	load high:low
>	add small_value,high:low
>	store high
>	store low
>	store high_copy
> reader:
>	load high_copy
>	load low
>	load high
>	if (high != high_copy)
>		low = 0;

And low = 0 is solving what?

You need to loop back and retry until it's consistent and then it's nothing
else than an open coded sequence count.

Thanks,

        tglx
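For reference, a minimal sketch of the sequence count variant using the
kernel's seqcount API (names are made up; this is not the i915 code):

#include <linux/seqlock.h>

static seqcount_t irq_count_seq = SEQCNT_ZERO(irq_count_seq);
static u64 irq_count;

/* writer (single writer, e.g. the interrupt handler) */
static void irq_count_add(u64 val)
{
	write_seqcount_begin(&irq_count_seq);
	irq_count += val;
	write_seqcount_end(&irq_count_seq);
}

/* reader: retry until no writer interfered with the read */
static u64 irq_count_read(void)
{
	unsigned int seq;
	u64 val;

	do {
		seq = read_seqcount_begin(&irq_count_seq);
		val = irq_count;
	} while (read_seqcount_retry(&irq_count_seq, seq));

	return val;
}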
Re: [patch 03/30] genirq: Move irq_set_lockdep_class() to core
On Fri, Dec 11 2020 at 19:53, Andy Shevchenko wrote:
> On Thu, Dec 10, 2020 at 10:14 PM Thomas Gleixner wrote:
>>
>> irq_set_lockdep_class() is used from modules and requires irq_to_desc() to
>> be exported. Move it into the core code which lifts another requirement for
>> the export.
>
> ...
>
>> +	if (IS_ENABLED(CONFIG_LOCKDEP))
>> +		__irq_set_lockdep_class(irq, lock_class, request_class);

You are right. Let me fix that.
Re: [patch 14/30] drm/i915/pmu: Replace open coded kstat_irqs() copy
On Fri, Dec 11 2020 at 10:13, Tvrtko Ursulin wrote:
> On 10/12/2020 19:25, Thomas Gleixner wrote:
>>
>> Aside of that the count is per interrupt line and therefore takes
>> interrupts from other devices into account which share the interrupt line
>> and are not handled by the graphics driver.
>>
>> Replace it with a pmu private count which only counts interrupts which
>> originate from the graphics card.
>>
>> To avoid atomics or heuristics of some sort make the counter field
>> 'unsigned long'. That limits the count to 4e9 on 32bit which is a lot and
>> postprocessing can easily deal with the occasional wraparound.
>
> After my failed hasty sketch from last night I had a different one which
> was kind of heuristics based (re-reading the upper dword and retrying if
> it changed on 32-bit).

The problem is that there will be two separate modifications for the low
and high word. There are several ways the compiler can translate this, but
the problem is the same for all of them:

     CPU 0                    CPU 1

     load  low
     load  high
     add   low, 1
     addc  high, 0
     store low
                              load high
 --> NMI
                              load low
                              load high and compare
     store high

You can't catch that. If this really becomes an issue you need a sequence
counter around it.

> But you are right - it is okay to at least start
> like this today and if later there is a need we can either do that or
> deal with wrap at PMU read time.

Right.

>> +/*
>> + * Interrupt statistic for PMU. Increments the counter only if the
>> + * interrupt originated from the the GPU so interrupts from a device which
>> + * shares the interrupt line are not accounted.
>> + */
>> +static inline void pmu_irq_stats(struct drm_i915_private *priv,
>
> We never use priv as a local name, it should be either i915 or
> dev_priv.

Sure, will fix.

>> +	/*
>> +	 * A clever compiler translates that into INC. A not so clever one
>> +	 * should at least prevent store tearing.
>> +	 */
>> +	WRITE_ONCE(priv->pmu.irq_count, priv->pmu.irq_count + 1);
>
> Curious, probably more educational for me - given x86_32 and x86_64, and
> the context of it getting called, what is the difference from just doing
> irq_count++?

Several reasons:

  1) The compiler can pretty much do what it wants with cnt++, including
     tearing and whatever.

     https://lwn.net/Articles/816850/ for the full set of insanities.

     Not really a problem here, but

  2) It's annotating the reader and the writer side and documenting that
     this is subject to concurrency

  3) It will prevent KCSAN from complaining about the data race, i.e.
     concurrent modification while reading.

Thanks,

        tglx

>> --- a/drivers/gpu/drm/i915/i915_pmu.c
>> +++ b/drivers/gpu/drm/i915/i915_pmu.c
>> @@ -423,22 +423,6 @@ static enum hrtimer_restart i915_sample(
>>  	return HRTIMER_RESTART;
>>  }
>
> In this file you can also drop the #include line.

Indeed.

Thanks,

        tglx
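To make the three points concrete, a stand-alone sketch (hypothetical names,
not the i915 patch itself) contrasting the plain increment with the annotated
accessors:

#include <linux/compiler.h>

static unsigned long irq_count;		/* written from the irq handler */

static void count_plain(void)
{
	irq_count++;		/* compiler is free to tear or split the store */
}

static void count_annotated(void)
{
	/* One store, documents the concurrent access, keeps KCSAN quiet */
	WRITE_ONCE(irq_count, irq_count + 1);
}

static unsigned long count_read(void)
{
	/* Reader side pairs with the WRITE_ONCE() above */
	return READ_ONCE(irq_count);
}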
Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts
On Fri, Dec 11 2020 at 13:10, Jürgen Groß wrote:
> On 11.12.20 00:20, boris.ostrov...@oracle.com wrote:
>>
>> On 12/10/20 2:26 PM, Thomas Gleixner wrote:
>>> All event channel setups bind the interrupt on CPU0 or the target CPU for
>>> percpu interrupts and overwrite the affinity mask with the corresponding
>>> cpumask. That does not make sense.
>>>
>>> The XEN implementation of irqchip::irq_set_affinity() already picks a
>>> single target CPU out of the affinity mask and the actual target is stored
>>> in the effective CPU mask, so destroying the user chosen affinity mask
>>> which might contain more than one CPU is wrong.
>>>
>>> Change the implementation so that the channel is bound to CPU0 at the XEN
>>> level and leave the affinity mask alone. At startup of the interrupt
>>> affinity will be assigned out of the affinity mask and the XEN binding will
>>> be updated.
>>
>> If that's the case then I wonder whether we need this call at all and
>> instead bind at startup time.
>
> After some discussion with Thomas on IRC and xen-devel archaeology the
> result is: this will be needed especially for systems running on a
> single vcpu (e.g. small guests), as the .irq_set_affinity() callback
> won't be called in this case when starting the irq.

That's right, but not limited to ARM. The same problem exists on x86 UP.

So yes, the call makes sense, but the changelog is not really useful. Let
me add a comment to this.

Thanks,

        tglx
Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts
On Fri, Dec 11 2020 at 07:17, Jürgen Groß wrote:
> On 11.12.20 00:20, boris.ostrov...@oracle.com wrote:
>>
>> On 12/10/20 2:26 PM, Thomas Gleixner wrote:
>>> All event channel setups bind the interrupt on CPU0 or the target CPU for
>>> percpu interrupts and overwrite the affinity mask with the corresponding
>>> cpumask. That does not make sense.
>>>
>>> The XEN implementation of irqchip::irq_set_affinity() already picks a
>>> single target CPU out of the affinity mask and the actual target is stored
>>> in the effective CPU mask, so destroying the user chosen affinity mask
>>> which might contain more than one CPU is wrong.
>>>
>>> Change the implementation so that the channel is bound to CPU0 at the XEN
>>> level and leave the affinity mask alone. At startup of the interrupt
>>> affinity will be assigned out of the affinity mask and the XEN binding will
>>> be updated.
>>
>> If that's the case then I wonder whether we need this call at all and
>> instead bind at startup time.
>
> This binding to cpu0 was introduced with commit 97253eeeb792d61ed2
> and I have no reason to believe the underlying problem has been
> eliminated.

  "The kernel-side VCPU binding was not being correctly set for newly
   allocated or bound interdomain events. In ARM guests where 2-level
   events were used, this would result in no interdomain events being
   handled because the kernel-side VCPU masks would all be clear.

   x86 guests would work because the irq affinity was set during irq setup
   and this would set the correct kernel-side VCPU binding."

I'm not convinced that this is really correctly analyzed because affinity
setting is done at irq startup:

        switch (__irq_startup_managed(desc, aff, force)) {
        case IRQ_STARTUP_NORMAL:
                ret = __irq_startup(desc);
                irq_setup_affinity(desc);
                break;

which is completely architecture agnostic. So why should this magically
work on x86 and not on ARM if both are using the same XEN irqchip with the
same irqchip callbacks?

Thanks,

        tglx
Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts
On Thu, Dec 10 2020 at 18:20, boris ostrovsky wrote:
> On 12/10/20 2:26 PM, Thomas Gleixner wrote:
>> All event channel setups bind the interrupt on CPU0 or the target CPU for
>> percpu interrupts and overwrite the affinity mask with the corresponding
>> cpumask. That does not make sense.
>>
>> The XEN implementation of irqchip::irq_set_affinity() already picks a
>> single target CPU out of the affinity mask and the actual target is stored
>> in the effective CPU mask, so destroying the user chosen affinity mask
>> which might contain more than one CPU is wrong.
>>
>> Change the implementation so that the channel is bound to CPU0 at the XEN
>> level and leave the affinity mask alone. At startup of the interrupt
>> affinity will be assigned out of the affinity mask and the XEN binding will
>> be updated.
>
> If that's the case then I wonder whether we need this call at all and
> instead bind at startup time.

I was wondering about that, but my knowledge about the Xen internal
requirements is pretty limited. The current set at least survived basic
testing by Jürgen.

Thanks,

        tglx
Re: [patch 24/30] xen/events: Remove unused bind_evtchn_to_irq_lateeoi()
On Thu, Dec 10 2020 at 18:19, boris ostrovsky wrote:
> On 12/10/20 2:26 PM, Thomas Gleixner wrote:
>> -EXPORT_SYMBOL_GPL(bind_evtchn_to_irq_lateeoi);
>
> include/xen/events.h also needs to be updated (and in the next patch for
> xen_set_affinity_evtchn() as well).

Darn, I lost that.
[patch 01/30] genirq: Move irq_has_action() into core code
This function uses irq_to_desc() and is going to be used by modules to replace the open coded irq_to_desc() (ab)usage. The final goal is to remove the export of irq_to_desc() so driver cannot fiddle with it anymore. Move it into the core code and fixup the usage sites to include the proper header. Signed-off-by: Thomas Gleixner --- arch/alpha/kernel/sys_jensen.c |2 +- arch/x86/kernel/topology.c |1 + include/linux/interrupt.h |1 + include/linux/irqdesc.h|7 +-- kernel/irq/manage.c| 17 + 5 files changed, 21 insertions(+), 7 deletions(-) --- a/arch/alpha/kernel/sys_jensen.c +++ b/arch/alpha/kernel/sys_jensen.c @@ -7,7 +7,7 @@ * * Code supporting the Jensen. */ - +#include #include #include #include --- a/arch/x86/kernel/topology.c +++ b/arch/x86/kernel/topology.c @@ -25,6 +25,7 @@ * * Send feedback to */ +#include #include #include #include --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -232,6 +232,7 @@ extern void devm_free_irq(struct device # define local_irq_enable_in_hardirq() local_irq_enable() #endif +bool irq_has_action(unsigned int irq); extern void disable_irq_nosync(unsigned int irq); extern bool disable_hardirq(unsigned int irq); extern void disable_irq(unsigned int irq); --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -179,12 +179,7 @@ int handle_domain_nmi(struct irq_domain /* Test to see if a driver has successfully requested an irq */ static inline int irq_desc_has_action(struct irq_desc *desc) { - return desc->action != NULL; -} - -static inline int irq_has_action(unsigned int irq) -{ - return irq_desc_has_action(irq_to_desc(irq)); + return desc && desc->action != NULL; } /** --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -2752,3 +2752,20 @@ int irq_set_irqchip_state(unsigned int i return err; } EXPORT_SYMBOL_GPL(irq_set_irqchip_state); + +/** + * irq_has_action - Check whether an interrupt is requested + * @irq: The linux irq number + * + * Returns: A snapshot of the current state + */ +bool irq_has_action(unsigned int irq) +{ + bool res; + + rcu_read_lock(); + res = irq_desc_has_action(irq_to_desc(irq)); + rcu_read_unlock(); + return res; +} +EXPORT_SYMBOL_GPL(irq_has_action);
[patch 03/30] genirq: Move irq_set_lockdep_class() to core
irq_set_lockdep_class() is used from modules and requires irq_to_desc() to be exported. Move it into the core code which lifts another requirement for the export. Signed-off-by: Thomas Gleixner --- include/linux/irqdesc.h | 10 -- kernel/irq/irqdesc.c| 14 ++ 2 files changed, 18 insertions(+), 6 deletions(-) --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -240,16 +240,14 @@ static inline bool irq_is_percpu_devid(u return irq_check_status_bit(irq, IRQ_PER_CPU_DEVID); } +void __irq_set_lockdep_class(unsigned int irq, struct lock_class_key *lock_class, +struct lock_class_key *request_class); static inline void irq_set_lockdep_class(unsigned int irq, struct lock_class_key *lock_class, struct lock_class_key *request_class) { - struct irq_desc *desc = irq_to_desc(irq); - - if (desc) { - lockdep_set_class(&desc->lock, lock_class); - lockdep_set_class(&desc->request_mutex, request_class); - } + if (IS_ENABLED(CONFIG_LOCKDEP)) + __irq_set_lockdep_class(irq, lock_class, request_class); } #endif --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -968,3 +968,17 @@ unsigned int kstat_irqs_usr(unsigned int rcu_read_unlock(); return sum; } + +#ifdef CONFIG_LOCKDEP +void __irq_set_lockdep_class(unsigned int irq, struct lock_class_key *lock_class, +struct lock_class_key *request_class) +{ + struct irq_desc *desc = irq_to_desc(irq); + + if (desc) { + lockdep_set_class(&desc->lock, lock_class); + lockdep_set_class(&desc->request_mutex, request_class); + } +} +EXPORT_SYMBOL_GPL(irq_set_lockdep_class); +#endif
[patch 06/30] parisc/irq: Simplify irq count output for /proc/interrupts
The SMP variant works perfectly fine on UP as well.

Signed-off-by: Thomas Gleixner
Cc: "James E.J. Bottomley"
Cc: Helge Deller
Cc: afzal mohammed
Cc: linux-par...@vger.kernel.org
---
 arch/parisc/kernel/irq.c |    5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

--- a/arch/parisc/kernel/irq.c
+++ b/arch/parisc/kernel/irq.c
@@ -216,12 +216,9 @@ int show_interrupts(struct seq_file *p,
 	if (!action)
 		goto skip;
 	seq_printf(p, "%3d: ", i);
-#ifdef CONFIG_SMP
+
 	for_each_online_cpu(j)
 		seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
-#else
-	seq_printf(p, "%10u ", kstat_irqs(i));
-#endif
 
 	seq_printf(p, " %14s", irq_desc_get_chip(desc)->name);
 #ifndef PARISC_IRQ_CR16_COUNTS
[patch 07/30] genirq: Make kstat_irqs() static
No more users outside the core code. Signed-off-by: Thomas Gleixner --- include/linux/kernel_stat.h |1 - kernel/irq/irqdesc.c| 19 ++- 2 files changed, 6 insertions(+), 14 deletions(-) --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -67,7 +67,6 @@ static inline unsigned int kstat_softirq /* * Number of interrupts per specific IRQ source, since bootup */ -extern unsigned int kstat_irqs(unsigned int irq); extern unsigned int kstat_irqs_usr(unsigned int irq); /* --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -924,15 +924,7 @@ static bool irq_is_nmi(struct irq_desc * return desc->istate & IRQS_NMI; } -/** - * kstat_irqs - Get the statistics for an interrupt - * @irq: The interrupt number - * - * Returns the sum of interrupt counts on all cpus since boot for - * @irq. The caller must ensure that the interrupt is not removed - * concurrently. - */ -unsigned int kstat_irqs(unsigned int irq) +static unsigned int kstat_irqs(unsigned int irq) { struct irq_desc *desc = irq_to_desc(irq); unsigned int sum = 0; @@ -951,13 +943,14 @@ unsigned int kstat_irqs(unsigned int irq } /** - * kstat_irqs_usr - Get the statistics for an interrupt + * kstat_irqs_usr - Get the statistics for an interrupt from thread context * @irq: The interrupt number * * Returns the sum of interrupt counts on all cpus since boot for @irq. - * Contrary to kstat_irqs() this can be called from any context. - * It uses rcu since a concurrent removal of an interrupt descriptor is - * observing an rcu grace period before delayed_free_desc()/irq_kobj_release(). + * + * It uses rcu to protect the access since a concurrent removal of an + * interrupt descriptor is observing an rcu grace period before + * delayed_free_desc()/irq_kobj_release(). */ unsigned int kstat_irqs_usr(unsigned int irq) {
[patch 02/30] genirq: Move status flag checks to core
These checks are used by modules and prevent the removal of the export of irq_to_desc(). Move the accessor into the core. Signed-off-by: Thomas Gleixner --- include/linux/irqdesc.h | 17 + kernel/irq/manage.c | 17 + 2 files changed, 22 insertions(+), 12 deletions(-) --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -223,28 +223,21 @@ irq_set_chip_handler_name_locked(struct data->chip = chip; } +bool irq_check_status_bit(unsigned int irq, unsigned int bitmask); + static inline bool irq_balancing_disabled(unsigned int irq) { - struct irq_desc *desc; - - desc = irq_to_desc(irq); - return desc->status_use_accessors & IRQ_NO_BALANCING_MASK; + return irq_check_status_bit(irq, IRQ_NO_BALANCING_MASK); } static inline bool irq_is_percpu(unsigned int irq) { - struct irq_desc *desc; - - desc = irq_to_desc(irq); - return desc->status_use_accessors & IRQ_PER_CPU; + return irq_check_status_bit(irq, IRQ_PER_CPU); } static inline bool irq_is_percpu_devid(unsigned int irq) { - struct irq_desc *desc; - - desc = irq_to_desc(irq); - return desc->status_use_accessors & IRQ_PER_CPU_DEVID; + return irq_check_status_bit(irq, IRQ_PER_CPU_DEVID); } static inline void --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -2769,3 +2769,23 @@ bool irq_has_action(unsigned int irq) return res; } EXPORT_SYMBOL_GPL(irq_has_action); + +/** + * irq_check_status_bit - Check whether bits in the irq descriptor status are set + * @irq: The linux irq number + * @bitmask: The bitmask to evaluate + * + * Returns: True if one of the bits in @bitmask is set + */ +bool irq_check_status_bit(unsigned int irq, unsigned int bitmask) +{ + struct irq_desc *desc; + bool res = false; + + rcu_read_lock(); + desc = irq_to_desc(irq); + if (desc) + res = !!(desc->status_use_accessors & bitmask); + rcu_read_unlock(); + return res; +}
[patch 21/30] net/mlx4: Use effective interrupt affinity
Using the interrupt affinity mask for checking locality is not really
working well on architectures which support effective affinity masks.

The affinity mask is either the system wide default or set by user space,
but the architecture can or even must reduce the mask to the effective set,
which means that checking the affinity mask itself does not really tell
about the actual target CPUs.

Signed-off-by: Thomas Gleixner
Cc: Tariq Toukan
Cc: "David S. Miller"
Cc: Jakub Kicinski
Cc: netdev@vger.kernel.org
Cc: linux-r...@vger.kernel.org
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -117,7 +117,7 @@ int mlx4_en_activate_cq(struct mlx4_en_p
 			assigned_eq = true;
 		}
 		irq = mlx4_eq_get_irq(mdev->dev, cq->vector);
-		cq->aff_mask = irq_get_affinity_mask(irq);
+		cq->aff_mask = irq_get_effective_affinity_mask(irq);
 	} else {
 		/* For TX we use the same irq per
 		ring we assigned for the RX    */
[patch 18/30] PCI: xilinx-nwl: Use irq_data_get_irq_chip_data()
Going through a full irq descriptor lookup instead of just using the proper helper function which provides direct access is suboptimal. In fact it _is_ wrong because the chip callback needs to get the chip data which is relevant for the chip while using the irq descriptor variant returns the irq chip data of the top level chip of a hierarchy. It does not matter in this case because the chip is the top level chip, but that doesn't make it more correct. Signed-off-by: Thomas Gleixner Cc: Lorenzo Pieralisi Cc: Rob Herring Cc: Bjorn Helgaas Cc: Michal Simek Cc: linux-...@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org --- drivers/pci/controller/pcie-xilinx-nwl.c |8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) --- a/drivers/pci/controller/pcie-xilinx-nwl.c +++ b/drivers/pci/controller/pcie-xilinx-nwl.c @@ -379,13 +379,11 @@ static void nwl_pcie_msi_handler_low(str static void nwl_mask_leg_irq(struct irq_data *data) { - struct irq_desc *desc = irq_to_desc(data->irq); - struct nwl_pcie *pcie; + struct nwl_pcie *pcie = irq_data_get_irq_chip_data(data); unsigned long flags; u32 mask; u32 val; - pcie = irq_desc_get_chip_data(desc); mask = 1 << (data->hwirq - 1); raw_spin_lock_irqsave(&pcie->leg_mask_lock, flags); val = nwl_bridge_readl(pcie, MSGF_LEG_MASK); @@ -395,13 +393,11 @@ static void nwl_mask_leg_irq(struct irq_ static void nwl_unmask_leg_irq(struct irq_data *data) { - struct irq_desc *desc = irq_to_desc(data->irq); - struct nwl_pcie *pcie; + struct nwl_pcie *pcie = irq_data_get_irq_chip_data(data); unsigned long flags; u32 mask; u32 val; - pcie = irq_desc_get_chip_data(desc); mask = 1 << (data->hwirq - 1); raw_spin_lock_irqsave(&pcie->leg_mask_lock, flags); val = nwl_bridge_readl(pcie, MSGF_LEG_MASK);
[patch 25/30] xen/events: Remove disfunct affinity spreading
This function can only ever work when the event channels: - are already established - interrupts assigned to them - the affinity has been set by user space already because any newly set up event channel is forced to be bound to CPU0 and the affinity mask of the interrupt is forced to contain cpumask_of(0). As the CPU0 enforcement was in place _before_ this was implemented it's entirely unclear how that can ever have worked at all. Remove it as preparation for doing it proper. Signed-off-by: Thomas Gleixner Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: xen-de...@lists.xenproject.org --- drivers/xen/events/events_base.c |9 - drivers/xen/evtchn.c | 34 +- 2 files changed, 1 insertion(+), 42 deletions(-) --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -1696,15 +1696,6 @@ static int set_affinity_irq(struct irq_d return ret; } -/* To be called with desc->lock held. */ -int xen_set_affinity_evtchn(struct irq_desc *desc, unsigned int tcpu) -{ - struct irq_data *d = irq_desc_get_irq_data(desc); - - return set_affinity_irq(d, cpumask_of(tcpu), false); -} -EXPORT_SYMBOL_GPL(xen_set_affinity_evtchn); - static void enable_dynirq(struct irq_data *data) { evtchn_port_t evtchn = evtchn_from_irq(data->irq); --- a/drivers/xen/evtchn.c +++ b/drivers/xen/evtchn.c @@ -421,36 +421,6 @@ static void evtchn_unbind_from_user(stru del_evtchn(u, evtchn); } -static DEFINE_PER_CPU(int, bind_last_selected_cpu); - -static void evtchn_bind_interdom_next_vcpu(evtchn_port_t evtchn) -{ - unsigned int selected_cpu, irq; - struct irq_desc *desc; - unsigned long flags; - - irq = irq_from_evtchn(evtchn); - desc = irq_to_desc(irq); - - if (!desc) - return; - - raw_spin_lock_irqsave(&desc->lock, flags); - selected_cpu = this_cpu_read(bind_last_selected_cpu); - selected_cpu = cpumask_next_and(selected_cpu, - desc->irq_common_data.affinity, cpu_online_mask); - - if (unlikely(selected_cpu >= nr_cpu_ids)) - selected_cpu = cpumask_first_and(desc->irq_common_data.affinity, - cpu_online_mask); - - this_cpu_write(bind_last_selected_cpu, selected_cpu); - - /* unmask expects irqs to be disabled */ - xen_set_affinity_evtchn(desc, selected_cpu); - raw_spin_unlock_irqrestore(&desc->lock, flags); -} - static long evtchn_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { @@ -508,10 +478,8 @@ static long evtchn_ioctl(struct file *fi break; rc = evtchn_bind_to_user(u, bind_interdomain.local_port); - if (rc == 0) { + if (rc == 0) rc = bind_interdomain.local_port; - evtchn_bind_interdom_next_vcpu(rc); - } break; }
[patch 26/30] xen/events: Use immediate affinity setting
There is absolutely no reason to mimic the x86 deferred affinity setting. This mechanism is required to handle the hardware induced issues of IO/APIC and MSI and is not in use when the interrupts are remapped. XEN does not need this and can simply change the affinity from the calling context. The core code invokes this with the interrupt descriptor lock held so it is fully serialized against any other operation. Mark the interrupts with IRQ_MOVE_PCNTXT to disable the deferred affinity setting. The conditional mask/unmask operation is already handled in xen_rebind_evtchn_to_cpu(). This makes XEN on x86 use the same mechanics as on e.g. ARM64 where deferred affinity setting is not required and not implemented and the code path in the ack functions is compiled out. Signed-off-by: Thomas Gleixner Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: xen-de...@lists.xenproject.org --- drivers/xen/events/events_base.c | 35 +-- 1 file changed, 9 insertions(+), 26 deletions(-) --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -628,6 +628,11 @@ static void xen_irq_init(unsigned irq) info->refcnt = -1; set_info_for_irq(irq, info); + /* +* Interrupt affinity setting can be immediate. No point +* in delaying it until an interrupt is handled. +*/ + irq_set_status_flags(irq, IRQ_MOVE_PCNTXT); INIT_LIST_HEAD(&info->eoi_list); list_add_tail(&info->list, &xen_irq_list_head); @@ -739,18 +744,7 @@ static void eoi_pirq(struct irq_data *da if (!VALID_EVTCHN(evtchn)) return; - if (unlikely(irqd_is_setaffinity_pending(data)) && - likely(!irqd_irq_disabled(data))) { - int masked = test_and_set_mask(evtchn); - - clear_evtchn(evtchn); - - irq_move_masked_irq(data); - - if (!masked) - unmask_evtchn(evtchn); - } else - clear_evtchn(evtchn); + clear_evtchn(evtchn); if (pirq_needs_eoi(data->irq)) { rc = HYPERVISOR_physdev_op(PHYSDEVOP_eoi, &eoi); @@ -1641,7 +1635,6 @@ void rebind_evtchn_irq(evtchn_port_t evt mutex_unlock(&irq_mapping_update_lock); bind_evtchn_to_cpu(evtchn, info->cpu); - /* This will be deferred until interrupt is processed */ irq_set_affinity(irq, cpumask_of(info->cpu)); /* Unmask the event channel. */ @@ -1688,8 +1681,9 @@ static int set_affinity_irq(struct irq_d bool force) { unsigned tcpu = cpumask_first_and(dest, cpu_online_mask); - int ret = xen_rebind_evtchn_to_cpu(evtchn_from_irq(data->irq), tcpu); + int ret; + ret = xen_rebind_evtchn_to_cpu(evtchn_from_irq(data->irq), tcpu); if (!ret) irq_data_update_effective_affinity(data, cpumask_of(tcpu)); @@ -1719,18 +1713,7 @@ static void ack_dynirq(struct irq_data * if (!VALID_EVTCHN(evtchn)) return; - if (unlikely(irqd_is_setaffinity_pending(data)) && - likely(!irqd_irq_disabled(data))) { - int masked = test_and_set_mask(evtchn); - - clear_evtchn(evtchn); - - irq_move_masked_irq(data); - - if (!masked) - unmask_evtchn(evtchn); - } else - clear_evtchn(evtchn); + clear_evtchn(evtchn); } static void mask_ack_dynirq(struct irq_data *data)
[patch 08/30] genirq: Provide irq_desc_kstat_cpu()
Most users of kstat_irqs_cpu() have the irq descriptor already. No point in calling into the core code and looking it up once more. Use it in per_cpu_count_show() to start with. Signed-off-by: Thomas Gleixner --- include/linux/irqdesc.h |6 ++ kernel/irq/irqdesc.c|4 ++-- 2 files changed, 8 insertions(+), 2 deletions(-) --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -113,6 +113,12 @@ static inline void irq_unlock_sparse(voi extern struct irq_desc irq_desc[NR_IRQS]; #endif +static inline unsigned int irq_desc_kstat_cpu(struct irq_desc *desc, + unsigned int cpu) +{ + return desc->kstat_irqs ? *per_cpu_ptr(desc->kstat_irqs, cpu) : 0; +} + static inline struct irq_desc *irq_data_to_desc(struct irq_data *data) { return container_of(data->common, struct irq_desc, irq_common_data); --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -147,12 +147,12 @@ static ssize_t per_cpu_count_show(struct struct kobj_attribute *attr, char *buf) { struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj); - int cpu, irq = desc->irq_data.irq; ssize_t ret = 0; char *p = ""; + int cpu; for_each_possible_cpu(cpu) { - unsigned int c = kstat_irqs_cpu(irq, cpu); + unsigned int c = irq_desc_kstat_cpu(desc, cpu); ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%s%u", p, c); p = ",";
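The conversions in the following patches all look the same; as a sketch (hypothetical function, not part of the series), a caller which already holds the descriptor reads the per-CPU counts directly instead of going through kstat_irqs_cpu(), which would look the descriptor up again:

/* Sketch: per-CPU counts for a descriptor the caller already holds */
#include <linux/cpumask.h>
#include <linux/irqdesc.h>
#include <linux/seq_file.h>

static void foo_show_counts(struct seq_file *p, struct irq_desc *desc)
{
	unsigned int cpu;

	for_each_online_cpu(cpu)
		seq_printf(p, "%10u ", irq_desc_kstat_cpu(desc, cpu));
	seq_putc(p, '\n');
}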
[patch 11/30] parisc/irq: Use irq_desc_kstat_cpu() in show_interrupts()
The irq descriptor is already there, no need to look it up again. Signed-off-by: Thomas Gleixner Cc: "James E.J. Bottomley" Cc: Helge Deller Cc: afzal mohammed Cc: linux-par...@vger.kernel.org --- arch/parisc/kernel/irq.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/parisc/kernel/irq.c +++ b/arch/parisc/kernel/irq.c @@ -218,7 +218,7 @@ int show_interrupts(struct seq_file *p, seq_printf(p, "%3d: ", i); for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); + seq_printf(p, "%10u ", irq_desc_kstat_cpu(desc, j)); seq_printf(p, " %14s", irq_desc_get_chip(desc)->name); #ifndef PARISC_IRQ_CR16_COUNTS
[patch 10/30] arm64/smp: Use irq_desc_kstat_cpu() in arch_show_interrupts()
The irq descriptor is already there, no need to look it up again. Signed-off-by: Thomas Gleixner Cc: Mark Rutland Cc: Catalin Marinas Cc: Will Deacon Cc: Marc Zyngier Cc: linux-arm-ker...@lists.infradead.org --- arch/arm64/kernel/smp.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -809,7 +809,7 @@ int arch_show_interrupts(struct seq_file seq_printf(p, "%*s%u:%s", prec - 1, "IPI", i, prec >= 4 ? " " : ""); for_each_online_cpu(cpu) - seq_printf(p, "%10u ", kstat_irqs_cpu(irq, cpu)); + seq_printf(p, "%10u ", irq_desc_kstat_cpu(ipi_desc[i], cpu)); seq_printf(p, " %s\n", ipi_types[i]); }
[patch 12/30] s390/irq: Use irq_desc_kstat_cpu() in show_msi_interrupt()
The irq descriptor is already there, no need to look it up again. Signed-off-by: Thomas Gleixner Cc: Christian Borntraeger Cc: Heiko Carstens Cc: linux-s...@vger.kernel.org --- arch/s390/kernel/irq.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/s390/kernel/irq.c +++ b/arch/s390/kernel/irq.c @@ -124,7 +124,7 @@ static void show_msi_interrupt(struct se raw_spin_lock_irqsave(&desc->lock, flags); seq_printf(p, "%3d: ", irq); for_each_online_cpu(cpu) - seq_printf(p, "%10u ", kstat_irqs_cpu(irq, cpu)); + seq_printf(p, "%10u ", irq_desc_kstat_cpu(desc, cpu)); if (desc->irq_data.chip) seq_printf(p, " %8s", desc->irq_data.chip->name);
[patch 13/30] drm/i915/lpe_audio: Remove pointless irq_to_desc() usage
Nothing uses the result and nothing should ever use it in driver code. Signed-off-by: Thomas Gleixner Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: David Airlie Cc: Daniel Vetter Cc: Pankaj Bharadiya Cc: Chris Wilson Cc: Wambui Karuga Cc: intel-...@lists.freedesktop.org Cc: dri-de...@lists.freedesktop.org --- drivers/gpu/drm/i915/display/intel_lpe_audio.c |4 1 file changed, 4 deletions(-) --- a/drivers/gpu/drm/i915/display/intel_lpe_audio.c +++ b/drivers/gpu/drm/i915/display/intel_lpe_audio.c @@ -297,13 +297,9 @@ int intel_lpe_audio_init(struct drm_i915 */ void intel_lpe_audio_teardown(struct drm_i915_private *dev_priv) { - struct irq_desc *desc; - if (!HAS_LPE_AUDIO(dev_priv)) return; - desc = irq_to_desc(dev_priv->lpe_audio.irq); - lpe_audio_platdev_destroy(dev_priv); irq_free_desc(dev_priv->lpe_audio.irq);
[patch 19/30] PCI: mobiveil: Use irq_data_get_irq_chip_data()
Going through a full irq descriptor lookup instead of just using the proper helper function which provides direct access is suboptimal. In fact it _is_ wrong because the chip callback needs to get the chip data which is relevant for the chip while using the irq descriptor variant returns the irq chip data of the top level chip of a hierarchy. It does not matter in this case because the chip is the top level chip, but that doesn't make it more correct. Signed-off-by: Thomas Gleixner Cc: Karthikeyan Mitran Cc: Hou Zhiqiang Cc: Lorenzo Pieralisi Cc: Rob Herring Cc: Bjorn Helgaas Cc: linux-...@vger.kernel.org --- drivers/pci/controller/mobiveil/pcie-mobiveil-host.c |8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) --- a/drivers/pci/controller/mobiveil/pcie-mobiveil-host.c +++ b/drivers/pci/controller/mobiveil/pcie-mobiveil-host.c @@ -306,13 +306,11 @@ int mobiveil_host_init(struct mobiveil_p static void mobiveil_mask_intx_irq(struct irq_data *data) { - struct irq_desc *desc = irq_to_desc(data->irq); - struct mobiveil_pcie *pcie; + struct mobiveil_pcie *pcie = irq_data_get_irq_chip_data(data); struct mobiveil_root_port *rp; unsigned long flags; u32 mask, shifted_val; - pcie = irq_desc_get_chip_data(desc); rp = &pcie->rp; mask = 1 << ((data->hwirq + PAB_INTX_START) - 1); raw_spin_lock_irqsave(&rp->intx_mask_lock, flags); @@ -324,13 +322,11 @@ static void mobiveil_mask_intx_irq(struc static void mobiveil_unmask_intx_irq(struct irq_data *data) { - struct irq_desc *desc = irq_to_desc(data->irq); - struct mobiveil_pcie *pcie; + struct mobiveil_pcie *pcie = irq_data_get_irq_chip_data(data); struct mobiveil_root_port *rp; unsigned long flags; u32 shifted_val, mask; - pcie = irq_desc_get_chip_data(desc); rp = &pcie->rp; mask = 1 << ((data->hwirq + PAB_INTX_START) - 1); raw_spin_lock_irqsave(&rp->intx_mask_lock, flags);
[patch 16/30] mfd: ab8500-debugfs: Remove the racy fiddling with irq_desc
First of all drivers have absolutely no business to dig into the internals of an irq descriptor. That's core code and subject to change. All of this information is readily available to /proc/interrupts in a safe and race free way. Remove the inspection code which is a blatant violation of subsystem boundaries and racy against concurrent modifications of the interrupt descriptor. Print the irq line instead so the information can be looked up in a sane way in /proc/interrupts. Signed-off-by: Thomas Gleixner Cc: Linus Walleij Cc: Lee Jones Cc: linux-arm-ker...@lists.infradead.org --- drivers/mfd/ab8500-debugfs.c | 16 +++- 1 file changed, 3 insertions(+), 13 deletions(-) --- a/drivers/mfd/ab8500-debugfs.c +++ b/drivers/mfd/ab8500-debugfs.c @@ -1513,24 +1513,14 @@ static int ab8500_interrupts_show(struct { int line; - seq_puts(s, "name: number: number of: wake:\n"); + seq_puts(s, "name: number: irq: number of: wake:\n"); for (line = 0; line < num_interrupt_lines; line++) { - struct irq_desc *desc = irq_to_desc(line + irq_first); - - seq_printf(s, "%3i: %6i %4i", + seq_printf(s, "%3i: %6i %4i %4i\n", line, + line + irq_first, num_interrupts[line], num_wake_interrupts[line]); - - if (desc && desc->name) - seq_printf(s, "-%-8s", desc->name); - if (desc && desc->action) { - struct irqaction *action = desc->action; - - seq_printf(s, " %s", action->name); - while ((action = action->next) != NULL) - seq_printf(s, ", %s", action->name); } seq_putc(s, '\n'); }
[patch 30/30] genirq: Remove export of irq_to_desc()
No more (ab)use in modules finally. Remove the export so there won't come new ones. Signed-off-by: Thomas Gleixner --- kernel/irq/irqdesc.c |1 - 1 file changed, 1 deletion(-) --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -352,7 +352,6 @@ struct irq_desc *irq_to_desc(unsigned in { return radix_tree_lookup(&irq_desc_tree, irq); } -EXPORT_SYMBOL(irq_to_desc); static void delete_irq_desc(unsigned int irq) {
[patch 15/30] pinctrl: nomadik: Use irq_has_action()
Let the core code do the fiddling with irq_desc. Signed-off-by: Thomas Gleixner Cc: Linus Walleij Cc: linux-arm-ker...@lists.infradead.org Cc: linux-g...@vger.kernel.org --- drivers/pinctrl/nomadik/pinctrl-nomadik.c |3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --- a/drivers/pinctrl/nomadik/pinctrl-nomadik.c +++ b/drivers/pinctrl/nomadik/pinctrl-nomadik.c @@ -948,7 +948,6 @@ static void nmk_gpio_dbg_show_one(struct (mode < 0) ? "unknown" : modes[mode]); } else { int irq = chip->to_irq(chip, offset); - struct irq_desc *desc = irq_to_desc(irq); const int pullidx = pull ? 1 : 0; int val; static const char * const pulls[] = { @@ -969,7 +968,7 @@ static void nmk_gpio_dbg_show_one(struct * This races with request_irq(), set_irq_type(), * and set_irq_wake() ... but those are "rare". */ - if (irq > 0 && desc && desc->action) { + if (irq > 0 && irq_has_action(irq)) { char *trigger; if (nmk_chip->edge_rising & BIT(offset))
[patch 20/30] net/mlx4: Replace irq_to_desc() abuse
No driver has any business with the internals of an interrupt descriptor. Storing a pointer to it just to use yet another helper at the actual usage site to retrieve the affinity mask is creative at best. Just because C does not allow encapsulation does not mean that the kernel has no limits. Retrieve a pointer to the affinity mask itself and use that. It's still using an interface which is usually not for random drivers, but definitely less hideous than the previous hack. Signed-off-by: Thomas Gleixner Cc: Tariq Toukan Cc: "David S. Miller" Cc: Jakub Kicinski Cc: netdev@vger.kernel.org Cc: linux-r...@vger.kernel.org --- drivers/net/ethernet/mellanox/mlx4/en_cq.c |8 +++- drivers/net/ethernet/mellanox/mlx4/en_rx.c |6 +- drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |3 ++- 3 files changed, 6 insertions(+), 11 deletions(-) --- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c @@ -90,7 +90,7 @@ int mlx4_en_activate_cq(struct mlx4_en_p int cq_idx) { struct mlx4_en_dev *mdev = priv->mdev; - int err = 0; + int irq, err = 0; int timestamp_en = 0; bool assigned_eq = false; @@ -116,10 +116,8 @@ int mlx4_en_activate_cq(struct mlx4_en_p assigned_eq = true; } - - cq->irq_desc = - irq_to_desc(mlx4_eq_get_irq(mdev->dev, - cq->vector)); + irq = mlx4_eq_get_irq(mdev->dev, cq->vector); + cq->aff_mask = irq_get_affinity_mask(irq); } else { /* For TX we use the same irq per ring we assigned for the RX*/ --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -959,8 +959,6 @@ int mlx4_en_poll_rx_cq(struct napi_struc /* If we used up all the quota - we're probably not done yet... */ if (done == budget || !clean_complete) { - const struct cpumask *aff; - struct irq_data *idata; int cpu_curr; /* in case we got here because of !clean_complete */ @@ -969,10 +967,8 @@ int mlx4_en_poll_rx_cq(struct napi_struc INC_PERF_COUNTER(priv->pstats.napi_quota); cpu_curr = smp_processor_id(); - idata = irq_desc_get_irq_data(cq->irq_desc); - aff = irq_data_get_affinity_mask(idata); - if (likely(cpumask_test_cpu(cpu_curr, aff))) + if (likely(cpumask_test_cpu(cpu_curr, cq->aff_mask))) return budget; /* Current cpu is not according to smp_irq_affinity - --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h @@ -46,6 +46,7 @@ #endif #include #include +#include #include #include @@ -380,7 +381,7 @@ struct mlx4_en_cq { struct mlx4_cqe *buf; #define MLX4_EN_OPCODE_ERROR 0x1e - struct irq_desc *irq_desc; + const struct cpumask *aff_mask; }; struct mlx4_en_port_profile {
[patch 23/30] net/mlx5: Use effective interrupt affinity
Using the interrupt affinity mask for checking locality is not really working well on architectures which support effective affinity masks. The affinity mask is either the system wide default or set by user space, but the architecture can or even must reduce the mask to the effective set, which means that checking the affinity mask itself does not really tell about the actual target CPUs. Signed-off-by: Thomas Gleixner Cc: Saeed Mahameed Cc: Leon Romanovsky Cc: "David S. Miller" Cc: Jakub Kicinski Cc: netdev@vger.kernel.org Cc: linux-r...@vger.kernel.org --- drivers/net/ethernet/mellanox/mlx5/core/en_main.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -1998,7 +1998,7 @@ static int mlx5e_open_channel(struct mlx c->num_tc = params->num_tc; c->xdp = !!params->xdp_prog; c->stats= &priv->channel_stats[ix].ch; - c->aff_mask = irq_get_affinity_mask(irq); + c->aff_mask = irq_get_effective_affinity_mask(irq); c->lag_port = mlx5e_enumerate_lag_port(priv->mdev, ix); netif_napi_add(netdev, &c->napi, mlx5e_napi_poll, 64);
[patch 22/30] net/mlx5: Replace irq_to_desc() abuse
No driver has any business with the internals of an interrupt descriptor. Storing a pointer to it just to use yet another helper at the actual usage site to retrieve the affinity mask is creative at best. Just because C does not allow encapsulation does not mean that the kernel has no limits. Retrieve a pointer to the affinity mask itself and use that. It's still using an interface which is usually not for random drivers, but definitely less hideous than the previous hack. Signed-off-by: Thomas Gleixner --- drivers/net/ethernet/mellanox/mlx5/core/en.h |2 +- drivers/net/ethernet/mellanox/mlx5/core/en_main.c |2 +- drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |6 +- 3 files changed, 3 insertions(+), 7 deletions(-) --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -669,7 +669,7 @@ struct mlx5e_channel { spinlock_t async_icosq_lock; /* data path - accessed per napi poll */ - struct irq_desc *irq_desc; + const struct cpumask *aff_mask; struct mlx5e_ch_stats *stats; /* control */ --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -1998,7 +1998,7 @@ static int mlx5e_open_channel(struct mlx c->num_tc = params->num_tc; c->xdp = !!params->xdp_prog; c->stats= &priv->channel_stats[ix].ch; - c->irq_desc = irq_to_desc(irq); + c->aff_mask = irq_get_affinity_mask(irq); c->lag_port = mlx5e_enumerate_lag_port(priv->mdev, ix); netif_napi_add(netdev, &c->napi, mlx5e_napi_poll, 64); --- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c @@ -40,12 +40,8 @@ static inline bool mlx5e_channel_no_affinity_change(struct mlx5e_channel *c) { int current_cpu = smp_processor_id(); - const struct cpumask *aff; - struct irq_data *idata; - idata = irq_desc_get_irq_data(c->irq_desc); - aff = irq_data_get_affinity_mask(idata); - return cpumask_test_cpu(current_cpu, aff); + return cpumask_test_cpu(current_cpu, c->aff_mask); } static void mlx5e_handle_tx_dim(struct mlx5e_txqsq *sq)
[patch 24/30] xen/events: Remove unused bind_evtchn_to_irq_lateeoi()
Signed-off-by: Thomas Gleixner Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: xen-de...@lists.xenproject.org --- drivers/xen/events/events_base.c |6 -- 1 file changed, 6 deletions(-) --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -1132,12 +1132,6 @@ int bind_evtchn_to_irq(evtchn_port_t evt } EXPORT_SYMBOL_GPL(bind_evtchn_to_irq); -int bind_evtchn_to_irq_lateeoi(evtchn_port_t evtchn) -{ - return bind_evtchn_to_irq_chip(evtchn, &xen_lateeoi_chip); -} -EXPORT_SYMBOL_GPL(bind_evtchn_to_irq_lateeoi); - static int bind_ipi_to_irq(unsigned int ipi, unsigned int cpu) { struct evtchn_bind_ipi bind_ipi;
[patch 17/30] NTB/msi: Use irq_has_action()
Use the proper core function. Signed-off-by: Thomas Gleixner Cc: Jon Mason Cc: Dave Jiang Cc: Allen Hubbe Cc: linux-...@googlegroups.com --- drivers/ntb/msi.c |4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) --- a/drivers/ntb/msi.c +++ b/drivers/ntb/msi.c @@ -282,15 +282,13 @@ int ntbm_msi_request_threaded_irq(struct struct ntb_msi_desc *msi_desc) { struct msi_desc *entry; - struct irq_desc *desc; int ret; if (!ntb->msi) return -EINVAL; for_each_pci_msi_entry(entry, ntb->pdev) { - desc = irq_to_desc(entry->irq); - if (desc->action) + if (irq_has_action(entry->irq)) continue; ret = devm_request_threaded_irq(&ntb->dev, entry->irq, handler,
[patch 28/30] xen/events: Reduce irq_info::spurious_cnt storage size
To prepare for interrupt spreading reduce the storage size of irq_info::spurious_cnt to u8 so the required flag for the spreading logic will not increase the storage size. Protect the usage site against overruns. Signed-off-by: Thomas Gleixner Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: xen-de...@lists.xenproject.org --- drivers/xen/events/events_base.c |8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -95,7 +95,7 @@ struct irq_info { struct list_head list; struct list_head eoi_list; short refcnt; - short spurious_cnt; + u8 spurious_cnt; enum xen_irq_type type; /* type */ unsigned irq; evtchn_port_t evtchn; /* event channel */ @@ -528,8 +528,10 @@ static void xen_irq_lateeoi_locked(struc return; if (spurious) { - if ((1 << info->spurious_cnt) < (HZ << 2)) - info->spurious_cnt++; + if ((1 << info->spurious_cnt) < (HZ << 2)) { + if (info->spurious_cnt != 0xFF) + info->spurious_cnt++; + } if (info->spurious_cnt > 1) { delay = 1 << (info->spurious_cnt - 2); if (delay > HZ)
[patch 29/30] xen/events: Implement irq distribution
Keep track of the assignments of event channels to CPUs and select the online CPU with the least assigned channels in the affinity mask which is handed to irq_chip::irq_set_affinity() from the core code. Signed-off-by: Thomas Gleixner Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: xen-de...@lists.xenproject.org --- drivers/xen/events/events_base.c | 72 ++- 1 file changed, 64 insertions(+), 8 deletions(-) --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -96,6 +96,7 @@ struct irq_info { struct list_head eoi_list; short refcnt; u8 spurious_cnt; + u8 is_accounted; enum xen_irq_type type; /* type */ unsigned irq; evtchn_port_t evtchn; /* event channel */ @@ -161,6 +162,9 @@ static DEFINE_PER_CPU(int [NR_VIRQS], vi /* IRQ <-> IPI mapping */ static DEFINE_PER_CPU(int [XEN_NR_IPIS], ipi_to_irq) = {[0 ... XEN_NR_IPIS-1] = -1}; +/* Event channel distribution data */ +static atomic_t channels_on_cpu[NR_CPUS]; + static int **evtchn_to_irq; #ifdef CONFIG_X86 static unsigned long *pirq_eoi_map; @@ -257,6 +261,32 @@ static void set_info_for_irq(unsigned in irq_set_chip_data(irq, info); } +/* Per CPU channel accounting */ +static void channels_on_cpu_dec(struct irq_info *info) +{ + if (!info->is_accounted) + return; + + info->is_accounted = 0; + + if (WARN_ON_ONCE(info->cpu >= nr_cpu_ids)) + return; + + WARN_ON_ONCE(!atomic_add_unless(&channels_on_cpu[info->cpu], -1 , 0)); +} + +static void channels_on_cpu_inc(struct irq_info *info) +{ + if (WARN_ON_ONCE(info->cpu >= nr_cpu_ids)) + return; + + if (WARN_ON_ONCE(!atomic_add_unless(&channels_on_cpu[info->cpu], 1, + INT_MAX))) + return; + + info->is_accounted = 1; +} + /* Constructors for packed IRQ information. */ static int xen_irq_info_common_setup(struct irq_info *info, unsigned irq, @@ -339,6 +369,7 @@ static void xen_irq_info_cleanup(struct { set_evtchn_to_irq(info->evtchn, -1); info->evtchn = 0; + channels_on_cpu_dec(info); } /* @@ -449,7 +480,9 @@ static void bind_evtchn_to_cpu(evtchn_po xen_evtchn_port_bind_to_cpu(evtchn, cpu, info->cpu); + channels_on_cpu_dec(info); info->cpu = cpu; + channels_on_cpu_inc(info); } /** @@ -622,11 +655,6 @@ static void xen_irq_init(unsigned irq) { struct irq_info *info; -#ifdef CONFIG_SMP - /* By default all event channels notify CPU#0. */ - cpumask_copy(irq_get_affinity_mask(irq), cpumask_of(0)); -#endif - info = kzalloc(sizeof(*info), GFP_KERNEL); if (info == NULL) panic("Unable to allocate metadata for IRQ%d\n", irq); @@ -1691,10 +1719,34 @@ static int xen_rebind_evtchn_to_cpu(evtc return 0; } +/* + * Find the CPU within @dest mask which has the least number of channels + * assigned. This is not precise as the per cpu counts can be modified + * concurrently. 
+ */ +static unsigned int select_target_cpu(const struct cpumask *dest) +{ + unsigned int cpu, best_cpu = UINT_MAX, minch = UINT_MAX; + + for_each_cpu_and(cpu, dest, cpu_online_mask) { + unsigned int curch = atomic_read(&channels_on_cpu[cpu]); + + if (curch < minch) { + minch = curch; + best_cpu = cpu; + } + } + + /* If this happens accounting is screwed up */ + if (WARN_ON_ONCE(best_cpu == UINT_MAX)) + best_cpu = cpumask_first(dest); + return best_cpu; +} + static int set_affinity_irq(struct irq_data *data, const struct cpumask *dest, bool force) { - unsigned tcpu = cpumask_first_and(dest, cpu_online_mask); + unsigned int tcpu = select_target_cpu(dest); int ret; ret = xen_rebind_evtchn_to_cpu(evtchn_from_irq(data->irq), tcpu); @@ -1922,8 +1974,12 @@ void xen_irq_resume(void) xen_evtchn_resume(); /* No IRQ <-> event-channel mappings. */ - list_for_each_entry(info, &xen_irq_list_head, list) - info->evtchn = 0; /* zap event-channel binding */ + list_for_each_entry(info, &xen_irq_list_head, list) { + /* Zap event-channel binding */ + info->evtchn = 0; + /* Adjust accounting */ + channels_on_cpu_dec(info); + } clear_evtchn_to_irq_all();
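The accounting added above leans on atomic_add_unless() so the per-CPU counters can neither wrap at the top nor go below zero; roughly this pattern (sketch with a single hypothetical counter):

/* Sketch: guarded counters which neither wrap nor go negative */
#include <linux/atomic.h>
#include <linux/limits.h>

static atomic_t foo_count;

static void foo_count_inc(void)
{
	/* Saturate at INT_MAX instead of wrapping around */
	atomic_add_unless(&foo_count, 1, INT_MAX);
}

static void foo_count_dec(void)
{
	/* Never drop below zero even if inc/dec got out of sync */
	atomic_add_unless(&foo_count, -1, 0);
}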
[patch 27/30] xen/events: Only force affinity mask for percpu interrupts
All event channel setups bind the interrupt on CPU0 or the target CPU for percpu interrupts and overwrite the affinity mask with the corresponding cpumask. That does not make sense. The XEN implementation of irqchip::irq_set_affinity() already picks a single target CPU out of the affinity mask and the actual target is stored in the effective CPU mask, so destroying the user chosen affinity mask which might contain more than one CPU is wrong. Change the implementation so that the channel is bound to CPU0 at the XEN level and leave the affinity mask alone. At startup of the interrupt affinity will be assigned out of the affinity mask and the XEN binding will be updated. Only keep the enforcement for real percpu interrupts. On resume the overwrite is not required either because info->cpu and the affinity mask are still the same as at the time of suspend. Same for rebind_evtchn_irq(). This also prepares for proper interrupt spreading. Signed-off-by: Thomas Gleixner Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: xen-de...@lists.xenproject.org --- drivers/xen/events/events_base.c | 42 ++- 1 file changed, 28 insertions(+), 14 deletions(-) --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -433,15 +433,20 @@ static bool pirq_needs_eoi_flag(unsigned return info->u.pirq.flags & PIRQ_NEEDS_EOI; } -static void bind_evtchn_to_cpu(evtchn_port_t evtchn, unsigned int cpu) +static void bind_evtchn_to_cpu(evtchn_port_t evtchn, unsigned int cpu, + bool force_affinity) { int irq = get_evtchn_to_irq(evtchn); struct irq_info *info = info_for_irq(irq); BUG_ON(irq == -1); -#ifdef CONFIG_SMP - cpumask_copy(irq_get_affinity_mask(irq), cpumask_of(cpu)); -#endif + + if (IS_ENABLED(CONFIG_SMP) && force_affinity) { + cpumask_copy(irq_get_affinity_mask(irq), cpumask_of(cpu)); + cpumask_copy(irq_get_effective_affinity_mask(irq), +cpumask_of(cpu)); + } + xen_evtchn_port_bind_to_cpu(evtchn, cpu, info->cpu); info->cpu = cpu; @@ -788,7 +793,7 @@ static unsigned int __startup_pirq(unsig goto err; info->evtchn = evtchn; - bind_evtchn_to_cpu(evtchn, 0); + bind_evtchn_to_cpu(evtchn, 0, false); rc = xen_evtchn_port_setup(evtchn); if (rc) @@ -1107,8 +1112,8 @@ static int bind_evtchn_to_irq_chip(evtch irq = ret; goto out; } - /* New interdomain events are bound to VCPU 0. */ - bind_evtchn_to_cpu(evtchn, 0); + /* New interdomain events are initially bound to VCPU 0. */ + bind_evtchn_to_cpu(evtchn, 0, false); } else { struct irq_info *info = info_for_irq(irq); WARN_ON(info == NULL || info->type != IRQT_EVTCHN); @@ -1156,7 +1161,11 @@ static int bind_ipi_to_irq(unsigned int irq = ret; goto out; } - bind_evtchn_to_cpu(evtchn, cpu); + /* +* Force the affinity mask to the target CPU so proc shows +* the correct target. +*/ + bind_evtchn_to_cpu(evtchn, cpu, true); } else { struct irq_info *info = info_for_irq(irq); WARN_ON(info == NULL || info->type != IRQT_IPI); @@ -1269,7 +1278,11 @@ int bind_virq_to_irq(unsigned int virq, goto out; } - bind_evtchn_to_cpu(evtchn, cpu); + /* +* Force the affinity mask for percpu interrupts so proc +* shows the correct target. +*/ + bind_evtchn_to_cpu(evtchn, cpu, percpu); } else { struct irq_info *info = info_for_irq(irq); WARN_ON(info == NULL || info->type != IRQT_VIRQ); @@ -1634,8 +1647,7 @@ void rebind_evtchn_irq(evtchn_port_t evt mutex_unlock(&irq_mapping_update_lock); -bind_evtchn_to_cpu(evtchn, info->cpu); - irq_set_affinity(irq, cpumask_of(info->cpu)); + bind_evtchn_to_cpu(evtchn, info->cpu, false); /* Unmask the event channel. 
*/ enable_irq(irq); @@ -1669,7 +1681,7 @@ static int xen_rebind_evtchn_to_cpu(evtc * it, but don't do the xenlinux-level rebind in that case. */ if (HYPERVISOR_event_channel_op(EVTCHNOP_bind_vcpu, &bind_vcpu) >= 0) - bind_evtchn_to_cpu(evtchn, tcpu); + bind_evtchn_to_cpu(evtchn, tcpu, false); if (!masked) unmask_evtchn(evtchn); @@ -1798,7 +1810,8 @@ static void restore_cpu_virqs(unsigned i /* Record the new mapping. */
[patch 09/30] ARM: smp: Use irq_desc_kstat_cpu() in show_ipi_list()
The irq descriptor is already there, no need to look it up again. Signed-off-by: Thomas Gleixner Cc: Marc Zyngier Cc: Russell King Cc: linux-arm-ker...@lists.infradead.org --- arch/arm/kernel/smp.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/arm/kernel/smp.c +++ b/arch/arm/kernel/smp.c @@ -550,7 +550,7 @@ void show_ipi_list(struct seq_file *p, i seq_printf(p, "%*s%u: ", prec - 1, "IPI", i); for_each_online_cpu(cpu) - seq_printf(p, "%10u ", kstat_irqs_cpu(irq, cpu)); + seq_printf(p, "%10u ", irq_desc_kstat_cpu(ipi_desc[i], cpu)); seq_printf(p, " %s\n", ipi_types[i]); }
[patch 14/30] drm/i915/pmu: Replace open coded kstat_irqs() copy
Driver code has no business with the internals of the irq descriptor. Aside of that the count is per interrupt line and therefore takes interrupts from other devices into account which share the interrupt line and are not handled by the graphics driver. Replace it with a pmu private count which only counts interrupts which originate from the graphics card. To avoid atomics or heuristics of some sort make the counter field 'unsigned long'. That limits the count to 4e9 on 32bit which is a lot and postprocessing can easily deal with the occasional wraparound. Signed-off-by: Thomas Gleixner Cc: Tvrtko Ursulin Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: David Airlie Cc: Daniel Vetter Cc: intel-...@lists.freedesktop.org Cc: dri-de...@lists.freedesktop.org --- drivers/gpu/drm/i915/i915_irq.c | 34 ++ drivers/gpu/drm/i915/i915_pmu.c | 18 +- drivers/gpu/drm/i915/i915_pmu.h |8 3 files changed, 43 insertions(+), 17 deletions(-) --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -60,6 +60,24 @@ * and related files, but that will be described in separate chapters. */ +/* + * Interrupt statistic for PMU. Increments the counter only if the + * interrupt originated from the the GPU so interrupts from a device which + * shares the interrupt line are not accounted. + */ +static inline void pmu_irq_stats(struct drm_i915_private *priv, +irqreturn_t res) +{ + if (unlikely(res != IRQ_HANDLED)) + return; + + /* +* A clever compiler translates that into INC. A not so clever one +* should at least prevent store tearing. +*/ + WRITE_ONCE(priv->pmu.irq_count, priv->pmu.irq_count + 1); +} + typedef bool (*long_pulse_detect_func)(enum hpd_pin pin, u32 val); static const u32 hpd_ilk[HPD_NUM_PINS] = { @@ -1599,6 +1617,8 @@ static irqreturn_t valleyview_irq_handle valleyview_pipestat_irq_handler(dev_priv, pipe_stats); } while (0); + pmu_irq_stats(dev_priv, ret); + enable_rpm_wakeref_asserts(&dev_priv->runtime_pm); return ret; @@ -1676,6 +1696,8 @@ static irqreturn_t cherryview_irq_handle valleyview_pipestat_irq_handler(dev_priv, pipe_stats); } while (0); + pmu_irq_stats(dev_priv, ret); + enable_rpm_wakeref_asserts(&dev_priv->runtime_pm); return ret; @@ -2103,6 +2125,8 @@ static irqreturn_t ilk_irq_handler(int i if (sde_ier) raw_reg_write(regs, SDEIER, sde_ier); + pmu_irq_stats(i915, ret); + /* IRQs are synced during runtime_suspend, we don't require a wakeref */ enable_rpm_wakeref_asserts(&i915->runtime_pm); @@ -2419,6 +2443,8 @@ static irqreturn_t gen8_irq_handler(int gen8_master_intr_enable(regs); + pmu_irq_stats(dev_priv, IRQ_HANDLED); + return IRQ_HANDLED; } @@ -2514,6 +2540,8 @@ static __always_inline irqreturn_t gen11_gu_misc_irq_handler(gt, gu_misc_iir); + pmu_irq_stats(i915, IRQ_HANDLED); + return IRQ_HANDLED; } @@ -3688,6 +3716,8 @@ static irqreturn_t i8xx_irq_handler(int i8xx_pipestat_irq_handler(dev_priv, iir, pipe_stats); } while (0); + pmu_irq_stats(dev_priv, ret); + enable_rpm_wakeref_asserts(&dev_priv->runtime_pm); return ret; @@ -3796,6 +3826,8 @@ static irqreturn_t i915_irq_handler(int i915_pipestat_irq_handler(dev_priv, iir, pipe_stats); } while (0); + pmu_irq_stats(dev_priv, ret); + enable_rpm_wakeref_asserts(&dev_priv->runtime_pm); return ret; @@ -3941,6 +3973,8 @@ static irqreturn_t i965_irq_handler(int i965_pipestat_irq_handler(dev_priv, iir, pipe_stats); } while (0); + pmu_irq_stats(dev_priv, IRQ_HANDLED); + enable_rpm_wakeref_asserts(&dev_priv->runtime_pm); return ret; --- a/drivers/gpu/drm/i915/i915_pmu.c +++ b/drivers/gpu/drm/i915/i915_pmu.c @@ -423,22 
+423,6 @@ static enum hrtimer_restart i915_sample( return HRTIMER_RESTART; } -static u64 count_interrupts(struct drm_i915_private *i915) -{ - /* open-coded kstat_irqs() */ - struct irq_desc *desc = irq_to_desc(i915->drm.pdev->irq); - u64 sum = 0; - int cpu; - - if (!desc || !desc->kstat_irqs) - return 0; - - for_each_possible_cpu(cpu) - sum += *per_cpu_ptr(desc->kstat_irqs, cpu); - - return sum; -} - static void i915_pmu_event_destroy(struct perf_event *event) { struct drm_i915_private *i915 = @@ -581,7 +565,7 @@ static u64 __i915_pmu_event_read(struct USEC_PER_SEC /* to MHz */); break; case I915_PMU_INTERRUPTS: - val = cou
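The counting scheme boils down to a plain counter whose single writer and lockless readers are protected against store/load tearing; a minimal sketch with hypothetical names (not the i915 code itself):

/* Sketch: single-writer counter, read locklessly by the PMU side */
#include <linux/compiler.h>
#include <linux/types.h>

struct foo_pmu {
	unsigned long irq_count;
};

static void foo_irq_stats(struct foo_pmu *pmu)
{
	/* Single writer (the interrupt handler): prevent store tearing */
	WRITE_ONCE(pmu->irq_count, pmu->irq_count + 1);
}

static u64 foo_event_read(struct foo_pmu *pmu)
{
	/* Lockless reader: prevent load tearing */
	return READ_ONCE(pmu->irq_count);
}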
[patch 05/30] genirq: Annotate irq stats data races
Both the per cpu stats and the accumulated count are accessed lockless and can be concurrently modified. That's intentional and the stats are a rough estimate anyway. Annotate them with data_race(). Signed-off-by: Thomas Gleixner --- kernel/irq/irqdesc.c |4 ++-- kernel/irq/proc.c|5 +++-- 2 files changed, 5 insertions(+), 4 deletions(-) --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -943,10 +943,10 @@ unsigned int kstat_irqs(unsigned int irq if (!irq_settings_is_per_cpu_devid(desc) && !irq_settings_is_per_cpu(desc) && !irq_is_nmi(desc)) - return desc->tot_count; + return data_race(desc->tot_count); for_each_possible_cpu(cpu) - sum += *per_cpu_ptr(desc->kstat_irqs, cpu); + sum += data_race(*per_cpu_ptr(desc->kstat_irqs, cpu)); return sum; } --- a/kernel/irq/proc.c +++ b/kernel/irq/proc.c @@ -488,9 +488,10 @@ int show_interrupts(struct seq_file *p, if (!desc || irq_settings_is_hidden(desc)) goto outsparse; - if (desc->kstat_irqs) + if (desc->kstat_irqs) { for_each_online_cpu(j) - any_count |= *per_cpu_ptr(desc->kstat_irqs, j); + any_count |= data_race(*per_cpu_ptr(desc->kstat_irqs, j)); + } if ((!desc->action || irq_desc_is_chained(desc)) && !any_count) goto outsparse;
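data_race() merely documents that the lockless access is intentional so KCSAN does not flag it; it adds no ordering or atomicity. A minimal sketch with a hypothetical counter:

/* Sketch: intentional lockless read of a rough statistics counter */
#include <linux/compiler.h>

static unsigned int foo_stat;

static unsigned int foo_stat_snapshot(void)
{
	/*
	 * Concurrent updates may be missed in the snapshot. That is fine
	 * for a rough estimate, so tell KCSAN the race is intentional.
	 */
	return data_race(foo_stat);
}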
[patch 04/30] genirq: Provide irq_get_effective_affinity()
Provide an accessor to the effective interrupt affinity mask. Going to be used to replace open coded fiddling with the irq descriptor. Signed-off-by: Thomas Gleixner --- include/linux/irq.h |7 +++ 1 file changed, 7 insertions(+) --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -907,6 +907,13 @@ struct cpumask *irq_data_get_effective_a } #endif +static inline struct cpumask *irq_get_effective_affinity_mask(unsigned int irq) +{ + struct irq_data *d = irq_get_irq_data(irq); + + return d ? irq_data_get_effective_affinity_mask(d) : NULL; +} + unsigned int arch_dynirq_lower_bound(unsigned int from); int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
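A minimal sketch of the intended driver side usage (hypothetical names, not from this series), matching the mlx4/mlx5 conversions above: cache the mask pointer once at setup and test locality in the hot path. On architectures which reduce the mask, the effective variant reflects the real target CPUs:

/* Sketch: cache the mask once, test locality in the NAPI poll path */
#include <linux/cpumask.h>
#include <linux/irq.h>
#include <linux/smp.h>

struct foo_channel {
	const struct cpumask *aff_mask;
};

static void foo_channel_setup(struct foo_channel *c, unsigned int irq)
{
	/* The effective mask reflects the real target CPU(s) */
	c->aff_mask = irq_get_effective_affinity_mask(irq);
}

static bool foo_irq_is_local(struct foo_channel *c)
{
	return cpumask_test_cpu(smp_processor_id(), c->aff_mask);
}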
[patch 00/30] genirq: Treewide hunt for irq descriptor abuse and assorted fixes
A recent request to export kstat_irqs() pointed to a copy of the same in the i915 code, which made me look for further usage of irq descriptors in drivers. The usage in drivers ranges from creative to broken in all colours. irqdesc.h clearly says that this is core functionality and the fact C does not allow full encapsulation is not a justification to fiddle with it just because. It took us a lot of effort to make the core functionality provide what drivers need. If there is a shortcoming, it's not asked too much to talk to the relevant maintainers instead of going off and fiddling with the guts of interrupt descriptors and often enough without understanding lifetime and locking rules. As people insist on not respecting boundaries, this series cleans up the (ab)use and at the end removes the export of irq_to_desc() to make it at least harder. All legitimate users of this are built in. While at it I stumbled over some other oddities related to interrupt counting and cleaned them up as well. The series applies on top of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core and is also available from git: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git genirq Thanks, tglx --- arch/alpha/kernel/sys_jensen.c |2 arch/arm/kernel/smp.c|2 arch/parisc/kernel/irq.c |7 arch/s390/kernel/irq.c |2 arch/x86/kernel/topology.c |1 arch/arm64/kernel/smp.c |2 drivers/gpu/drm/i915/display/intel_lpe_audio.c |4 drivers/gpu/drm/i915/i915_irq.c | 34 +++ drivers/gpu/drm/i915/i915_pmu.c | 18 - drivers/gpu/drm/i915/i915_pmu.h |8 drivers/mfd/ab8500-debugfs.c | 16 - drivers/net/ethernet/mellanox/mlx4/en_cq.c |8 drivers/net/ethernet/mellanox/mlx4/en_rx.c |6 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |3 drivers/net/ethernet/mellanox/mlx5/core/en.h |2 drivers/net/ethernet/mellanox/mlx5/core/en_main.c|2 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c|6 drivers/ntb/msi.c|4 drivers/pci/controller/mobiveil/pcie-mobiveil-host.c |8 drivers/pci/controller/pcie-xilinx-nwl.c |8 drivers/pinctrl/nomadik/pinctrl-nomadik.c|3 drivers/xen/events/events_base.c | 172 +++ drivers/xen/evtchn.c | 34 --- include/linux/interrupt.h|1 include/linux/irq.h |7 include/linux/irqdesc.h | 40 +--- include/linux/kernel_stat.h |1 kernel/irq/irqdesc.c | 42 ++-- kernel/irq/manage.c | 37 kernel/irq/proc.c|5 30 files changed, 263 insertions(+), 222 deletions(-)
Re: [PATCH v5 0/9] "Task_isolation" mode
Pavel, On Sat, Dec 05 2020 at 21:40, Pavel Machek wrote: > So... what kind of guarantees does this aim to provide / what tasks it > is useful for? > > For real time response, we have other approaches. Depends on your requirements. Some problems are actually better solved with busy polling. See below. > If you want to guarantee performnace of the "isolated" task... I don't > see how that works. Other tasks on the system still compete for DRAM > bandwidth, caches, etc... Applications which want to run as undisturbed as possible. There is quite a range of those: - Hardware in the loop simulation is today often done with that crude approach of "offlining" a CPU and then instead of playing dead jumping to a preloaded bare metal executable. That's a horrible hack and impossible to debug, but gives them the results they need to achieve. These applications are well optimized vs. cache and memory foot print, so they don't worry about these things too much and they surely don't run on SMI and BIOS value add inflicted machines. Don't even think about waiting for an interrupt to achieve what these folks are doing. So no, there are problems which a general purpose realtime OS cannot solve ever. - HPC computations on large data sets. While the memory foot print is large the access patterns are cache optimized. The problem there is that any unnecessary IPI, tick interrupt or whatever nuisance is disturbing the carefully optimized cache usage and alone getting rid of the timer interrupt gained them measurable performance. Even very low single digit percentage of runtime saving is valuable for these folks because the compute time on such beasts is expensive. - Realtime guests in KVM. With posted interrupts and a fully populated host side page table there is no point in running host side interrupts or IPIs for random accounting or whatever purposes as they affect the latency in the guest. With all the side effects mitigated and a properly set up guest and host it is possible to get to a zero exit situation after the bootup phase which means pretty much matching bare metal behaviour. Yes, you can do that with e.g. Jailhouse as well, but you lose lots of the fancy things KVM provides. And people care about these not just because they are fancy. They care because their application scenario needs them. There are more reasons why people want to be able to get as much isolation from the OS as possible but at the same time have a sane execution environment, debugging, performance monitoring and the OS provided protection mechanisms instead of horrible hacks. Isolation makes sense for a range of applications and there is no reason why Linux should not support them. Thanks, tglx
Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
On Mon, Nov 23 2020 at 17:58, Alex Belits wrote: > From: Yuri Norov > > For nohz_full CPUs the desirable behavior is to receive interrupts > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's > obviously not desirable because it breaks isolation. > > This patch adds check for it. git grep 'This patch' Documentation/process/ > */ > void tick_nohz_full_kick_cpu(int cpu) > { > - if (!tick_nohz_full_cpu(cpu)) > + smp_rmb(); Undocumented smp_rmb() ... > + if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu)) > return; I still have to see a convincing argument why task isolation is special and not just a straight forward extension of NOHZ full cpu isolation. It's not special as much as you want it to be special. Thanks, tglx
Re: [PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code
Alex, On Mon, Nov 23 2020 at 17:57, Alex Belits wrote: > Kernel entry and exit functions for task isolation are added to context > tracking and common entry points. Common handling of pending work on exit > to userspace now processes isolation breaking, cleanup and start. Again: You fail to explain the rationale and just explain what the patch is doing. I can see what the patch is doing from the patch itself. > --- > include/linux/hardirq.h | 2 ++ > include/linux/sched.h | 2 ++ > kernel/context_tracking.c | 5 + > kernel/entry/common.c | 10 +- > kernel/irq/irqdesc.c | 5 + At least 3 different subsystems, which means this again failed to be split into separate patches. > extern void synchronize_irq(unsigned int irq); > @@ -115,6 +116,7 @@ extern void rcu_nmi_exit(void); > do {\ > lockdep_off(); \ > arch_nmi_enter(); \ > + task_isolation_kernel_enter(); \ Where is the explanation why this is safe and correct vs. this fragile code path? > @@ -1762,6 +1763,7 @@ extern char *__get_task_comm(char *to, size_t len, > struct task_struct *tsk); > #ifdef CONFIG_SMP > static __always_inline void scheduler_ipi(void) > { > + task_isolation_kernel_enter(); Why is the scheduler_ipi() special? Just because everything else cannot happen at all? Oh well... > #define CREATE_TRACE_POINTS > #include > @@ -100,6 +101,8 @@ void noinstr __context_tracking_enter(enum ctx_state > state) > __this_cpu_write(context_tracking.state, state); > } > context_tracking_recursion_exit(); > + > + task_isolation_exit_to_user_mode(); Why is this here at all and why is it outside of the recursion protection > } > EXPORT_SYMBOL_GPL(__context_tracking_enter); > > @@ -148,6 +151,8 @@ void noinstr __context_tracking_exit(enum ctx_state state) > if (!context_tracking_recursion_enter()) > return; > > + task_isolation_kernel_enter(); while this is inside? And why does the scheduler_ipi() on x86 call this twice? Just because? > if (__this_cpu_read(context_tracking.state) == state) { > if (__this_cpu_read(context_tracking.active)) { > /* > diff --git a/kernel/entry/common.c b/kernel/entry/common.c > static void exit_to_user_mode_prepare(struct pt_regs *regs) > { > - unsigned long ti_work = READ_ONCE(current_thread_info()->flags); > + unsigned long ti_work; > > lockdep_assert_irqs_disabled(); > > + task_isolation_before_pending_work_check(); > + > + ti_work = READ_ONCE(current_thread_info()->flags); > + > if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) > ti_work = exit_to_user_mode_loop(regs, ti_work); > > + if (unlikely(ti_work & _TIF_TASK_ISOLATION)) > + task_isolation_start(); Where is the explanation of this change? Aside of that how does anything of this compile on x86 at all? Answer: It does not ... Stop this frenzy right now. It's going nowhere and all you achieve is to make people more grumpy than they are already. Thanks, tglx
Re: [PATCH v5 3/9] task_isolation: userspace hard isolation from kernel
Alex, On Mon, Nov 23 2020 at 17:56, Alex Belits wrote: > .../admin-guide/kernel-parameters.txt | 6 + > drivers/base/cpu.c| 23 + > include/linux/hrtimer.h | 4 + > include/linux/isolation.h | 326 > include/linux/sched.h | 5 + > include/linux/tick.h | 3 + > include/uapi/linux/prctl.h| 6 + > init/Kconfig | 27 + > kernel/Makefile | 2 + > kernel/isolation.c| 714 ++ > kernel/signal.c | 2 + > kernel/sys.c | 6 + > kernel/time/hrtimer.c | 27 + > kernel/time/tick-sched.c | 18 + I asked you before to split this up into bits and pieces and argue and justify each change. Throwing this wholesale over the fence is going nowhere. It's not reviewable at all. Aside of that ignoring review comments is a sure path to make yourself ignored: > +/* > + * Logging > + */ > +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, > ...); > + > +#define pr_task_isol_emerg(cpu, fmt, ...)\ > + task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__) The comments various people made about that are not going away and none of this is going near anything I'm responsible for unless you provide these independent of the rest and with a reasonable justification why you can't use any other existing mechanism or extend it for your use case. Thanks, tglx
Re: [PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function
On Mon, Nov 23 2020 at 17:56, Alex Belits wrote: > This function checks to see if a vmstat worker is not running, > and the vmstat diffs don't require an update. The function is > called from the task-isolation code to see if we need to > actually do some work to quiet vmstat. A changelog has to explain _WHY_ this change is necessary and not _WHAT_ the patch is doing. Thanks, tglx
Re: [PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function
Alex, On Mon, Nov 23 2020 at 17:56, Alex Belits wrote: why are you insisting on adding 'task_isolation: ' as prefix to every single patch? That's wrong as I explained before. The prefix denotes the affected subsystem and 'task_isolation' is _NOT_ a subsystem. It's the project name you are using but the affected code belongs to the memory management subsystem and if you run git log mm/vmstat.c you might get a hint what the proper prefix is, i.e. 'mm/vmstat: ' > In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter") > the quiet_vmstat() function became asynchronous, in the sense that > the vmstat work was still scheduled to run on the core when the > function returned. For task isolation, we need a synchronous This changelog is useless because how should someone not familiar with the term 'task isolation' figure out what that means? It's not the reviewers job to figure that out. Again: Go read and adhere to Documentation/process/* Aside of that your patches are CR/LF inflicted. Please fix your work flow and tools. Thanks, tglx
Re: [PATCH RFC PKS/PMEM 05/58] kmap: Introduce k[un]map_thread
On Mon, Nov 09 2020 at 20:59, Ira Weiny wrote: > On Tue, Nov 10, 2020 at 02:13:56AM +0100, Thomas Gleixner wrote: > Also, we can convert the new memcpy_*_page() calls to kmap_local() as well. > [For now my patch just uses kmap_atomic().] > > I've not looked at all of the patches in your latest version. Have you > included converting any of the kmap() call sites? I thought you were more > focused on converting the kmap_atomic() to kmap_local()? I did not touch any of those yet, but it's a logical consequence to convert all kmap() instances which are _not_ creating a global mapping over to it. Thanks, tglx
Re: [PATCH RFC PKS/PMEM 05/58] kmap: Introduce k[un]map_thread
Ira, On Fri, Oct 09 2020 at 12:49, ira weiny wrote: > From: Ira Weiny > > To correctly support the semantics of kmap() with Kernel protection keys > (PKS), kmap() may be required to set the protections on multiple > processors (globally). Enabling PKS globally can be very expensive > depending on the requested operation. Furthermore, enabling a domain > globally reduces the protection afforded by PKS. > > Most kmap() (Aprox 209 of 229) callers use the map within a single thread and > have no need for the protection domain to be enabled globally. However, the > remaining callers do not follow this pattern and, as best I can tell, expect > the mapping to be 'global' and available to any thread who may access the > mapping.[1] > > We don't anticipate global mappings to pmem, however in general there is a > danger in changing the semantics of kmap(). Effectively, this would cause an > unresolved page fault with little to no information about why the failure > occurred. > > To resolve this a number of options were considered. > > 1) Attempt to change all the thread local kmap() calls to kmap_atomic()[2] > 2) Introduce a flags parameter to kmap() to indicate if the mapping should be >global or not > 3) Change ~20 call sites to 'kmap_global()' to indicate that they require a >global enablement of the pages. > 4) Change ~209 call sites to 'kmap_thread()' to indicate that the mapping is > to >be used within that thread of execution only > > Option 1 is simply not feasible. Option 2 would require all of the call sites > of kmap() to change. Option 3 seems like a good minimal change but there is a > danger that new code may miss the semantic change of kmap() and not get the > behavior the developer intended. Therefore, #4 was chosen. There is Option #5: Convert the thread local kmap() invocations to the proposed kmap_local() interface which is coming along [1]. That solves a couple of issues: 1) It relieves the current kmap_atomic() usage sites from the implict pagefault/preempt disable semantics which apply even when CONFIG_HIGHMEM is disabled. kmap_local() still can be invoked from atomic context. 2) Due to #1 it allows to replace the conditional usage of kmap() and kmap_atomic() for purely thread local mappings. 3) It puts the burden on the HIGHMEM inflicted systems 4) It is actually more efficient for most of the pure thread local use cases on HIGHMEM inflicted systems because it avoids the overhead of the global lock and the potential kmap slot exhaustion. A potential preemption will be more expensive, but that's not really the case we want to optimize for. 5) It solves the RT issue vs. kmap_atomic() So instead of creating yet another variety of kmap() which is just scratching the particular PKRS itch, can we please consolidate all of that on the wider reaching kmap_local() approach? Thanks, tglx [1] https://lore.kernel.org/lkml/20201103092712.714480...@linutronix.de/
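For reference, a minimal sketch of what such a thread local conversion looks like, assuming the kmap_local_page()/kunmap_local() interface from the referenced series. The mapping is strictly thread local, can nest and does not disable pagefaults or preemption by itself:

/* Sketch: strictly thread local temporary mapping of a page */
#include <linux/highmem.h>
#include <linux/string.h>

static void foo_copy_from_page(void *dst, struct page *page,
			       size_t offset, size_t len)
{
	char *vaddr = kmap_local_page(page);

	memcpy(dst, vaddr + offset, len);
	kunmap_local(vaddr);
}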
Re: [PATCH net] r8169: fix operation under forced interrupt threading
On Thu, Oct 29 2020 at 11:19, Heiner Kallweit wrote: > On 29.10.2020 10:42, Thomas Gleixner wrote: > Correct, just that the legacy PCI interrupt scenario doesn't affect old > systems/devices only. Users may run the system with nomsi for > whatever reason and we need to be prepared. > > We could add handling for (pcidev->msi_enabled || pcidev->msix_enabled), > but this would look somewhat hacky to me. Well, there are quite some drivers which differentiate between MSI and legacy interrupts, most of them because MSI allows them to split handlers. So it's not completely insane. Thanks, tglx
Re: [PATCH net] r8169: fix operation under forced interrupt threading
On Thu, Oct 29 2020 at 09:42, Heiner Kallweit wrote: > On 29.10.2020 00:29, Jakub Kicinski wrote: >> Other handles may take spin_locks, which will sleep on RT. >> >> I guess we may need to switch away from the _irqoff() variant for >> drivers with IRQF_SHARED after all :( >> > Right. Unfortunately that's a large number of drivers, > e.g. pci_request_irq() sets IRQF_SHARED in general. IRQF_SHARED is not the problem. It only becomes a problem when the interrupt is actually shared which is only the case with the legacy PCI interrupt. MSI[X] is not affected at all. > But at least for now there doesn't seem to be a better way to deal > with the challenges imposed by forced threading and shared irqs. We still can do the static key trick, though it's admittedly hacky. Thanks, tglx
Re: [PATCH net] r8169: fix operation under forced interrupt threading
On Wed, Oct 28 2020 at 13:17, Heiner Kallweit wrote: > On 28.10.2020 12:43, Serge Belyshev wrote: >>> For several network drivers it was reported that using >>> __napi_schedule_irqoff() is unsafe with forced threading. One way to >>> fix this is switching back to __napi_schedule, but then we lose the >>> benefit of the irqoff version in general. As stated by Eric it doesn't >>> make sense to make the minimal hard irq handlers in drivers using NAPI >>> a thread. Therefore ensure that the hard irq handler is never >>> thread-ified. >> Hi! This patch actually breaks r8169 with threadirqs on an old box >> where it was working before: >> >> [0.00] DMI: Gigabyte Technology Co., Ltd. >> GA-MA790FX-DQ6/GA-MA790FX-DQ6, BIOS F7g 07/19/2010 >> ... >> [1.072676] r8169 :02:00.0 eth0: RTL8168b/8111b, 00:1a:4d:5d:6b:c3, >> XID 380, IRQ 18 >> ... >> [8.850099] genirq: Flags mismatch irq 18. 00010080 (eth0) vs. 2080 >> (ahci[:05:00.0]) >> >> (error is reported to userspace, interface failed to bring up). >> Reverting the patch fixes the problem. >> > Thanks for the report. On this old chip version MSI is unreliable, > therefore r8169 falls back to a PCI legacy interrupt. On your system > this PCI legacy interrupt seems to be shared between network and > disk. Then the IRQ core tries to threadify the disk interrupt > (setting IRQF_ONESHOT), whilst the network interrupt doesn't have > this flag set. This results in the flag mismatch error. > > Maybe, if one source of a shared interrupt doesn't allow forced > threading, this should be applied to the other sources too. > But this would require a change in the IRQ core, therefore > +Thomas to get his opinion on the issue. It's pretty simple. There is no way to fix that at the core level. Shared interrupts suck and to make them work halfway correctly the sharing devices must have matching and non-competing flags. Especially for the threaded vs. non-threaded case. Shared interrupts are level triggered. So you have a conflict of interests: The threaded handler requires that the interrupt line is masked until the thread has completed, otherwise the system will suffer an interrupt storm. The non-threaded one wants it to be unmasked after it has finished. Thanks, tglx
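In terms of request_irq() that means both users of the shared line have to agree on the threading model. A minimal sketch (hypothetical drivers) of the variant which works for a shared, level triggered line with threaded handlers, i.e. both sides requesting with IRQF_SHARED | IRQF_ONESHOT:

/* Sketch: both sharers of a level triggered line use the same model,
 * i.e. threaded handlers with IRQF_ONESHOT, so the line stays masked
 * until the thread which handled it has finished.
 */
#include <linux/interrupt.h>

static irqreturn_t foo_thread_fn(int irq, void *dev_id)
{
	/* Runs in task context with the interrupt line still masked */
	return IRQ_HANDLED;
}

static int foo_request_shared(unsigned int irq, void *dev_a, void *dev_b)
{
	int ret;

	ret = request_threaded_irq(irq, NULL, foo_thread_fn,
				   IRQF_SHARED | IRQF_ONESHOT, "foo-a", dev_a);
	if (ret)
		return ret;

	/* Same sharing relevant flags, so this request succeeds as well */
	return request_threaded_irq(irq, NULL, foo_thread_fn,
				    IRQF_SHARED | IRQF_ONESHOT, "foo-b", dev_b);
}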
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 16:08, Jacob Keller wrote: > On 10/26/2020 3:49 PM, Thomas Gleixner wrote: >> On Mon, Oct 26 2020 at 18:22, Nitesh Narayan Lal wrote: >>> I don't think there is currently a way to control the >>> enablement/disablement of >>> interrupts from the userspace. >> >> You cannot just disable the interrupt. You need to make sure that the >> associated queue is shutdown or quiesced _before_ the interrupt is shut >> down. > > Could this be handled with a callback to the driver/hw? I know Intel HW > should support this type of quiesce/shutdown. We can't have a callback from the interrupt shutdown code as you have to wait for the queue to drain packets in flight. Something like this:

   mark queue as going down (no more tx queueing)
   tell hardware not to route RX packets to it
   consume pending RX
   wait for already queued TX packets to be sent

Look what the block people did. They have a common multi-instance hotplug state and they register each context (queue) as an instance. The hotplug core invokes the corresponding callbacks when bringing a CPU up or when shutting it down. Thanks, tglx
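Roughly what the block layer does with the multi-instance hotplug state, as a sketch with a hypothetical queue type (not the actual blk-mq code):

/* Sketch: register each queue as a hotplug instance so the core calls
 * back before the CPU (and with it a managed interrupt) goes away.
 */
#include <linux/cpuhotplug.h>
#include <linux/list.h>

struct foo_queue {
	struct hlist_node	cpuhp_node;
	/* queue state ... */
};

static enum cpuhp_state foo_cpuhp_state;

static int foo_queue_offline(unsigned int cpu, struct hlist_node *node)
{
	struct foo_queue *q = hlist_entry(node, struct foo_queue, cpuhp_node);

	/* Stop queueing, drain in flight work, quiesce the queue */
	(void)q;
	return 0;
}

static int foo_cpuhp_init(void)
{
	int ret;

	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "foo/queue:online",
				      NULL, foo_queue_offline);
	if (ret < 0)
		return ret;
	foo_cpuhp_state = ret;
	return 0;
}

static int foo_queue_register(struct foo_queue *q)
{
	return cpuhp_state_add_instance(foo_cpuhp_state, &q->cpuhp_node);
}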
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Tue, Oct 27 2020 at 08:47, Marcelo Tosatti wrote: > On Mon, Oct 26, 2020 at 06:22:29PM -0400, Nitesh Narayan Lal wrote: > However, if per-CPU interrupts are not disabled, then the (for example) > network device is free to include the CPU in its list of destinations. > Which would require one to say, configure RPS (or whatever mechanism > is distributing interrupts). And why is that a problem? If that's possible then you can prevent getting RX interrupts already today. > Hum, it would feel safer (rather than trust the #1 rule to be valid > in all cases) to ask the driver to disable the interrupt (after shutting > down queue) for that particular CPU. > > BTW, Thomas, software is free to configure a particular MSI-X interrupt > to point to any CPU: > > 10.11 MESSAGE SIGNALLED INTERRUPTS I know how MSI works :) > So taking the example where computation happens while isolated and later > stored via block interface, aren't we restricting the usage scenarios > by enforcing the "per-CPU queue has interrupt pointing to owner CPU" > rule? No. For block this is the ideal configuration (think locality) and it prevents vector exhaustion. If you make these interrupts freely routable then you bring back the vector exhaustion problem right away. Now we already established that networking has different requirements, so you have to come up with a different solution for it which allows to work for all use cases. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 18:22, Nitesh Narayan Lal wrote: > On 10/26/20 5:50 PM, Thomas Gleixner wrote: >> But I still think that for curing that isolation stuff we want at least >> some information from the driver. Alternative solution would be to grant >> the allocation of interrupts and queues and have some sysfs knob to shut >> down queues at runtime. If that shutdown results in releasing the queue >> interrupt (via free_irq()) then the vector exhaustion problem goes away. > > I think this is close to what I and Marcelo were discussing earlier today > privately. > > I don't think there is currently a way to control the enablement/disablement > of > interrupts from the userspace. You cannot just disable the interrupt. You need to make sure that the associated queue is shutdown or quiesced _before_ the interrupt is shut down. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 15:13, Jakub Kicinski wrote: > On Mon, 26 Oct 2020 22:50:45 +0100 Thomas Gleixner wrote: >> But I still think that for curing that isolation stuff we want at least >> some information from the driver. Alternative solution would be to grant >> the allocation of interrupts and queues and have some sysfs knob to shut >> down queues at runtime. If that shutdown results in releasing the queue >> interrupt (via free_irq()) then the vector exhaustion problem goes away. >> >> Needs more thought and information (for network oblivious folks like >> me). > > One piece of information that may be useful is that even tho the RX > packets may be spread semi-randomly the user space can still control > which queues are included in the mechanism. There is an indirection > table in the HW which allows to weigh queues differently, or exclude > selected queues from the spreading. Other mechanisms exist to filter > flows onto specific queues. > > IOW just because a core has an queue/interrupt does not mean that > interrupt will ever fire, provided its excluded from RSS. > > Another piece is that by default we suggest drivers allocate 8 RX > queues, and online_cpus TX queues. The number of queues can be > independently controlled via ethtool -L. Drivers which can't support > separate queues will default to online_cpus queue pairs, and let > ethtool -L only set the "combined" parameter. > > There are drivers which always allocate online_cpus interrupts, > and then some of them will go unused if #qs < #cpus. Thanks for the enlightenment. > My unpopular opinion is that for networking devices all the heuristics > we may come up with are going to be a dead end. I agree. Heuristics suck. > We need an explicit API to allow users placing queues on cores, and > use managed IRQs for data queues. (I'm assuming that managed IRQs will > let us reliably map a MSI-X vector to a core :)) Yes, they allow you to do that. That will need some tweaks to the way they work today (coming from the strict block mq semantics). You also need to be aware that managed irqs also have strict semantics vs. CPU hotplug. If the last CPU in the managed affinity set goes down then the interrupt is shut down by the irq core, which means that you need to quiesce the associated queue before that happens. When the first CPU comes online again the interrupt is reenabled, so the queue should be able to handle it or must have ensured that the device does not raise one before it is able to do so. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote: > On 10/26/2020 1:11 PM, Thomas Gleixner wrote: >> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote: >>> Are there drivers which use more than one interrupt per queue? I know >>> drivers have multiple management interrupts.. and I guess some drivers >>> do combined 1 interrupt per pair of Tx/Rx.. It's also plausible to >>> have multiple queues for one interrupt .. I'm not sure how a single >>> queue with multiple interrupts would work though. >> >> For block there is always one interrupt per queue. Some network drivers >> seem to have separate RX and TX interrupts per queue. > That's true when thinking of Tx and Rx as a single queue. Another way to > think about it is "one rx queue" and "one tx queue" each with their own > interrupt... > > Even if there are devices which force there to be exactly queue pairs, > you could still think of them as separate entities? Interesting thought. But as Jakub explained, networking queues are fundamentally different from block queues on the RX side. For block, the request issued on queue X will raise the completion interrupt on queue X. For networking, the TX side will raise the TX interrupt on the queue on which the packet was queued, obviously, or should I say hopefully. :) But incoming packets will be directed to some receive queue based on a hash or whatever crystal ball logic the firmware decided to implement. Which makes this not really suitable for the managed interrupt and spreading approach which is used by block-mq. Hrm... But I still think that for curing that isolation stuff we want at least some information from the driver. Alternative solution would be to grant the allocation of interrupts and queues and have some sysfs knob to shut down queues at runtime. If that shutdown results in releasing the queue interrupt (via free_irq()) then the vector exhaustion problem goes away. Needs more thought and information (for network oblivious folks like me). Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote: > On 10/26/2020 12:00 PM, Thomas Gleixner wrote: >> How does userspace know about the driver internals? Number of management >> interrupts, optimal number of interrupts per queue? > > I guess this is the problem solved in part by the queue management work > that would make queues a thing that userspace is aware of. > > Are there drivers which use more than one interrupt per queue? I know > drivers have multiple management interrupts.. and I guess some drivers > do combined 1 interrupt per pair of Tx/Rx.. It's also plausible to > have multiple queues for one interrupt .. I'm not sure how a single > queue with multiple interrupts would work though. For block there is always one interrupt per queue. Some network drivers seem to have separate RX and TX interrupts per queue. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 14:30, Marcelo Tosatti wrote: > On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote: >> So without information from the driver which tells what the best number >> of interrupts is with a reduced number of CPUs, this cutoff will cause >> more problems than it solves. Regressions guaranteed. > > One might want to move from one interrupt per isolated app core > to zero, or vice versa. It seems that "best number of interrupts > is with reduced number of CPUs" information, is therefore in userspace, > not in driver... How does userspace know about the driver internals? Number of management interrupts, optimal number of interrupts per queue? >> Managed interrupts base their interrupt allocation and spreading on >> information which is handed in by the individual driver and not on crude >> assumptions. They are not imposing restrictions on the use case. >> >> It's perfectly fine for isolated work to save a data set to disk after >> computation has finished and that just works with the per-cpu I/O queue >> which is otherwise completely silent. > > Userspace could only change the mask of interrupts which are not > triggered by requests from the local CPU (admin, error, mgmt, etc), > to avoid the vector exhaustion problem. > > However, there is no explicit way for userspace to know that, as far as > i know. >
> 130:  34845      0      0      0      0      0      0      0  IR-PCI-MSI 33554433-edge  nvme0q1
> 131:      0  27062      0      0      0      0      0      0  IR-PCI-MSI 33554434-edge  nvme0q2
> 132:      0      0  24393      0      0      0      0      0  IR-PCI-MSI 33554435-edge  nvme0q3
> 133:      0      0      0  24313      0      0      0      0  IR-PCI-MSI 33554436-edge  nvme0q4
> 134:      0      0      0      0  20608      0      0      0  IR-PCI-MSI 33554437-edge  nvme0q5
> 135:      0      0      0      0      0  22163      0      0  IR-PCI-MSI 33554438-edge  nvme0q6
> 136:      0      0      0      0      0      0  23020      0  IR-PCI-MSI 33554439-edge  nvme0q7
> 137:      0      0      0      0      0      0      0  24285  IR-PCI-MSI 33554440-edge  nvme0q8
> > Can that be retrieved from PCI-MSI information, or drivers > have to inform this? The driver should use a different name for the admin queues. Thanks, tglx
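To illustrate the naming point: as long as the per-queue vectors carry a recognizable name (nvme0q1 ... nvme0q8 above) and the admin/management vectors get distinct names, user space can already separate the two classes by parsing /proc/interrupts. A rough, runnable sketch; the classification heuristic is purely illustrative:

	/* Rough illustration only: split device vectors by their action name. */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		FILE *f = fopen("/proc/interrupts", "r");
		char line[4096];

		if (!f) {
			perror("/proc/interrupts");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			char *name = strstr(line, "nvme");	/* device prefix to look for */
			unsigned int q;

			if (!name)
				continue;
			/* Heuristic: "<dev><N>q<M>" looks like a per-queue vector. */
			if (sscanf(name, "nvme%*uq%u", &q) == 1)
				printf("queue vector:       %s", line);
			else
				printf("admin/other vector: %s", line);
		}
		fclose(f);
		return 0;
	}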
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 09:35, Nitesh Narayan Lal wrote: > On 10/23/20 5:00 PM, Thomas Gleixner wrote: >> An isolated setup, which I'm familiar with, has two housekeeping >> CPUs. So far I restricted the number of network queues with a module >> argument to two, which allocates two management interrupts for the >> device and two interrupts (RX/TX) per queue, i.e. a total of six. > > Does it somehow take num_online_cpus() into consideration while deciding > the number of interrupts to create? No, I just tell it to create two queues :) >> So without information from the driver which tells what the best number >> of interrupts is with a reduced number of CPUs, this cutoff will cause >> more problems than it solves. Regressions guaranteed. > > Indeed. > I think one commonality among the drivers at the moment is the usage of > num_online_cpus() to determine the vectors to create. > > So, maybe instead of doing this kind of restrictions in a generic level > API, it will make more sense to do this on a per-device basis by replacing > the number of online CPUs with the housekeeping CPUs? > > This is what I have done in the i40e patch. > But that still sounds hackish and will impact the performance. You want an interface which allows the driver to say: I need N interrupts for general management and ideally M interrupts per queue. This is similar to the way drivers tell the core code about their requirements for managed interrupts for the spreading calculation. >> Managed interrupts base their interrupt allocation and spreading on >> information which is handed in by the individual driver and not on crude >> assumptions. They are not imposing restrictions on the use case. > > Right, FWIU it is irq_do_set_affinity that prevents the spreading of > managed interrupts to isolated CPUs if HK_FLAG_MANAGED_IRQ is enabled, > isn't? No. Spreading takes possible CPUs into account. HK_FLAG_MANAGED_IRQ does not influence spreading at all. It only handles the case where an interrupt is affine to more than one CPU and the resulting affinity mask spans both housekeeping and isolated CPUs. It then steers the interrupt to the housekeeping CPUs (as long as there is one online). Thanks, tglx
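A toy model of that steering behaviour (not the kernel's cpumask code): when the affinity mask spans both sets, the housekeeping subset wins; the full mask is only used if no housekeeping CPU is in it. The real code additionally requires a housekeeping CPU to be online:

	/* Toy model of "steer to housekeeping CPUs when the mask spans both". */
	#include <stdint.h>
	#include <stdio.h>

	static uint64_t pick_effective_mask(uint64_t affinity, uint64_t housekeeping)
	{
		uint64_t hk = affinity & housekeeping;

		/* Use the housekeeping subset if there is one, else the full mask. */
		return hk ? hk : affinity;
	}

	int main(void)
	{
		uint64_t housekeeping = 0x3;	/* CPUs 0-1 are housekeeping      */
		uint64_t affinity = 0xf;	/* interrupt allowed on CPUs 0-3  */

		printf("effective mask: 0x%llx\n",
		       (unsigned long long)pick_effective_mask(affinity, housekeeping));
		/* Prints 0x3: the isolated CPUs 2-3 are not used for this interrupt. */
		return 0;
	}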
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote: > On 10/23/20 4:58 AM, Peter Zijlstra wrote: >> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote: >> So shouldn't we then fix the drivers / interface first, to get rid of >> this inconsistency? >> > Considering we agree that excess vector is a problem that needs to be > solved across all the drivers and that you are comfortable with the other > three patches in the set. If I may suggest the following: > > - We can pick those three patches for now, as that will at least fix a > driver that is currently impacting RT workloads. Is that a fair > expectation? No. Blindly reducing the maximum vectors to the number of housekeeping CPUs is patently wrong. The PCI core _cannot_ just willy-nilly decide what the right number of interrupts for this situation is. Many of these drivers need more than just queue interrupts (admin, error interrupts), and some operate best with separate RX/TX interrupts per queue. They all can "work" with a single PCI interrupt of course, but the price you pay is performance. An isolated setup, which I'm familiar with, has two housekeeping CPUs. So far I restricted the number of network queues with a module argument to two, which allocates two management interrupts for the device and two interrupts (RX/TX) per queue, i.e. a total of six. Now I reduced the number of available interrupts to two according to your hack, which makes it use one queue RX/TX combined and one management interrupt. Guess what happens? Network performance tanks to the point that it breaks a carefully crafted setup. The same applies to a device which is application specific and wants one channel including an interrupt per isolated application core. Today I can isolate 8 out of 12 CPUs and let the device create 8 channels and set one interrupt and channel affine to each isolated CPU. With your hack, I get only 4 interrupts and channels. Fail! You cannot declare that all this is perfectly fine, just because it does not matter for your particular use case. So without information from the driver which tells what the best number of interrupts is with a reduced number of CPUs, this cutoff will cause more problems than it solves. Regressions guaranteed. Managed interrupts base their interrupt allocation and spreading on information which is handed in by the individual driver and not on crude assumptions. They are not imposing restrictions on the use case. It's perfectly fine for isolated work to save a data set to disk after computation has finished and that just works with the per-cpu I/O queue which is otherwise completely silent. All isolated workers can do the same in parallel without trampling on each other's toes by competing for a reduced number of queues which are affine to the housekeeper CPUs. Unfortunately network multi-queue is substantially different from block multi-queue (as I learned in this conversation), so the concept cannot be applied one-to-one to networking as is. But there are certainly parts of it which can be reused. This needs a lot more thought than just these crude hacks. Especially under the aspect that there are talks about making isolation runtime switchable. Are you going to rmmod/insmod the i40e network driver to do so? That's going to work fine if you do that reconfiguration over network... Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Thu, Oct 22 2020 at 09:28, Marcelo Tosatti wrote: > On Wed, Oct 21, 2020 at 10:25:48PM +0200, Thomas Gleixner wrote: >> The right answer to this is to utilize managed interrupts and have >> according logic in your network driver to handle CPU hotplug. When a CPU >> goes down, then the queue which is associated to that CPU is quiesced >> and the interrupt core shuts down the relevant interrupt instead of >> moving it to an online CPU (which causes the whole vector exhaustion >> problem on x86). When the CPU comes online again, then the interrupt is >> reenabled in the core and the driver reactivates the queue. > > Aha... But it would be necessary to do that from userspace (for runtime > isolate/unisolate). For anything which uses managed interrupts this is a non-problem and userspace has absolutely no business with it. Isolation does not shut down queues, at least not the block multi-queue ones which are only active when I/O is issued from that isolated CPU. So transitioning out of isolation requires no action at all. Transitioning in or changing the housekeeping mask needs some trivial tweak to handle the case where there is an overlap in the cpuset of a queue (housekeeping and isolated). This is handled already for setup and affinity changes, but of course not for runtime isolation mask changes, but that's a trivial thing to do. What's more interesting is how to deal with the network problem where there is no guarantee that the "response" ends up on the same queue as the "request" which is what the block people rely on. And that problem is not really an interrupt affinity problem in the first place. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Wed, Oct 21 2020 at 17:02, Jakub Kicinski wrote: > On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote: >> The right answer to this is to utilize managed interrupts and have >> according logic in your network driver to handle CPU hotplug. When a CPU >> goes down, then the queue which is associated to that CPU is quiesced >> and the interrupt core shuts down the relevant interrupt instead of >> moving it to an online CPU (which causes the whole vector exhaustion >> problem on x86). When the CPU comes online again, then the interrupt is >> reenabled in the core and the driver reactivates the queue. > > I think Mellanox folks made some forays into managed irqs, but I don't > remember/can't find the details now. > > For networking the locality / queue per core does not always work, > since the incoming traffic is usually spread based on a hash. Many That makes it problematic and is fundamentally different from block I/O. > applications perform better when network processing is done on a small > subset of CPUs, and application doesn't get interrupted every 100us. > So we do need extra user control here. Ok. > We have a bit of a uAPI problem since people had grown to depend on > IRQ == queue == NAPI to configure their systems. "The right way" out > would be a proper API which allows associating queues with CPUs rather > than IRQs, then we can use managed IRQs and solve many other problems. > > Such new API has been in the works / discussions for a while now. If there is anything which needs to be done/extended on the irq side please let me know. Thanks tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote: > On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote: >> However, IMHO we would still need a logic to prevent the devices from >> creating excess vectors. > > Managed interrupts are preventing exactly that by pinning the interrupts > and queues to one or a set of CPUs, which prevents vector exhaustion on > CPU hotplug. > > Non-managed, yes that is and always was a problem. One of the reasons > why managed interrupts exist. But why is this only a problem for isolation? The very same problem exists vs. CPU hotplug and therefore hibernation. On x86 we have at max. 204 vectors available for device interrupts per CPU. So assuming the only device interrupts in use are for networking, any machine which has more than 204 network interrupts (queues, aux ...) active cannot hibernate. Aside of that it's silly to have multiple queues targeted at a single CPU in case of hotplug. And that's not a theoretical problem. Some power management schemes shut down sockets when the utilization of a system is low enough, e.g. outside of working hours. The whole point of multi-queue is to have locality so that traffic from a CPU goes through the CPU local queue. What's the point of having two or more queues on a CPU in case of hotplug? The right answer to this is to utilize managed interrupts and have according logic in your network driver to handle CPU hotplug. When a CPU goes down, then the queue which is associated to that CPU is quiesced and the interrupt core shuts down the relevant interrupt instead of moving it to an online CPU (which causes the whole vector exhaustion problem on x86). When the CPU comes online again, then the interrupt is reenabled in the core and the driver reactivates the queue. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote: > On 10/20/20 10:16 AM, Thomas Gleixner wrote: >> With the above change this will result >> >>1 general interrupt which is free movable by user space >>1 managed interrupts (possible affinity to all 16 CPUs, but routed >> to housekeeping CPU as long as there is one online) >> >> So the device is now limited to a single queue which also affects the >> housekeeping CPUs because now they have to share a single queue. >> >> With larger machines this gets even worse. > > Yes, the change can impact the performance, however, if we don't do that we > may have a latency impact instead. Specifically, on larger systems where > most of the CPUs are isolated as we will definitely fail in moving all of the > IRQs away from the isolated CPUs to the housekeeping. For non managed interrupts I agree. >> So no. This needs way more thought for managed interrupts and you cannot >> do that at the PCI layer. > > Maybe we should not be doing anything in the case of managed IRQs as they > are anyways pinned to the housekeeping CPUs as long as we have the > 'managed_irq' option included in the kernel cmdline. Exactly. For the PCI side this vector limiting has to be restricted to the non managed case. >> Only the affinity spreading mechanism can do >> the right thing here. > > I can definitely explore this further. > > However, IMHO we would still need a logic to prevent the devices from > creating excess vectors. Managed interrupts are preventing exactly that by pinning the interrupts and queues to one or a set of CPUs, which prevents vector exhaustion on CPU hotplug. Non-managed, yes that is and always was a problem. One of the reasons why managed interrupts exist. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Sep 28 2020 at 14:35, Nitesh Narayan Lal wrote:
>
> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +	/*
> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> +	 * to the housekeeping CPUs from the userspace by changing their
> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +	 * running out of IRQ vectors.
> +	 */

This is not true for managed interrupts. The interrupt affinity of those cannot be changed by user space.

> +	if (hk_cpus < num_online_cpus()) {
> +		if (hk_cpus < min_vecs)
> +			max_vecs = min_vecs;
> +		else if (hk_cpus < max_vecs)
> +			max_vecs = hk_cpus;
> +	}

So now with that assume a 16 core machine (HT off for simplicity) and 17 requested interrupts (1 general, 16 queues). Managed interrupts will allocate:

  - 1 general interrupt which is freely movable by user space
  - 16 managed interrupts for queues (one per CPU)

This allows the driver to have 16 queues, i.e. one queue per CPU. These interrupts are only used when an application on a CPU issues I/O. With the above change this will result in:

  - 1 general interrupt which is freely movable by user space
  - 1 managed interrupt (possible affinity to all 16 CPUs, but routed to a housekeeping CPU as long as there is one online)

So the device is now limited to a single queue which also affects the housekeeping CPUs because now they have to share a single queue. With larger machines this gets even worse. So no. This needs way more thought for managed interrupts and you cannot do that at the PCI layer. Only the affinity spreading mechanism can do the right thing here. Thanks, tglx
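A tiny runnable model of the quoted clamping logic, fed with the numbers from the example above; min_vecs = 1 and two housekeeping CPUs are assumptions chosen to reproduce the described effect:

	/* Toy model of the quoted max_vecs clamping, not the actual PCI code. */
	#include <stdio.h>

	static unsigned int clamp_vecs(unsigned int min_vecs, unsigned int max_vecs,
				       unsigned int hk_cpus, unsigned int online_cpus)
	{
		if (hk_cpus < online_cpus) {
			if (hk_cpus < min_vecs)
				max_vecs = min_vecs;
			else if (hk_cpus < max_vecs)
				max_vecs = hk_cpus;
		}
		return max_vecs;
	}

	int main(void)
	{
		/* 16 core machine, 17 requested vectors (1 general + 16 queues),
		 * 2 housekeeping CPUs assumed: the request collapses to 2 vectors,
		 * i.e. 1 general + 1 queue interrupt for the whole device.
		 */
		printf("granted max_vecs: %u\n", clamp_vecs(1, 17, 2, 16));
		return 0;
	}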
Re: Remove __napi_schedule_irqoff?
On Sun, Oct 18 2020 at 10:19, Jakub Kicinski wrote: > On Sun, 18 Oct 2020 10:20:41 +0200 Heiner Kallweit wrote: >> >> Otherwise a non-solution could be to make IRQ_FORCED_THREADING >> >> configurable. >> > >> > I have to say I do not understand why we want to defer to a thread the >> > hard IRQ that we use in NAPI model. >> > >> Seems like the current forced threading comes with the big hammer and >> thread-ifies all hard irq's. To avoid this all NAPI network drivers >> would have to request the interrupt with IRQF_NO_THREAD. In a !RT kernel, forced threading (via commandline option) is mostly a debug aid. It's pretty useful when something crashes in hard interrupt context which usually takes the whole machine down. It's rather unlikely to be used on production systems, and if so then the admin surely should know what he's doing. > Right, it'd work for some drivers. Other drivers try to take spin locks > in their IRQ handlers. I checked a few which do and some of these spinlocks just protect register access and are not used for more complex serialization. So these could be converted to raw spinlocks because their scope is short and limited. But yes, you are right that this might be an issue in general. > What gave me a pause was that we have a busy loop in napi_schedule_prep:
>
> bool napi_schedule_prep(struct napi_struct *n)
> {
> 	unsigned long val, new;
>
> 	do {
> 		val = READ_ONCE(n->state);
> 		if (unlikely(val & NAPIF_STATE_DISABLE))
> 			return false;
> 		new = val | NAPIF_STATE_SCHED;
>
> 		/* Sets STATE_MISSED bit if STATE_SCHED was already set
> 		 * This was suggested by Alexander Duyck, as compiler
> 		 * emits better code than :
> 		 * if (val & NAPIF_STATE_SCHED)
> 		 *     new |= NAPIF_STATE_MISSED;
> 		 */
> 		new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED *
> 						   NAPIF_STATE_MISSED;
> 	} while (cmpxchg(&n->state, val, new) != val);
>
> 	return !(val & NAPIF_STATE_SCHED);
> }
>
> Dunno how acceptable this is to run in an IRQ handler on RT.. In theory it's bad, but I don't think it's a big deal in reality. Thanks, tglx
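For the register-access case mentioned above, such a conversion would look roughly like the sketch below. The device structure and register offsets are invented; the point is that raw_spinlock_t stays a spinning lock on PREEMPT_RT, which is only tolerable because the critical section is a couple of register accesses:

	/* Hypothetical sketch of converting a short, register-only lock. */
	#include <linux/spinlock.h>
	#include <linux/io.h>

	struct mydev {
		raw_spinlock_t reg_lock;	/* was: spinlock_t reg_lock; */
		void __iomem *base;
	};

	static u32 mydev_read_clear_status(struct mydev *d)
	{
		unsigned long flags;
		u32 status;

		/* was: spin_lock_irqsave(&d->reg_lock, flags); */
		raw_spin_lock_irqsave(&d->reg_lock, flags);
		status = readl(d->base + 0x10);		/* 0x10: made-up status register */
		writel(status, d->base + 0x10);		/* write-1-to-clear, illustrative */
		raw_spin_unlock_irqrestore(&d->reg_lock, flags);

		return status;
	}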
Re: Remove __napi_schedule_irqoff?
Jakub, On Sat, Oct 17 2020 at 16:29, Jakub Kicinski wrote: > On Sat, 17 Oct 2020 15:45:57 +0200 Heiner Kallweit wrote: >> It turned out that this most of the time isn't safe in certain >> configurations: >> - if CONFIG_PREEMPT_RT is set >> - if command line parameter threadirqs is set >> >> Having said that drivers are being switched back to __napi_schedule(), >> see e.g. patch in [0] and related discussion. I thought about a >> __napi_schedule version checking dynamically whether interrupts are >> disabled. But checking e.g. variable force_irqthreads also comes at >> a cost, so that we may not see a benefit compared to calling >> local_irq_save/local_irq_restore. This does not have to be a variable check. It's trivial enough to make it a static key. >> If more or less all users have to switch back, then the question is >> whether we should remove __napi_schedule_irqoff. >> Instead of touching all users we could make __napi_schedule_irqoff >> an alias for __napi_schedule for now. >> >> [0] https://lkml.org/lkml/2020/10/8/706 > > We're effectively calling raise_softirq_irqoff() from IRQ handlers, > with force_irqthreads == true that's no longer legal. Hrmpf, indeed. When force threading was introduced that did not exist. The forced threaded handler is always invoked with bottom halfs disabled and bottom half processing either happens when the handler returns and the thread wrapper invokes local_bh_enable() or from ksoftirq. As everything runs in thread context CPU local serialization through local_bh_disable() is sufficient. > Thomas - is the expectation that IRQ handlers never assume they have > IRQs disabled going forward? We don't have any performance numbers > but if I'm reading Agner's tables right POPF is 18 cycles on Broadwell. > Is PUSHF/POPF too cheap to bother? It's not only PUSHF/POPF it's PUSHF,CLI -> POPF, but yeah it's pretty cheap nowadays. But doing the static key change might still be a good thing. Completely untested patch below. Quoting Eric: > I have to say I do not understand why we want to defer to a thread the > hard IRQ that we use in NAPI model. > > Whole point of NAPI was to keep hard irq handler very short. Right. In case the interrupt handler is doing not much more than scheduling NAPI then you can request it with IRQF_NO_THREAD, which will prevent it from being threaded even on RT. > We should focus on transferring the NAPI work (potentially disrupting > ) to a thread context, instead of the very minor hard irq trigger. Read about that. I only looked briefly at the patches and wondered why this has it's own threading mechanism and is not using the irq thread mechanics. I'll have a closer look in the next days. 
Thanks, tglx --- --- a/drivers/ide/ide-iops.c +++ b/drivers/ide/ide-iops.c @@ -109,7 +109,6 @@ int __ide_wait_stat(ide_drive_t *drive, ide_hwif_t *hwif = drive->hwif; const struct ide_tp_ops *tp_ops = hwif->tp_ops; unsigned long flags; - bool irqs_threaded = force_irqthreads; int i; u8 stat; @@ -117,7 +116,7 @@ int __ide_wait_stat(ide_drive_t *drive, stat = tp_ops->read_status(hwif); if (stat & ATA_BUSY) { - if (!irqs_threaded) { + if (!force_irqthreads_active()) { local_save_flags(flags); local_irq_enable_in_hardirq(); } @@ -133,13 +132,13 @@ int __ide_wait_stat(ide_drive_t *drive, if ((stat & ATA_BUSY) == 0) break; - if (!irqs_threaded) + if (!force_irqthreads_active()) local_irq_restore(flags); *rstat = stat; return -EBUSY; } } - if (!irqs_threaded) + if (!force_irqthreads_active()) local_irq_restore(flags); } /* --- a/drivers/ide/ide-taskfile.c +++ b/drivers/ide/ide-taskfile.c @@ -406,7 +406,8 @@ static ide_startstop_t pre_task_out_intr return startstop; } - if (!force_irqthreads && (drive->dev_flags & IDE_DFLAG_UNMASK) == 0) + if (!force_irqthreads_active() && + (drive->dev_flags & IDE_DFLAG_UNMASK) == 0) local_irq_disable(); ide_set_handler(drive, &task_pio_intr, WAIT_WORSTCASE); --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -489,12 +489,16 @@ extern int irq_set_irqchip_state(unsigne #ifdef CONFIG_IRQ_FORCED_THREADING # ifdef CONFIG_PREEMPT_RT -# define force_irqthreads (true) +static inline bool force_irqthreads_active(void) { return true; } # else -extern bool force_irqthreads; +extern struct static_key_false force_irqthreads_key; +static inline bool force_irqthreads_active(void) +{ + return static_branch_unlikely(&force_irqthreads_ke
Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
On Sat, Oct 17 2020 at 16:15, Alex Belits wrote: > On Sat, 2020-10-17 at 18:08 +0200, Thomas Gleixner wrote: >> On Sat, Oct 17 2020 at 01:08, Alex Belits wrote: >> > I think that the goal of "finding source of disturbance" interface >> > is >> > different from what can be accomplished by tracing in two ways: >> > >> > 1. "Source of disturbance" should provide some useful information >> > about >> > category of event and it cause as opposed to determining all >> > precise >> > details about things being called that resulted or could result in >> > disturbance. It should not depend on the user's knowledge about >> > details >> >> Tracepoints already give you selectively useful information. > > Carefully placed tracepoints also can give the user information about > failures of open(), write(), execve() or mmap(). However syscalls still > provide an error code instead of returning generic failure and letting > user debug the cause. I have absolutely no idea what you are trying to tell me. Thanks, tglx
Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
On Sat, Oct 17 2020 at 01:08, Alex Belits wrote: > On Mon, 2020-10-05 at 14:52 -0400, Nitesh Narayan Lal wrote: >> On 10/4/20 7:14 PM, Frederic Weisbecker wrote: > I think that the goal of "finding source of disturbance" interface is > different from what can be accomplished by tracing in two ways: > > 1. "Source of disturbance" should provide some useful information about > category of event and it cause as opposed to determining all precise > details about things being called that resulted or could result in > disturbance. It should not depend on the user's knowledge about > details Tracepoints already give you selectively useful information. > of implementations, it should provide some definite answer of what > happened (with whatever amount of details can be given in a generic > mechanism) even if the user has no idea how those things happen and > what part of kernel is responsible for either causing or processing > them. Then if the user needs further details, they can be obtained with > tracing. It's just a matter of defining the tracepoint at the right place. > 2. It should be usable as a runtime error handling mechanism, so the > information it provides should be suitable for application use and > logging. It should be usable when applications are running on a system > in production, and no specific tracing or monitoring mechanism can be > in use. That's a strawman really. There is absolutely no reason why a specific set of tracepoints cannot be enabled on a production system. Your tracker is a monitoring mechanism, just a different flavour. By your logic above it cannot be enabled on a production system either. Also you can enable tracepoints from a control application, consume, log and act upon them. It's not any different from opening some magic isolation tracker interface. There are even multiple ways to do that including libraries. > If, say, thousands of devices are controlling neutrino detectors on an > ocean floor, and in a month of work one of them got one isolation > breaking event, it should be able to report that isolation was broken > by an interrupt from a network interface, so the users will be able to > track it down to some userspace application reconfiguring those > interrupts. Tracing can do that and it can do it selectively on the isolated CPUs. It's just a matter of proper configuration and usage. > It will be a good idea to make such mechanism optional and suitable for > tracking things on conditions other than "always enabled" and "enabled > with task isolation". Tracing already provides that. Tracepoints are individually controlled and filtered. > However in my opinion, there should be something in kernel entry > procedure that, if enabled, prepared something to be filled by the > cause data, and we know at least one such situation when this kernel > entry procedure should be triggered -- when task isolation is on. A tracepoint will gather that information for you. task isolation is not special, it's just yet another way to configure and use a system and tracepoints provide everything you need with the bonus that you can gather more correlated information when you need it. In fact tracing and tracepoints have replaced all specialized trackers which were in the kernel before tracing was available. We're not going to add a new one just because. If there is anything which you find that tracing and tracepoints cannot provide then the obvious solution is to extend that infrastructure so it can serve your usecase. Thanks, tglx
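As a concrete illustration of enabling tracing selectively for isolated CPUs: the sketch below limits tracing to CPU 3 and switches on the IRQ entry tracepoint through tracefs. The mount point and the CPU choice are assumptions for the example; it needs root and error handling is minimal:

	/* Illustrative only: trace interrupts hitting CPU 3 via tracefs. */
	#include <stdio.h>

	static int write_str(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			return -1;
		}
		fputs(val, f);
		fclose(f);
		return 0;
	}

	int main(void)
	{
		/* Assumes tracefs here; older setups use /sys/kernel/debug/tracing. */
		const char *tr = "/sys/kernel/tracing";
		char path[256];

		snprintf(path, sizeof(path), "%s/tracing_cpumask", tr);
		write_str(path, "8\n");		/* hex cpumask: CPU 3 only */

		snprintf(path, sizeof(path), "%s/events/irq/irq_handler_entry/enable", tr);
		write_str(path, "1\n");

		printf("read %s/trace_pipe to see interrupts on CPU 3\n", tr);
		return 0;
	}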
Re: [PATCH 0/7] TC-ETF support PTP clocks series
Andreas, On Wed, Oct 14 2020 at 09:12, Andreas Meisinger wrote: > Sorry about the wrong format/style of my last mail, hope I did get it > right this time. No problem. Yes this looks better. The only thing which could be improved is that your mail client fails to add In-Reply-To: References: ... headers and instead has the MS lookout specific Thread-Topic: [PATCH 0/7] TC-ETF support PTP clocks series Thread-Index: AdaiB8a+x+RhfhtwSZ+NKfvRdyiJkw== headers. If you look at the lore archive you see the effect: https://lore.kernel.org/r/am0pr10mb3073f9694ecad4f612a86fa7fa...@am0pr10mb3073.eurprd10.prod.outlook.com and that happens to all mail clients which use threading based on the standard headers. There is a config knob in lookout to enable them. > Let me first point at an important thing because we did have > discussions here about it too. As of the manpages Linux CLOCK_TAI > seems to be defined as an nonsettable clock which does have the same > behaviour as of international atomic time TAI. Yet if it's nonsettable > it can't be linked or synchronized to the value of International > Atomic Time? > > On the other hand there seems to be code in place to change the > CLOCK_TAI offset thus making it basically sort of "setable"? It obviously needs to be modifiable in some way, otherwise synchronization to a master clock via PTP would not work at all. But it cannot be set in the way of settimeofday()/clock_settime() like CLOCK_REALTIME. >> The user space daemon which does the correlation between these PTP >> domains and TAI is required in any case, so the magic clock >> TAI_PRIVATE is not having any advantage. > I think a userspace daemon handling the translation information > between different clocks would be fine. I think it's not really that > relevant who exactly does apply the translation. Kernel level might be > a little bit more precise but I guess it'd depend on other factors > too. Not really. The kernel still provides the timestamp pairs/triplets in the best way the underlying hardware provides it. Some hardware can even provide hardware-assisted pairs of ART and PTP clock. > Yet all translation based approaches have in common, setting a clock, > renders translations done in past invalid. It would be required to fix > old translated values along with setting the clock. I assume we > couldn't distinguish between "translated" values and genuine values > after translation, so fixing them might not be possible at all. CLOCK_TAI is not really set after the initial synchronization. It's frequency corrected without causing jumps. The PTP daemon uses a PLL-based algorithm for that. Of course this adjustment has side effects for translation. > In our usecase we do have a clock for network operation which can't be > set. We can guarantee this because we are able to define the network > not being operational when the defined time is not available 😉. > Having this defined the remaining option would be the target clock to > be set. As of your suggestion that would be CLOCK_TAI. So at this > point "setable" or "nonsettable" would become important. Here > "setable" would introduce a dependency between the clocks. Being > independent from clocks outside of our control was exactly the reason > to introduce the separate network clock. To me this means if CLOCK_TAI > could be changed by anything it couldn't be the base clock for our > usecase if it can't it might be a solution. It's under your control as system designer how you operate CLOCK_TAI. 
If you use the PTP daemon then it will sync CLOCK_TAI to the PTP clock of your choice. If you don't have PTP at all then you can use NTP to sync to a time server, which is less accurate. You can use PPS or just use nothing. The kernel does not care which non-standard base time or frequency you choose. Applications which communicate over network might care if the other side uses a different time universe. Log files which start at 1971 might be interesting to analyse against the log file of your other system which starts in 2020. >> Depending on the frequency drift between CLOCK_TAI and clock PTP/$N >> the timer expiry might be slightly inaccurate, but surely not more >> inaccurate than if that conversion is done purely in user space. >> >> The self rearming posix timers would work too, but the self rearming >> is based on CLOCK_TAI, so rounding errors and drift would be >> accumulative. So I'd rather stay away from them. > > As of the time ranges typically used in tsn networks the drift error > for single shot timers most likely isn't a big deal. But as you > suggest I'd stay away from long running timers as well rearming ones > too, I guess they'd be mostly unusable. Depends. It's a matter of hardware properties, system requirements, and application/system designers' decisions. So what you consider unusable for your system might be perfectly fine for others. If we add support for this type of correlation then of course these things need to be documented. > Rig
Re: [PATCH 3/3] timekeeping: remove arch_gettimeoffset
On Thu, Oct 08 2020 at 17:46, Arnd Bergmann wrote: > With Arm EBSA110 gone, nothing uses it any more, so the corresponding > code and the Kconfig option can be removed. Yay! It only took 11+ years to get rid of that. Feel free to route it through your tree. Acked-by: Thomas Gleixner
Re: AW: [PATCH 0/7] TC-ETF support PTP clocks series
Andreas, On Fri, Oct 09 2020 at 11:17, Andreas Meisinger wrote: please do not top-post and trim your replies. > Yet we do already have usecases where this can't be done. Additionally > a lot of discussions at this topic are ongoing in 60802 profile > creation too. Some of our usecases do require a network which does > not depend on any external timesource. This might be due to the > network not being connected (to the internet) or just because the > network may not be able to rely on or trust an external > timesource. Some reasons for this might be safety, security, > availability or legal implications ( e.g. if a machine builder has to > guarantee operation of a machine which depends on an internal tsn > network). I'm aware of the reasons for these kinds of setups. > About your question if an application needs to be able to sync to > multiple timescales. A small count of usecases even would require > multiple independent timesources to be used. At the moment they all > seem to be located in the area of extreme high availability. There's > ongoing evaluation about this issues and we're not sure if there's a > way to do this without special hardware so we didn't address it here. Reading several raw PTP clocks is always possible through the existing interfaces and if the coordination between real TAI and the raw PTP clocks is available, then these interfaces could be extended to provide time normalized to real TAI. But that does not allow utilizing the magic clocks for arming timers, so these have to be based on some other clock and the application needs to do the conversion back and forth. Now I said that we could abuse time name spaces for providing access to _one_ magic TAI clock which lets the kernel do that work, but thinking more about it, it should be possible to do so for all of them even without name spaces. The user space daemon which does the correlation between these PTP domains and TAI is required in any case, so the magic clock TAI_PRIVATE does not have any advantage. If that correlation exists then at least clock_nanosleep() should be doable. So clock_nanosleep(clock PTP/$N) would convert the sleep time to TAI and queue a timer internally on the CLOCK_TAI base. Depending on the frequency drift between CLOCK_TAI and clock PTP/$N the timer expiry might be slightly inaccurate, but surely not more inaccurate than if that conversion is done purely in user space. The self rearming posix timers would work too, but the self rearming is based on CLOCK_TAI, so rounding errors and drift would be accumulative. So I'd rather stay away from them. If there is no daemon which manages the correlation then the syscall would fail. If such a coordination exists, then the whole problem in the TSN stack is gone. The core can always operate on TAI and the network device which runs in a different time universe would use the same conversion data e.g. to queue a packet for HW based time triggered transmission. Again subject to slight inaccuracy, but it does not come with all the problems of dynamic clocks, locking issues etc. As the frequency drift between PTP domains is neither fast changing nor randomly jumping around the inaccuracy might even be a mostly academic problem. Thoughts? Thanks, tglx
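For reference, this is roughly what the existing dynamic POSIX clock interface looks like from user space: the PHC is read through a clockid derived from its open character device, and a correlation daemon would sample pairs like this next to CLOCK_TAI. The /dev/ptp0 path is an assumption and error handling is trimmed:

	/* Read a PHC via its dynamic posix clockid next to CLOCK_TAI. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	/* Dynamic posix clocks encode the open file descriptor (CLOCKFD = 3). */
	#define FD_TO_CLOCKID(fd)	((~(clockid_t)(fd) << 3) | 3)

	#ifndef CLOCK_TAI
	#define CLOCK_TAI 11		/* uapi value; some libcs don't expose it */
	#endif

	int main(void)
	{
		int fd = open("/dev/ptp0", O_RDONLY);	/* assumed PHC device node */
		struct timespec tai, phc;

		if (fd < 0) {
			perror("/dev/ptp0");
			return 1;
		}
		clock_gettime(CLOCK_TAI, &tai);
		clock_gettime(FD_TO_CLOCKID(fd), &phc);

		printf("TAI: %lld.%09ld  PHC: %lld.%09ld\n",
		       (long long)tai.tv_sec, tai.tv_nsec,
		       (long long)phc.tv_sec, phc.tv_nsec);
		close(fd);
		return 0;
	}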
Re: [PATCH 0/7] TC-ETF support PTP clocks series
Vinicius, On Fri, Oct 02 2020 at 12:01, Vinicius Costa Gomes wrote: > I think that there's an underlying problem/limitation that is the cause > of the issue (or at least a step in the right direction) you are trying > to solve: the issue is that PTP clocks can't be used as hrtimers. That's only an issue if PTP time != CLOCK_TAI, which is insane to begin with. As I know, these insanities exist in real world setups, e.g. grand clock masters which start at the epoch, which causes complete disaster when any of the slave devices booted earlier. Obviously people came up with system designs which are even more insane. > I didn't spend a lot of time thinking about how to solve this (the only > thing that comes to mind is having a timecounter, or similar, "software > view" over the PHC clock). There are two aspects:

1) What's the overall time coordination especially for applications?

   PTP is for a reason based on TAI which allows a universal representation of time. Strict monotonic, no time zones, no leap seconds, no bells and whistles. Using TAI in distributed systems solved a gazillion hard problems in one go. TSN depends on PTP and that obviously makes CLOCK_TAI _the_ clock of choice for schedules and whatever is needed. It just solves the problem nicely and we spent a great amount of time to make application development for TSN reasonable and hardware agnostic. Now industry comes along and decides to introduce independent time universes. The result is a black hole for programmers because they now have to waste effort - again - on solving the incredibly hard problems of warping space and time. The amount of money saved by not having properly coordinated time bases in such systems is definitely marginal compared to the amount of non-sensical work required to fix it in software.

2) How can an OS provide halfway usable interfaces to handle this trainwreck?

   Access to the various time universes is already available through the dynamic POSIX clocks. But these interfaces have been designed for the performance insensitive work of PTP daemons and not for the performance critical work of applications dealing with real-time requirements of all sorts. As these raw PTP clocks are hardware dependent and only known at boot / device discovery time they cannot be exposed to the kernel internally in any sane way. Also the user space interface has to be dynamic which rules out the ability to assign fixed CLOCK_* ids. As a consequence these clocks cannot provide timers like the regular CLOCK_* variants do, which makes it insanely hard to develop sane and portable applications.

What comes to my mind (without spending much thought on it) is:

1) Utilize and extend the existing PTP mechanisms to calculate the time relationship between the system wide CLOCK_TAI and the uncoordinated time universe. As offset is a constant and frequency drift is not a high speed problem this can be done with a userspace daemon of some sorts.

2) Provide CLOCK_TAI_PRIVATE which defaults to CLOCK_TAI, i.e. offset = 0 and frequency ratio = 1 : 1

3) (Ab)use the existing time namespace to provide a mechanism to adjust the offset and frequency ratio of CLOCK_TAI_PRIVATE which is calculated by #1

This is the really tricky part and comes with severe limitations:

  - We can't walk the task list to find tasks which have their CLOCK_TAI_PRIVATE associated with a particular incarnation of PHC/PTP universe, so some sane referencing of the underlying parameters to convert TAI to TAI_PRIVATE and vice versa has to be found. 
Lifetime problems are going to be interesting to deal with.

  - An application cannot coordinate multiple PHC/PTP domains and has to restrict itself to pick ONE disjunct time universe. Whether that's a reasonable limitation I don't know, simply because the information provided in this patch series is close to zero.

  - Preventing early timer expiration caused by frequency drift is not trivial either.

TBH, just thinking about all of that makes me shudder and my knee jerk reaction is: NO WAY! Why the heck can't hardware people and system designers finally understand that time is not something they can define at their own peril? The "Let's solve it in software so I don't have to think about it" design approach strikes again. This caused headaches for the past five decades, but people obviously never learn. That said, I'm open for solutions which are at least in the proximity of sane, but that needs a lot more information about the use cases and the implications and not just some hand
Re: [PATCH 7/7] TC-ETF support PTP clocks
On Thu, Oct 01 2020 at 22:51, Erez Geva wrote: > - Add support for using a POSIX dynamic clock with > Traffic control Earliest TxTime First (ETF) Qdisc. > --- a/include/uapi/linux/net_tstamp.h > +++ b/include/uapi/linux/net_tstamp.h > @@ -167,6 +167,11 @@ enum txtime_flags { > SOF_TXTIME_FLAGS_MASK = (SOF_TXTIME_FLAGS_LAST - 1) | >SOF_TXTIME_FLAGS_LAST > }; > +/* > + * Clock ID to use with POSIX clocks > + * The ID must be u8 to fit in (struct sock)->sk_clockid > + */ > +#define SOF_TXTIME_POSIX_CLOCK_ID (0x77) Random number with a random name. > struct sock_txtime { > __kernel_clockid_t clockid;/* reference clockid */ > diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c > index c0de4c6f9299..8e3e0a61fa58 100644 > --- a/net/sched/sch_etf.c > +++ b/net/sched/sch_etf.c > @@ -15,6 +15,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -40,19 +41,40 @@ struct etf_sched_data { > struct rb_root_cached head; > struct qdisc_watchdog watchdog; > ktime_t (*get_time)(void); > +#ifdef CONFIG_POSIX_TIMERS > + struct posix_clock *pclock; /* pointer to a posix clock */ Tail comments suck because they disturb the reading flow and this comment has absolute zero value. Comments are required to explain things which are not obvious... > +#endif /* CONFIG_POSIX_TIMERS */ Also this #ifdeffery is bonkers. How is TSN supposed to work without POSIX_TIMERS in the first place? > static const struct nla_policy etf_policy[TCA_ETF_MAX + 1] = { > [TCA_ETF_PARMS] = { .len = sizeof(struct tc_etf_qopt) }, > }; > > +static inline ktime_t get_now(struct Qdisc *sch, struct etf_sched_data *q) > +{ > +#ifdef CONFIG_POSIX_TIMERS > + if (IS_ERR_OR_NULL(q->get_time)) { > + struct timespec64 ts; > + int err = posix_clock_gettime(q->pclock, &ts); > + > + if (err) { > + pr_warn("Clock is disabled (%d) for queue %d\n", > + err, q->queue); > + return 0; That's really useful error handling. > + } > + return timespec64_to_ktime(ts); > + } > +#endif /* CONFIG_POSIX_TIMERS */ > + return q->get_time(); > +} > + > static inline int validate_input_params(struct tc_etf_qopt *qopt, > struct netlink_ext_ack *extack) > { > /* Check if params comply to the following rules: >* * Clockid and delta must be valid. >* > - * * Dynamic clockids are not supported. > + * * Dynamic CPU clockids are not supported. >* >* * Delta must be a positive or zero integer. >* > @@ -60,11 +82,22 @@ static inline int validate_input_params(struct > tc_etf_qopt *qopt, >* expect that system clocks have been synchronized to PHC. >*/ > if (qopt->clockid < 0) { > +#ifdef CONFIG_POSIX_TIMERS > + /** > + * Use of PTP clock through a posix clock. > + * The TC application must open the posix clock device file > + * and use the dynamic clockid from the file description. What? How is the code which calls into this guaranteed to have a valid file descriptor open for a particular dynamic posix clock? > + */ > + if (!is_clockid_fd_clock(qopt->clockid)) { > + NL_SET_ERR_MSG(extack, > +"Dynamic CPU clockids are not > supported"); > + return -EOPNOTSUPP; > + } > +#else /* CONFIG_POSIX_TIMERS */ > NL_SET_ERR_MSG(extack, "Dynamic clockids are not supported"); > return -ENOTSUPP; > - } > - > - if (qopt->clockid != CLOCK_TAI) { > +#endif /* CONFIG_POSIX_TIMERS */ > + } else if (qopt->clockid != CLOCK_TAI) { > NL_SET_ERR_MSG(extack, "Invalid clockid. 
CLOCK_TAI must be > used"); > return -EINVAL; > } > @@ -103,7 +136,7 @@ static bool is_packet_valid(struct Qdisc *sch, struct > etf_sched_data *q, > return false; > > skip: > - now = q->get_time(); > + now = get_now(sch, q); Yuck. is_packet_valid() is invoked via:

  __dev_queue_xmit()
    __dev_xmit_skb()
      etf_enqueue_timesortedlist()
        is_packet_valid()

__dev_queue_xmit() does rcu_read_lock_bh(); and your get_now() does:

  posix_clock_gettime()
    down_read(&clk->rwsem);   <-- FAIL

down_read() might sleep and cannot be called from a BH disabled region. This clearly has never been tested with any mandatory debug option enabled. Why am I not surprised? Aside of accessing the PHC clock being slow as hell, this cannot ever work and there is no way to make it work in any consistent form. If you have several NICs on several PHC domains then all of these domains should have one thing in common: CLOCK_TAI and the frequency. If that'
Re: [PATCH 2/7] Function to retrieve main clock state
On Fri, Oct 02 2020 at 00:05, Thomas Gleixner wrote: > On Thu, Oct 01 2020 at 22:51, Erez Geva wrote: > > same comments as for patch 1 apply. > >> Add kernel function to retrieve main clock oscillator state. > > The function you are adding is named adjtimex(). adjtimex(2) is a well > known user space interface and naming a read only kernel interface the > same way is misleading. Aside of that there is no user for this function in this series. We're not adding interfaces just because we can. Thanks, tglx
Re: [PATCH 3/7] Functions to fetch POSIX dynamic clock object
On Thu, Oct 01 2020 at 22:51, Erez Geva wrote: > Add kernel functions to fetch a pointer to a POSIX dynamic clock > using a user file description dynamic clock ID. And how is that supposed to work. What are the lifetime rules? > +struct posix_clock *posix_clock_get_clock(clockid_t id) > +{ > + int err; > + struct posix_clock_desc cd; The core code uses reverse fir tree ordering of variable declaration based on the length: struct posix_clock_desc cd; int err; > + /* Verify we use posix clock ID */ > + if (!is_clockid_fd_clock(id)) > + return ERR_PTR(-EINVAL); > + > + err = get_clock_desc(id, &cd); So this is a kernel interface and get_clock_desc() does: struct file *fp = fget(clockid_to_fd(id)); How is that file descriptor valid in random kernel context? > + if (err) > + return ERR_PTR(err); > + > + get_device(cd.clk->dev); The purpose of this is? Comments are overrated... > + put_clock_desc(&cd); > + > + return cd.clk; > +} > +EXPORT_SYMBOL_GPL(posix_clock_get_clock); > + > +int posix_clock_put_clock(struct posix_clock *clk) > +{ > + if (IS_ERR_OR_NULL(clk)) > + return -EINVAL; > + put_device(clk->dev); > + return 0; > +} > +EXPORT_SYMBOL_GPL(posix_clock_put_clock); > + > +int posix_clock_gettime(struct posix_clock *clk, struct timespec64 *ts) > +{ > + int err; > + > + if (IS_ERR_OR_NULL(clk)) > + return -EINVAL; > + > + down_read(&clk->rwsem); Open coding the logic of get_posix_clock() and having a copy here and in the next function is really useful. Thanks, tglx
Re: [PATCH 5/7] Traffic control using high-resolution timer issue
On Thu, Oct 01 2020 at 22:51, Erez Geva wrote: Issue? You again fail to decribe the problem. > - Add new schedule function for Qdisc watchdog. > The function avoids reprogram if watchdog already expire > before new expire. Why can't the existing function not be changed to do that? > - Use new schedule function in ETF. > > - Add ETF range value to kernel configuration. > as the value is characteristic to Hardware. No. That's completely wrong. Hardware properties need to be established at boot/runtime otherwise you can't build a kernel which runs on different platforms. > +void qdisc_watchdog_schedule_soon_ns(struct qdisc_watchdog *wd, u64 expires, > + u64 delta_ns) schedule soon? That sounds like schedule it sooner than later, but I don't care. > +{ > + if (test_bit(__QDISC_STATE_DEACTIVATED, > + &qdisc_root_sleeping(wd->qdisc)->state)) > + return; > + > + if (wd->last_expires == expires) > + return; How is this supposed to be valid without checking whether the timer is queued in the first place? Maybe the function name should be schedule_soon_or_not() > + /** Can you please use proper comment style? This is neither network comment style nor the regular sane kernel comment style. It's kerneldoc comment style which is reserved for function and struct documentation. > + * If expires is in [0, now + delta_ns], > + * do not program it. This range info is just confusing. Just use plain english. > + */ > + if (expires <= ktime_to_ns(hrtimer_cb_get_time(&wd->timer)) + delta_ns) > + return; So if the watchdog is NOT queued and expiry < now + delta_ns then this returns and nothing starts the timer ever. That might all be correct, but without any useful explanation it looks completely bogus. > + /** > + * If timer is already set in [0, expires + delta_ns], > + * do not reprogram it. > + */ > + if (hrtimer_is_queued(&wd->timer) && > + wd->last_expires <= expires + delta_ns) > + return; > + > + wd->last_expires = expires; > + hrtimer_start_range_ns(&wd->timer, > +ns_to_ktime(expires), > +delta_ns, > +HRTIMER_MODE_ABS_PINNED); > +} > +EXPORT_SYMBOL(qdisc_watchdog_schedule_soon_ns); > + > void qdisc_watchdog_cancel(struct qdisc_watchdog *wd) > { > hrtimer_cancel(&wd->timer); > diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c > index c48f91075b5c..48b2868c4672 100644 > --- a/net/sched/sch_etf.c > +++ b/net/sched/sch_etf.c > @@ -20,6 +20,11 @@ > #include > #include > > +#ifdef CONFIG_NET_SCH_ETF_TIMER_RANGE > +#define NET_SCH_ETF_TIMER_RANGE CONFIG_NET_SCH_ETF_TIMER_RANGE > +#else > +#define NET_SCH_ETF_TIMER_RANGE (5 * NSEC_PER_USEC) > +#endif > #define DEADLINE_MODE_IS_ON(x) ((x)->flags & TC_ETF_DEADLINE_MODE_ON) > #define OFFLOAD_IS_ON(x) ((x)->flags & TC_ETF_OFFLOAD_ON) > #define SKIP_SOCK_CHECK_IS_SET(x) ((x)->flags & TC_ETF_SKIP_SOCK_CHECK) > @@ -128,8 +133,9 @@ static void reset_watchdog(struct Qdisc *sch) > return; > } > > - next = ktime_sub_ns(skb->tstamp, q->delta); > - qdisc_watchdog_schedule_ns(&q->watchdog, ktime_to_ns(next)); > + next = ktime_sub_ns(skb->tstamp, q->delta + NET_SCH_ETF_TIMER_RANGE); > + qdisc_watchdog_schedule_soon_ns(&q->watchdog, ktime_to_ns(next), > + NET_SCH_ETF_TIMER_RANGE); This is changing 5 things at once. That's just wrong. patch 1: Add the new function and explain why it's required patch 2: Make reset_watchdog() use it patch 3: Add a mechanism to retrieve the magic hardware range from the underlying hardware driver and add that value to q->delta or set q->delta to it. Whatever makes sense. The current solution makes no sense at all Thanks, tglx
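To make the concern concrete, here is a toy model of the two quoted checks (not the kernel code). With the timer not queued and the expiry closer than delta_ns, the function returns without ever arming anything:

	/* Toy model of the quoted early-return logic, to illustrate the concern. */
	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	static bool timer_queued;
	static uint64_t last_expires;

	static void schedule_soon(uint64_t expires, uint64_t now, uint64_t delta_ns)
	{
		if (last_expires == expires)
			return;
		if (expires <= now + delta_ns)		/* "too close, don't program" */
			return;
		if (timer_queued && last_expires <= expires + delta_ns)
			return;
		last_expires = expires;
		timer_queued = true;
		printf("armed timer for %llu\n", (unsigned long long)expires);
	}

	int main(void)
	{
		/* Timer not queued, expiry only 2us away, delta_ns = 5us:
		 * the function returns without arming anything at all.
		 */
		schedule_soon(/*expires*/ 1002000, /*now*/ 1000000, /*delta_ns*/ 5000);
		printf("timer_queued = %d\n", timer_queued);	/* prints 0 */
		return 0;
	}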
Re: [PATCH 4/7] Fix qdisc_watchdog_schedule_range_ns range check
On Thu, Oct 01 2020 at 22:51, Erez Geva wrote: Fixes should be at the beginning of a patch series and not be hidden somewhere in the middle. >- As all parameters are unsigned. This is not a sentence and this list style does not make that changelog more readable. >- If 'expires' is bigger than 'last_expires' then the left expression > overflows. This would be the most important information and should be clearly spelled out as problem description at the very beginning of the change log. >- It is better to use addition and check both ends of the range. Is better? Either your change is correcting the problem or not. Just better but incorrect does not cut it. But let's look at the problem itself. The check is about: B <= A <= B + C A, B, C are all unsigned. So if B > A then the result is false. Now lets look at the implementation: if (A - B <= C) return; which works correctly due the wonders of unsigned math. For B <= A the check is obviously correct. If B > A then the result of the unsigned subtraction A - B is a very large positive number which is pretty much guaranteed to be larger than C, i.e. the result is false. So while not immediately obvious, it's still correct. Thanks, tglx
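A quick runnable check of the unsigned-arithmetic argument above (the numbers are arbitrary):

	/* Shows that "A - B <= C" mirrors "B <= A <= B + C" for unsigned values. */
	#include <stdint.h>
	#include <stdio.h>

	static int in_range(uint64_t a, uint64_t b, uint64_t c)
	{
		return a - b <= c;	/* the check under discussion */
	}

	int main(void)
	{
		printf("%d\n", in_range(105, 100, 10));	/* B <= A <= B + C: prints 1 */
		printf("%d\n", in_range(200, 100, 10));	/* A > B + C:       prints 0 */
		printf("%d\n", in_range(90, 100, 10));	/* A < B: wraps to a huge value, prints 0 */
		return 0;
	}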