Re: [patch 1/1] genirq: Disable interrupts for force threaded handlers
On Wed, Mar 17 2021 at 15:48, Sebastian Andrzej Siewior wrote:
> On 2021-03-17 15:38:52 [+0100], Thomas Gleixner wrote:
>>        thread(irq_A)
>>          irq_handler(A)
>>            spin_lock(&foo->lock);
>>
>>        interrupt(irq_B)
>>          irq_handler(B)
>>            spin_lock(&foo->lock);
>
> It will not because both threads will wake_up(thread). It is an issue if
>  - &foo->lock is shared between an hrtimer and a threaded IRQ
>  - &foo->lock is shared between a non-threaded and a threaded IRQ
>  - &foo->lock is shared between a printk() in hardirq context and a
>    threaded IRQ, as I learned today.

That's the point and it's entirely clear from the above: A is thread
context and B is hard interrupt context, and if the lock is shared then
it's busted. Otherwise we would not have this discussion at all.
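For illustration, a minimal sketch of the affected driver pattern (all names
hypothetical, not taken from a real driver): a force-threaded handler takes a
plain spin_lock() while another handler that still runs in hard interrupt
context (a device sharing the lock, an hrtimer callback, or printk() from
hardirq context) takes the same lock on the same CPU.

#include <linux/interrupt.h>
#include <linux/spinlock.h>

struct foo {				/* hypothetical shared driver state */
	spinlock_t lock;
};

static irqreturn_t irq_a_handler(int irq, void *dev_id)	/* force threaded */
{
	struct foo *foo = dev_id;

	spin_lock(&foo->lock);		/* only BH is disabled here before the fix */
	/* ... update shared device state ... */
	spin_unlock(&foo->lock);
	return IRQ_HANDLED;
}

static irqreturn_t irq_b_handler(int irq, void *dev_id)	/* hard irq context */
{
	struct foo *foo = dev_id;

	/*
	 * If this interrupt fires on the CPU which is inside the critical
	 * section above, it spins forever: the thread cannot run until the
	 * hard interrupt returns, and the hard interrupt waits for the lock.
	 */
	spin_lock(&foo->lock);
	/* ... */
	spin_unlock(&foo->lock);
	return IRQ_HANDLED;
}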
[patch 1/1] genirq: Disable interrupts for force threaded handlers
With interrupt force threading all device interrupt handlers are invoked
from kernel threads. Contrary to hard interrupt context the invocation only
disables bottom halves, but not interrupts. This was an oversight back then
because any code like this will have an issue:

     thread(irq_A)
       irq_handler(A)
         spin_lock(&foo->lock);

     interrupt(irq_B)
       irq_handler(B)
         spin_lock(&foo->lock);

This has been triggered with networking (NAPI vs. hrtimers) and console
drivers where printk() happens from an interrupt which interrupted the
force threaded handler.

Now people have noticed and started to change the spin_lock() in the
handler to spin_lock_irqsave(), which affects performance, or to add
IRQF_NO_THREAD to the interrupt request, which in turn breaks RT.

Fix the root cause and not the symptom and disable interrupts before
invoking the force threaded handler which preserves the regular semantics
and the usefulness of the interrupt force threading as a general debugging
tool.

For !RT this is not changing much, except that during the execution of the
threaded handler interrupts are delayed until the handler returns. Vs.
scheduling and softirq processing there is no difference.

For RT kernels there is no issue.

Fixes: 8d32a307e4fa ("genirq: Provide forced interrupt threading")
Reported-by: Johan Hovold
Signed-off-by: Thomas Gleixner
Cc: Eric Dumazet
Cc: Sebastian Andrzej Siewior
Cc: netdev
Cc: "David S. Miller"
Cc: Krzysztof Kozlowski
Cc: Greg Kroah-Hartman
Cc: Andy Shevchenko
Cc: Peter Zijlstra
Cc: linux-ser...@vger.kernel.org
Cc: netdev
---
 kernel/irq/manage.c |    4
 1 file changed, 4 insertions(+)

--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1142,11 +1142,15 @@ irq_forced_thread_fn(struct irq_desc *de
 	irqreturn_t ret;
 
 	local_bh_disable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_disable();
 	ret = action->thread_fn(action->irq, action->dev_id);
 	if (ret == IRQ_HANDLED)
 		atomic_inc(&desc->threads_handled);
 
 	irq_finalize_oneshot(desc, action);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_enable();
 	local_bh_enable();
 	return ret;
 }
Re: [PATCH net 2/2] net: gro: Let the timeout timer expire in softirq context with `threadirqs'
On Thu, Apr 16 2020 at 15:59, Sebastian Andrzej Siewior wrote:
> any comments from the timer department?

Yes.

> On 2019-11-27 18:37:19 [+0100], To Eric Dumazet wrote:
>> On 2019-11-27 09:11:40 [-0800], Eric Dumazet wrote:
>> > Resent in non HTML mode :/
>> don't worry, mutt handles both :)
>>
>> > Long story short, why hrtimer are not by default using threaded mode
>> > in threadirqs mode ?
>>
>> Because it is only documented to thread only interrupts. Not sure if we
>> want to change this.
>> In RT we expire most of the hrtimers in softirq context for other
>> reasons. A subset of them still expire in hardirq context.
>>
>> > Idea of having some (but not all of them) hard irq handlers' now being
>> > run from BH mode,
>> > is rather scary.
>>
>> As I explained in my previous email: All IRQ-handlers fire in
>> threaded-mode if enabled. Only the hrtimer is not affected by this
>> change.
>>
>> > Also, hrtimers got the SOFT thing only in 4.16, while the GRO patch
>> > went in linux-3.19
>> >
>> > What would be the plan for stable trees ?
>> No idea yet. We could let __napi_schedule_irqoff() behave like
>> __napi_schedule().

It's not really a timer department problem. It's an interrupt problem.

With force threaded interrupts we don't call the handler with interrupts
disabled. What sounded like a good idea long ago is actually bad.

See https://lore.kernel.org/r/87eegdzzez@nanos.tec.linutronix.de

Any leftover issues on a RT kernel are a different story, but for !RT this
is the proper fix.

I'll spin up a proper patch and tag it for stable...

Thanks,

        tglx
Re: [patch 07/14] tasklets: Prevent tasklet_unlock_spin_wait() deadlock on RT
On Tue, Mar 09 2021 at 16:21, Sebastian Andrzej Siewior wrote:
> On 2021-03-09 16:00:37 [+0100], To Thomas Gleixner wrote:
>> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
>> index 07c7329d21aa7..1c14ccd351091 100644
>> --- a/include/linux/interrupt.h
>> +++ b/include/linux/interrupt.h
>> @@ -663,15 +663,6 @@ static inline int tasklet_trylock(struct tasklet_struct *t)
>>  void tasklet_unlock(struct tasklet_struct *t);
>>  void tasklet_unlock_wait(struct tasklet_struct *t);
>>
>> -/*
>> - * Do not use in new code. Waiting for tasklets from atomic contexts is
>> - * error prone and should be avoided.
>> - */
>> -static inline void tasklet_unlock_spin_wait(struct tasklet_struct *t)
>> -{
>> -	while (test_bit(TASKLET_STATE_RUN, &t->state))
>> -		cpu_relax();
>> -}
>
> Look at that. The forward declaration for tasklet_unlock_spin_wait()
> should have remained.

Sorry for that. No idea how I managed to mess that up and fail to notice.

Brown paperbags to the rescue.

Thanks,

        tglx
[patch 06/14] tasklets: Replace spin wait in tasklet_kill()
From: Peter Zijlstra tasklet_kill() spin waits for TASKLET_STATE_SCHED to be cleared invoking yield() from inside the loop. yield() is an ill defined mechanism and the result might still be wasting CPU cycles in a tight loop which is especially painful in a guest when the CPU running the tasklet is scheduled out. tasklet_kill() is used in teardown paths and not performance critical at all. Replace the spin wait with wait_var_event(). Signed-off-by: Peter Zijlstra Signed-off-by: Thomas Gleixner --- kernel/softirq.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -532,6 +532,16 @@ void __tasklet_hi_schedule(struct taskle } EXPORT_SYMBOL(__tasklet_hi_schedule); +static inline bool tasklet_clear_sched(struct tasklet_struct *t) +{ + if (test_and_clear_bit(TASKLET_STATE_SCHED, &t->state)) { + wake_up_var(&t->state); + return true; + } + + return false; +} + static void tasklet_action_common(struct softirq_action *a, struct tasklet_head *tl_head, unsigned int softirq_nr) @@ -551,8 +561,7 @@ static void tasklet_action_common(struct if (tasklet_trylock(t)) { if (!atomic_read(&t->count)) { - if (!test_and_clear_bit(TASKLET_STATE_SCHED, - &t->state)) + if (!tasklet_clear_sched(t)) BUG(); if (t->use_callback) t->callback(t); @@ -612,13 +621,11 @@ void tasklet_kill(struct tasklet_struct if (in_interrupt()) pr_notice("Attempt to kill tasklet from interrupt\n"); - while (test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) { - do { - yield(); - } while (test_bit(TASKLET_STATE_SCHED, &t->state)); - } + while (test_and_set_bit(TASKLET_STATE_SCHED, &t->state)) + wait_var_event(&t->state, !test_bit(TASKLET_STATE_SCHED, &t->state)); + tasklet_unlock_wait(t); - clear_bit(TASKLET_STATE_SCHED, &t->state); + tasklet_clear_sched(t); } EXPORT_SYMBOL(tasklet_kill);
[patch 14/14] tasklets: Switch tasklet_disable() to the sleep wait variant
-- NOT FOR IMMEDIATE MERGING --

Now that all users of tasklet_disable() are invoked from sleepable context,
convert it to use tasklet_unlock_wait() which might sleep.

Signed-off-by: Thomas Gleixner
---
 include/linux/interrupt.h |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -716,8 +716,7 @@ static inline void tasklet_disable_in_at
 
 static inline void tasklet_disable(struct tasklet_struct *t)
 {
 	tasklet_disable_nosync(t);
-	/* Spin wait until all atomic users are converted */
-	tasklet_unlock_spin_wait(t);
+	tasklet_unlock_wait(t);
 	smp_mb();
 }
[patch 12/14] PCI: hv: Use tasklet_disable_in_atomic()
From: Sebastian Andrzej Siewior

The hv_compose_msi_msg() callback in irq_chip::irq_compose_msi_msg is
invoked via irq_chip_compose_msi_msg(), which itself is always invoked from
atomic contexts from the guts of the interrupt core code.

There is no way to change this w/o rewriting the whole driver, so use
tasklet_disable_in_atomic() which allows making tasklet_disable() sleepable
once the remaining atomic users are addressed.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
Cc: "K. Y. Srinivasan"
Cc: Haiyang Zhang
Cc: Stephen Hemminger
Cc: Wei Liu
Cc: Lorenzo Pieralisi
Cc: Rob Herring
Cc: Bjorn Helgaas
Cc: linux-hyp...@vger.kernel.org
Cc: linux-...@vger.kernel.org
---
 drivers/pci/controller/pci-hyperv.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1458,7 +1458,7 @@ static void hv_compose_msi_msg(struct ir
 	 * Prevents hv_pci_onchannelcallback() from running concurrently
 	 * in the tasklet.
 	 */
-	tasklet_disable(&channel->callback_event);
+	tasklet_disable_in_atomic(&channel->callback_event);
 
 	/*
 	 * Since this function is called with IRQ locks held, can't
[patch 13/14] firewire: ohci: Use tasklet_disable_in_atomic() where required
From: Sebastian Andrzej Siewior

tasklet_disable() is invoked in several places. Some of them are in atomic
context which prevents a conversion of tasklet_disable() to a sleepable
function.

The atomic callchains are:

 ar_context_tasklet()
   ohci_cancel_packet()
     tasklet_disable()

 ...
   ohci_flush_iso_completions()
     tasklet_disable()

The invocation of tasklet_disable() from at_context_flush() is always in
preemptible context.

Use tasklet_disable_in_atomic() for the two invocations in
ohci_cancel_packet() and ohci_flush_iso_completions().

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
Cc: Stefan Richter
Cc: linux1394-de...@lists.sourceforge.net
---
 drivers/firewire/ohci.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/drivers/firewire/ohci.c
+++ b/drivers/firewire/ohci.c
@@ -2545,7 +2545,7 @@ static int ohci_cancel_packet(struct fw_
 	struct driver_data *driver_data = packet->driver_data;
 	int ret = -ENOENT;
 
-	tasklet_disable(&ctx->tasklet);
+	tasklet_disable_in_atomic(&ctx->tasklet);
 
 	if (packet->ack != 0)
 		goto out;
@@ -3465,7 +3465,7 @@ static int ohci_flush_iso_completions(st
 	struct iso_context *ctx = container_of(base, struct iso_context, base);
 	int ret = 0;
 
-	tasklet_disable(&ctx->context.tasklet);
+	tasklet_disable_in_atomic(&ctx->context.tasklet);
 
 	if (!test_and_set_bit_lock(0, &ctx->flushing_completions)) {
 		context_tasklet((unsigned long)&ctx->context);
[patch 11/14] atm: eni: Use tasklet_disable_in_atomic() in the send() callback
From: Sebastian Andrzej Siewior

The atmdev_ops::send callback which calls tasklet_disable() is invoked with
bottom halves disabled from net_device_ops::ndo_start_xmit(). All other
invocations of tasklet_disable() in this driver happen in preemptible
context.

Change the send() call to use tasklet_disable_in_atomic() which allows
tasklet_disable() to be made sleepable once the remaining atomic context
usage sites are cleaned up.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
Cc: Chas Williams <3ch...@gmail.com>
Cc: linux-atm-gene...@lists.sourceforge.net
Cc: netdev@vger.kernel.org
---
 drivers/atm/eni.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/atm/eni.c
+++ b/drivers/atm/eni.c
@@ -2054,7 +2054,7 @@ static int eni_send(struct atm_vcc *vcc,
 	}
 	submitted++;
 	ATM_SKB(skb)->vcc = vcc;
-	tasklet_disable(&ENI_DEV(vcc->dev)->task);
+	tasklet_disable_in_atomic(&ENI_DEV(vcc->dev)->task);
 	res = do_tx(skb);
 	tasklet_enable(&ENI_DEV(vcc->dev)->task);
 	if (res == enq_ok) return 0;
[patch 10/14] ath9k: Use tasklet_disable_in_atomic()
From: Sebastian Andrzej Siewior

All callers of ath9k_beacon_ensure_primary_slot() are preemptible /
acquire a mutex except for this callchain:

  spin_lock_bh(&sc->sc_pcu_lock);
  ath_complete_reset()
  -> ath9k_calculate_summary_state()
     -> ath9k_beacon_ensure_primary_slot()

It's unclear how that can be disentangled, so use
tasklet_disable_in_atomic() for now. This allows tasklet_disable() to
become sleepable once the remaining atomic users are cleaned up.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
Cc: ath9k-de...@qca.qualcomm.com
Cc: Kalle Valo
Cc: "David S. Miller"
Cc: Jakub Kicinski
Cc: linux-wirel...@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 drivers/net/wireless/ath/ath9k/beacon.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/wireless/ath/ath9k/beacon.c
+++ b/drivers/net/wireless/ath/ath9k/beacon.c
@@ -251,7 +251,7 @@ void ath9k_beacon_ensure_primary_slot(st
 	int first_slot = ATH_BCBUF;
 	int slot;
 
-	tasklet_disable(&sc->bcon_tasklet);
+	tasklet_disable_in_atomic(&sc->bcon_tasklet);
 
 	/* Find first taken slot. */
 	for (slot = 0; slot < ATH_BCBUF; slot++) {
[patch 08/14] net: jme: Replace link-change tasklet with work
From: Sebastian Andrzej Siewior The link change tasklet disables the tasklets for tx/rx processing while upating hw parameters and then enables the tasklets again. This update can also be pushed into a workqueue where it can be performed in preemptible context. This allows tasklet_disable() to become sleeping. Replace the linkch_task tasklet with a work. Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Thomas Gleixner --- drivers/net/ethernet/jme.c | 10 +- drivers/net/ethernet/jme.h |2 +- 2 files changed, 6 insertions(+), 6 deletions(-) --- a/drivers/net/ethernet/jme.c +++ b/drivers/net/ethernet/jme.c @@ -1265,9 +1265,9 @@ jme_stop_shutdown_timer(struct jme_adapt jwrite32f(jme, JME_APMC, apmc); } -static void jme_link_change_tasklet(struct tasklet_struct *t) +static void jme_link_change_work(struct work_struct *work) { - struct jme_adapter *jme = from_tasklet(jme, t, linkch_task); + struct jme_adapter *jme = container_of(work, struct jme_adapter, linkch_task); struct net_device *netdev = jme->dev; int rc; @@ -1510,7 +1510,7 @@ jme_intr_msi(struct jme_adapter *jme, u3 * all other events are ignored */ jwrite32(jme, JME_IEVE, intrstat); - tasklet_schedule(&jme->linkch_task); + schedule_work(&jme->linkch_task); goto out_reenable; } @@ -1832,7 +1832,6 @@ jme_open(struct net_device *netdev) jme_clear_pm_disable_wol(jme); JME_NAPI_ENABLE(jme); - tasklet_setup(&jme->linkch_task, jme_link_change_tasklet); tasklet_setup(&jme->txclean_task, jme_tx_clean_tasklet); tasklet_setup(&jme->rxclean_task, jme_rx_clean_tasklet); tasklet_setup(&jme->rxempty_task, jme_rx_empty_tasklet); @@ -1920,7 +1919,7 @@ jme_close(struct net_device *netdev) JME_NAPI_DISABLE(jme); - tasklet_kill(&jme->linkch_task); + cancel_work_sync(&jme->linkch_task); tasklet_kill(&jme->txclean_task); tasklet_kill(&jme->rxclean_task); tasklet_kill(&jme->rxempty_task); @@ -3035,6 +3034,7 @@ jme_init_one(struct pci_dev *pdev, atomic_set(&jme->rx_empty, 1); tasklet_setup(&jme->pcc_task, jme_pcc_tasklet); + INIT_WORK(&jme->linkch_task, jme_link_change_work); jme->dpi.cur = PCC_P1; jme->reg_ghc = 0; --- a/drivers/net/ethernet/jme.h +++ b/drivers/net/ethernet/jme.h @@ -411,7 +411,7 @@ struct jme_adapter { struct tasklet_struct rxempty_task; struct tasklet_struct rxclean_task; struct tasklet_struct txclean_task; - struct tasklet_struct linkch_task; + struct work_struct linkch_task; struct tasklet_struct pcc_task; unsigned long flags; u32 reg_txcs;
[patch 09/14] net: sundance: Use tasklet_disable_in_atomic().
From: Sebastian Andrzej Siewior

tasklet_disable() is used in the timer callback. This might be
disentangled, but without access to the hardware that's a bit risky.

Replace it with tasklet_disable_in_atomic() so tasklet_disable() can be
changed to a sleep wait once all remaining atomic users are converted.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Thomas Gleixner
Cc: Denis Kirjanov
Cc: "David S. Miller"
Cc: Jakub Kicinski
Cc: netdev@vger.kernel.org
---
 drivers/net/ethernet/dlink/sundance.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/ethernet/dlink/sundance.c
+++ b/drivers/net/ethernet/dlink/sundance.c
@@ -963,7 +963,7 @@ static void tx_timeout(struct net_device
 	unsigned long flag;
 
 	netif_stop_queue(dev);
-	tasklet_disable(&np->tx_tasklet);
+	tasklet_disable_in_atomic(&np->tx_tasklet);
 	iowrite16(0, ioaddr + IntrEnable);
 	printk(KERN_WARNING "%s: Transmit timed out, TxStatus %2.2x "
 		   "TxFrameId %2.2x,"
[patch 07/14] tasklets: Prevent tasklet_unlock_spin_wait() deadlock on RT
tasklet_unlock_spin_wait() spin waits for the TASKLET_STATE_SCHED bit in the tasklet state to be cleared. This works on !RT nicely because the corresponding execution can only happen on a different CPU. On RT softirq processing is preemptible, therefore a task preempting the softirq processing thread can spin forever. Prevent this by invoking local_bh_disable()/enable() inside the loop. In case that the softirq processing thread was preempted by the current task, current will block on the local lock which yields the CPU to the preempted softirq processing thread. If the tasklet is processed on a different CPU then the local_bh_disable()/enable() pair is just a waste of processor cycles. Signed-off-by: Thomas Gleixner Tested-by: Sebastian Andrzej Siewior --- include/linux/interrupt.h |2 +- kernel/softirq.c | 28 +++- 2 files changed, 28 insertions(+), 2 deletions(-) --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -658,7 +658,7 @@ enum TASKLET_STATE_RUN /* Tasklet is running (SMP only) */ }; -#ifdef CONFIG_SMP +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) static inline int tasklet_trylock(struct tasklet_struct *t) { return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state); --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -616,6 +616,32 @@ void tasklet_init(struct tasklet_struct } EXPORT_SYMBOL(tasklet_init); +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) +/* + * Do not use in new code. There is no real reason to invoke this from + * atomic contexts. + */ +void tasklet_unlock_spin_wait(struct tasklet_struct *t) +{ + while (test_bit(TASKLET_STATE_RUN, &(t)->state)) { + if (IS_ENABLED(CONFIG_PREEMPT_RT)) { + /* +* Prevent a live lock when current preempted soft +* interrupt processing or prevents ksoftirqd from +* running. If the tasklet runs on a different CPU +* then this has no effect other than doing the BH +* disable/enable dance for nothing. +*/ + local_bh_disable(); + local_bh_enable(); + } else { + cpu_relax(); + } + } +} +EXPORT_SYMBOL(tasklet_unlock_spin_wait); +#endif + void tasklet_kill(struct tasklet_struct *t) { if (in_interrupt()) @@ -629,7 +655,7 @@ void tasklet_kill(struct tasklet_struct } EXPORT_SYMBOL(tasklet_kill); -#ifdef CONFIG_SMP +#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT) void tasklet_unlock(struct tasklet_struct *t) { smp_mb__before_atomic();
[patch 05/14] tasklets: Replace spin wait in tasklet_unlock_wait()
From: Peter Zijlstra tasklet_unlock_wait() spin waits for TASKLET_STATE_RUN to be cleared. This is wasting CPU cycles in a tight loop which is especially painful in a guest when the CPU running the tasklet is scheduled out. tasklet_unlock_wait() is invoked from tasklet_kill() which is used in teardown paths and not performance critical at all. Replace the spin wait with wait_var_event(). There are no users of tasklet_unlock_wait() which are invoked from atomic contexts. The usage in tasklet_disable() has been replaced temporarily with the spin waiting variant until the atomic users are fixed up and will be converted to the sleep wait variant later. Signed-off-by: Peter Zijlstra Signed-off-by: Thomas Gleixner --- include/linux/interrupt.h | 13 ++--- kernel/softirq.c | 18 ++ 2 files changed, 20 insertions(+), 11 deletions(-) --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -664,17 +664,8 @@ static inline int tasklet_trylock(struct return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state); } -static inline void tasklet_unlock(struct tasklet_struct *t) -{ - smp_mb__before_atomic(); - clear_bit(TASKLET_STATE_RUN, &(t)->state); -} - -static inline void tasklet_unlock_wait(struct tasklet_struct *t) -{ - while (test_bit(TASKLET_STATE_RUN, &t->state)) - cpu_relax(); -} +void tasklet_unlock(struct tasklet_struct *t); +void tasklet_unlock_wait(struct tasklet_struct *t); /* * Do not use in new code. Waiting for tasklets from atomic contexts is --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -25,6 +25,7 @@ #include #include #include +#include #include @@ -621,6 +622,23 @@ void tasklet_kill(struct tasklet_struct } EXPORT_SYMBOL(tasklet_kill); +#ifdef CONFIG_SMP +void tasklet_unlock(struct tasklet_struct *t) +{ + smp_mb__before_atomic(); + clear_bit(TASKLET_STATE_RUN, &t->state); + smp_mb__after_atomic(); + wake_up_var(&t->state); +} +EXPORT_SYMBOL_GPL(tasklet_unlock); + +void tasklet_unlock_wait(struct tasklet_struct *t) +{ + wait_var_event(&t->state, !test_bit(TASKLET_STATE_RUN, &t->state)); +} +EXPORT_SYMBOL_GPL(tasklet_unlock_wait); +#endif + void __init softirq_init(void) { int cpu;
[patch 04/14] tasklets: Use spin wait in tasklet_disable() temporarily
To ease the transition use spin waiting in tasklet_disable() until all
usage sites from atomic context have been cleaned up.

Signed-off-by: Thomas Gleixner
---
 include/linux/interrupt.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -728,7 +728,8 @@ static inline void tasklet_disable_in_at
 
 static inline void tasklet_disable(struct tasklet_struct *t)
 {
 	tasklet_disable_nosync(t);
-	tasklet_unlock_wait(t);
+	/* Spin wait until all atomic users are converted */
+	tasklet_unlock_spin_wait(t);
 	smp_mb();
 }
[patch 01/14] tasklets: Replace barrier() with cpu_relax() in tasklet_unlock_wait()
A barrier() in a tight loop which waits for something to happen on a remote
CPU is a pointless exercise. Replace it with cpu_relax() which allows HT
siblings to make progress.

Signed-off-by: Thomas Gleixner
Tested-by: Sebastian Andrzej Siewior
---
 include/linux/interrupt.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -672,7 +672,8 @@ static inline void tasklet_unlock(struct
 
 static inline void tasklet_unlock_wait(struct tasklet_struct *t)
 {
-	while (test_bit(TASKLET_STATE_RUN, &(t)->state)) { barrier(); }
+	while (test_bit(TASKLET_STATE_RUN, &t->state))
+		cpu_relax();
 }
 #else
 #define tasklet_trylock(t) 1
[patch 00/14] tasklets: Replace the spin wait loops and make it RT safe
This is a follow up to the review comments of the series which makes
softirq processing PREEMPT_RT safe:

 https://lore.kernel.org/r/20201207114743.gk3...@hirez.programming.kicks-ass.net

Peter suggested to replace the spin waiting in tasklet_disable() and
tasklet_kill() with wait_event(). This also gets rid of the ill defined
sched_yield() in tasklet_kill().

Analyzing all usage sites of tasklet_disable() and tasklet_unlock_wait() we
found that most of them are safe to be converted to a sleeping wait. Only a
few instances invoke tasklet_disable() from atomic context. A few bugs
which have been found in the course of this analysis have already been
addressed separately.

The following series takes this approach:

 1) Provide a variant of tasklet_disable() which can be invoked from
    atomic contexts

 2) Convert the usage sites which cannot be easily changed to a
    sleepable wait to use this new function

 3) Replace the spin waits in tasklet_disable() and tasklet_kill() with
    sleepable variants

If this is agreed on then the merging can either be done in bulk or the
first 4 patches could be applied on top of rc2 and tagged for consumption
in the relevant subsystem trees (networking, pci, firewire). In this case
the last patch, which changes the implementation of tasklet_disable(), has
to be postponed until all other changes have reached mainline.

The series is also available from git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git tasklet-2021-03-09

Thanks,

        tglx
[patch 03/14] tasklets: Provide tasklet_disable_in_atomic()
Replacing the spin wait loops in tasklet_unlock_wait() with
wait_var_event() is not possible as a handful of tasklet_disable()
invocations are happening in atomic context. All other invocations are in
teardown paths which can sleep.

Provide tasklet_disable_in_atomic() and tasklet_unlock_spin_wait() to
convert the few atomic use cases over, which allows changing
tasklet_disable() and tasklet_unlock_wait() in a later step.

Signed-off-by: Thomas Gleixner
---
 include/linux/interrupt.h |   22 ++
 1 file changed, 22 insertions(+)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -675,10 +675,21 @@ static inline void tasklet_unlock_wait(s
 	while (test_bit(TASKLET_STATE_RUN, &t->state))
 		cpu_relax();
 }
+
+/*
+ * Do not use in new code. Waiting for tasklets from atomic contexts is
+ * error prone and should be avoided.
+ */
+static inline void tasklet_unlock_spin_wait(struct tasklet_struct *t)
+{
+	while (test_bit(TASKLET_STATE_RUN, &t->state))
+		cpu_relax();
+}
 #else
 static inline int tasklet_trylock(struct tasklet_struct *t) { return 1; }
 static inline void tasklet_unlock(struct tasklet_struct *t) { }
 static inline void tasklet_unlock_wait(struct tasklet_struct *t) { }
+static inline void tasklet_unlock_spin_wait(struct tasklet_struct *t) { }
 #endif
 
 extern void __tasklet_schedule(struct tasklet_struct *t);
@@ -703,6 +714,17 @@ static inline void tasklet_disable_nosyn
 	smp_mb__after_atomic();
 }
 
+/*
+ * Do not use in new code. Disabling tasklets from atomic contexts is
+ * error prone and should be avoided.
+ */
+static inline void tasklet_disable_in_atomic(struct tasklet_struct *t)
+{
+	tasklet_disable_nosync(t);
+	tasklet_unlock_spin_wait(t);
+	smp_mb();
+}
+
 static inline void tasklet_disable(struct tasklet_struct *t)
 {
 	tasklet_disable_nosync(t);
[patch 02/14] tasklets: Use static inlines for stub implementations
Inlines exist for a reason.

Signed-off-by: Thomas Gleixner
Tested-by: Sebastian Andrzej Siewior
---
 include/linux/interrupt.h |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -676,9 +676,9 @@ static inline void tasklet_unlock_wait(s
 		cpu_relax();
 }
 #else
-#define tasklet_trylock(t) 1
-#define tasklet_unlock_wait(t) do { } while (0)
-#define tasklet_unlock(t) do { } while (0)
+static inline int tasklet_trylock(struct tasklet_struct *t) { return 1; }
+static inline void tasklet_unlock(struct tasklet_struct *t) { }
+static inline void tasklet_unlock_wait(struct tasklet_struct *t) { }
 #endif
 
 extern void __tasklet_schedule(struct tasklet_struct *t);
Re: [patch 02/30] genirq: Move status flag checks to core
On Sun, Dec 27 2020 at 11:20, Guenter Roeck wrote:
> On Thu, Dec 10, 2020 at 08:25:38PM +0100, Thomas Gleixner wrote:
> Yes, but that means that irq_check_status_bit() may be called from modules,
> but it is not exported, resulting in build errors such as the following.
>
> arm64:allmodconfig:
>
> ERROR: modpost: "irq_check_status_bit" [drivers/perf/arm_spe_pmu.ko]
> undefined!

Duh. Yes, that lacks an export obviously.

Thanks,

        tglx
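The obvious follow-up is to add the missing export next to the definition in
kernel/irq/manage.c; a sketch against the code introduced in patch 02/30 (the
actual fix may differ in detail):

bool irq_check_status_bit(unsigned int irq, unsigned int bitmask)
{
	struct irq_desc *desc;
	bool res = false;

	rcu_read_lock();
	desc = irq_to_desc(irq);
	if (desc)
		res = !!(desc->status_use_accessors & bitmask);
	rcu_read_unlock();
	return res;
}
EXPORT_SYMBOL_GPL(irq_check_status_bit);	/* the missing export */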
Re: [stabe-rc 5.9 ] sched: core.c:7270 Illegal context switch in RCU-bh read-side critical section!
On Wed, Dec 16 2020 at 15:55, Naresh Kamboju wrote:
> On Tue, 15 Dec 2020 at 23:52, Jakub Kicinski wrote:
>> > Or you could place checks for being in a BH-disable further up in
>> > the code. Or build with CONFIG_DEBUG_INFO=y to allow more precise
>> > interpretation of this stack trace.
>
> I will try to reproduce this warning with DEBUG_INFO=y enabled kernel and
> get back to you with a better crash log.
>
>> My money would be on the option that whatever run on this workqueue
>> before forgot to re-enable BH, but we already have a check for that...
>> Naresh, do you have the full log? Is there nothing like "BUG: workqueue
>> leaked lock" above the splat?

No, because it's in the middle of the work. The workqueue bug triggers when
the work has finished.

So cleanup_net() does:

	synchronize_rcu();	<- might sleep. So up to here it should be fine.

	list_for_each_entry_continue_reverse(ops, &pernet_list, list)
		ops_exit_list(ops, &net_exit_list);

ops_exit_list() is called for each ops which then either invokes
ops->exit() or ops->exit_batch().

So one of those callbacks fails to reenable BH. Adding a check for
!local_bh_disabled() after each invocation of ops->exit() and
ops->exit_batch() should be able to identify the buggy callback.

Thanks,

        tglx
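For illustration, a debug-only sketch of the check suggested above (not an
actual patch; it assumes the current shape of ops_exit_list() in
net/core/net_namespace.c and uses softirq_count(), which is non-zero while BH
is disabled in process context, to detect the offending callback):

static void ops_exit_list(const struct pernet_operations *ops,
			  struct list_head *net_exit_list)
{
	struct net *net;

	if (ops->exit) {
		list_for_each_entry(net, net_exit_list, exit_list) {
			ops->exit(net);
			/* Warn if the callback returned with BH disabled */
			WARN_ONCE(softirq_count(), "%ps left BH disabled\n",
				  ops->exit);
		}
	}
	if (ops->exit_batch) {
		ops->exit_batch(net_exit_list);
		WARN_ONCE(softirq_count(), "%ps left BH disabled\n",
			  ops->exit_batch);
	}
}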
Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts
Andrew,

On Fri, Dec 11 2020 at 22:21, Andrew Cooper wrote:
> On 11/12/2020 21:27, Thomas Gleixner wrote:
>> It's not any different from the hardware example at least not as far as
>> I understood the code.
>
> Xen's event channels do have a couple of quirks.

Why am I not surprised?

> Binding an event channel always results in one spurious event being
> delivered. This is to cover notifications which can get lost during the
> bidirectional setup, or re-setups in certain configurations.
>
> Binding an interdomain or pirq event channel always defaults to vCPU0.
> There is no way to atomically set the affinity while binding. I believe
> the API predates SMP guest support in Xen, and noone has fixed it up
> since.

That's fine. I'm not changing that. What I'm changing is the unwanted and
unnecessary overwriting of the actual affinity mask.

We have a similar issue on real hardware where we can only target _one_ CPU
and not all CPUs in the affinity mask. So we still can preserve the (user)
requested mask and just affine it to one CPU, which is reflected in the
effective affinity mask.

This is the right thing to do for two reasons:

  1) It allows proper interrupt distribution

  2) It does not break (user) requested affinity when the effective target
     CPU goes offline and the affinity mask still contains online CPUs. If
     you overwrite it you lose track of the requested broader mask.

> As a consequence, the guest will observe the event raised on vCPU0 as
> part of setting up the event, even if it attempts to set a different
> affinity immediately afterwards. A little bit of care needs to be taken
> when binding an event channel on vCPUs other than 0, to ensure that the
> callback is safe with respect to any remaining state needing
> initialisation.

That's preserved for all non-percpu interrupts. The percpu variant of VIRQ
and IPIs did binding to vCPU != 0 already before this change.

> Beyond this, there is nothing magic I'm aware of.
>
> We have seen soft lockups before in certain scenarios, simply due to the
> quantity of events hitting vCPU0 before irqbalance gets around to
> spreading the load. This is why there is an attempt to round-robin the
> userspace event channel affinities by default, but I still don't see why
> this would need custom affinity logic itself.

Just because the previous attempt makes no sense for the reasons I outlined
in the changelog. So now with this new spreading mechanics you get the
distribution for all cases:

  1) Post setup using and respecting the default affinity mask which can
     be set as a kernel commandline parameter.

  2) Runtime (user) requested affinity change with a mask which contains
     more than one vCPU. The previous logic always chose the first one in
     the mask.

     So assume userspace affines 4 irqs to CPUs 0-3 and 4 irqs to CPUs 4-7,
     then 4 irqs end up on CPU0 and 4 on CPU4.

     The new algorithm, which is similar to what we have on x86 (minus the
     vector space limitation), picks the CPU which has the least number of
     channels affine to it at that moment.

     If e.g. all 8 CPUs have the same number of vectors before that change,
     then in the example above the first 4 are spread to CPU0-3 and the
     second 4 to CPU4-7.

Thanks,

        tglx
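A rough sketch of that selection logic, for reference (hypothetical names and
per-CPU counter; the real Xen implementation additionally serializes updates
of the counter): pick the online CPU from the allowed mask with the fewest
event channels currently bound to it.

#include <linux/cpumask.h>
#include <linux/percpu.h>

/* Hypothetical per-CPU count of event channels bound to each CPU */
static DEFINE_PER_CPU(unsigned int, channels_on_cpu);

static unsigned int select_target_cpu(const struct cpumask *mask)
{
	unsigned int cpu, best_cpu = UINT_MAX, minch = UINT_MAX;

	for_each_cpu_and(cpu, mask, cpu_online_mask) {
		if (per_cpu(channels_on_cpu, cpu) < minch) {
			minch = per_cpu(channels_on_cpu, cpu);
			best_cpu = cpu;
		}
	}

	/* Fall back to CPU0 if the mask contains no online CPUs */
	if (best_cpu == UINT_MAX)
		return 0;

	return best_cpu;
}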
Re: [patch 03/30] genirq: Move irq_set_lockdep_class() to core
On Fri, Dec 11 2020 at 22:08, Thomas Gleixner wrote:
> On Fri, Dec 11 2020 at 19:53, Andy Shevchenko wrote:
>
>> On Thu, Dec 10, 2020 at 10:14 PM Thomas Gleixner wrote:
>>>
>>> irq_set_lockdep_class() is used from modules and requires irq_to_desc() to
>>> be exported. Move it into the core code which lifts another requirement for
>>> the export.
>>
>> ...
>>
>>> +	if (IS_ENABLED(CONFIG_LOCKDEP))
>>> +		__irq_set_lockdep_class(irq, lock_class, request_class);
>
> You are right. Let me fix that.

No. I have to correct myself. You're wrong.

The inline is evaluated in the compilation units which include that header,
and because the function declaration is unconditional it is happy. Now the
optimizer stage makes the whole thing a NOOP if CONFIG_LOCKDEP=n and
thereby drops the reference to the function, which makes it not required
for linking.

So in the file where the function is implemented:

 #ifdef CONFIG_LOCKDEP
 void __irq_set_lockdep_class()
 {
 }
 #endif

the whole block is either discarded because CONFIG_LOCKDEP is not defined,
or compiled if it is defined, which makes it available for the linker.

And in the latter case the optimizer keeps the call in the inline (it
optimizes the condition away because it's always true).

So in both cases the compiler and the linker are happy and everything works
as expected.

It would fail if the header file had the following:

 #ifdef CONFIG_LOCKDEP
 void __irq_set_lockdep_class();
 #endif

because then it would complain about the missing function prototype when it
evaluates the inline.

Thanks,

        tglx
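A self-contained sketch of the pattern being described, with made-up names
(CONFIG_FOO stands in for CONFIG_LOCKDEP; this is not the actual genirq code):

/* header.h */
void __do_thing(int arg);		/* unconditional declaration */

static inline void do_thing(int arg)
{
	/*
	 * IS_ENABLED() evaluates to a compile-time constant 0 or 1, so with
	 * CONFIG_FOO=n the call below is dead code: the optimizer removes it
	 * and no reference to __do_thing() survives into the object file.
	 */
	if (IS_ENABLED(CONFIG_FOO))
		__do_thing(arg);
}

/* thing.c */
#ifdef CONFIG_FOO
void __do_thing(int arg)
{
	/* only compiled and linked when CONFIG_FOO=y */
}
#endif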
Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts
On Fri, Dec 11 2020 at 09:29, boris ostrovsky wrote:
> On 12/11/20 7:37 AM, Thomas Gleixner wrote:
>> On Fri, Dec 11 2020 at 13:10, Jürgen Groß wrote:
>>> On 11.12.20 00:20, boris.ostrov...@oracle.com wrote:
>>>> On 12/10/20 2:26 PM, Thomas Gleixner wrote:
>>>>> Change the implementation so that the channel is bound to CPU0 at the XEN
>>>>> level and leave the affinity mask alone. At startup of the interrupt
>>>>> affinity will be assigned out of the affinity mask and the XEN binding will
>>>>> be updated.
>>>>
>>>> If that's the case then I wonder whether we need this call at all and
>>>> instead bind at startup time.
>>>
>>> After some discussion with Thomas on IRC and xen-devel archaeology the
>>> result is: this will be needed especially for systems running on a
>>> single vcpu (e.g. small guests), as the .irq_set_affinity() callback
>>> won't be called in this case when starting the irq.
>
> On UP are we not then going to end up with an empty affinity mask? Or
> are we guaranteed to have it set to 1 by interrupt generic code?

An UP kernel does not ever look at the affinity mask. The
chip::irq_set_affinity() callback is not invoked, so the mask is
irrelevant.

A SMP kernel on a UP machine sets CPU0 in the mask, so all is good.

> This is actually why I brought this up in the first place --- a
> potential mismatch between the affinity mask and Xen-specific data
> (e.g. info->cpu and then protocol-specific data in event channel
> code). Even if they are re-synchronized later, at startup time (for
> SMP).

Which is not a problem either. The affinity mask is only relevant for
setting the affinity, but it's not relevant for delivery and never can be.

> I don't see anything that would cause a problem right now but I worry
> that this inconsistency may come up at some point.

As long as the affinity mask does not become part of the event channel
magic this should never matter.

Look at it from hardware:

  interrupt is affine to CPU0

    CPU0 runs:

      set_affinity(CPU0 -> CPU1)
         local_irq_disable()

         --> interrupt is raised in hardware and pending on CPU0

         irq hardware is reconfigured to be affine to CPU1

         local_irq_enable()

         --> interrupt is handled on CPU0

      the next interrupt will be raised on CPU1

So info->cpu, which is registered via the hypercall, binds the 'hardware
delivery', and whenever the new affinity is written it is rebound to some
other CPU and the next interrupt is then raised on this other CPU.

It's not any different from the hardware example, at least not as far as I
understood the code.

Thanks,

        tglx
RE: [patch 14/30] drm/i915/pmu: Replace open coded kstat_irqs() copy
On Fri, Dec 11 2020 at 14:19, David Laight wrote:
> From: Thomas Gleixner
>> You can't catch that. If this really becomes an issue you need a
>> sequence counter around it.
>
> Or just two copies of the high word.
> Provided the accesses are sequenced:
> writer:
>	load high:low
>	add small_value,high:low
>	store high
>	store low
>	store high_copy
> reader:
>	load high_copy
>	load low
>	load high
>	if (high != high_copy)
>		low = 0;

And low = 0 is solving what?

You need to loop back and retry until it's consistent and then it's nothing
else than an open coded sequence count.

Thanks,

        tglx
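For reference, a minimal sketch of the sequence count variant using the
kernel's seqcount API (names are made up; this is not the i915 code):

#include <linux/seqlock.h>

static seqcount_t irq_count_seq = SEQCNT_ZERO(irq_count_seq);
static u64 irq_count;

/* writer (single writer, e.g. the interrupt handler) */
static void irq_count_add(u64 val)
{
	write_seqcount_begin(&irq_count_seq);
	irq_count += val;
	write_seqcount_end(&irq_count_seq);
}

/* reader: retry until no writer interfered with the read */
static u64 irq_count_read(void)
{
	unsigned int seq;
	u64 val;

	do {
		seq = read_seqcount_begin(&irq_count_seq);
		val = irq_count;
	} while (read_seqcount_retry(&irq_count_seq, seq));

	return val;
}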
Re: [patch 03/30] genirq: Move irq_set_lockdep_class() to core
On Fri, Dec 11 2020 at 19:53, Andy Shevchenko wrote:
> On Thu, Dec 10, 2020 at 10:14 PM Thomas Gleixner wrote:
>>
>> irq_set_lockdep_class() is used from modules and requires irq_to_desc() to
>> be exported. Move it into the core code which lifts another requirement for
>> the export.
>
> ...
>
>> +	if (IS_ENABLED(CONFIG_LOCKDEP))
>> +		__irq_set_lockdep_class(irq, lock_class, request_class);

You are right. Let me fix that.
Re: [patch 14/30] drm/i915/pmu: Replace open coded kstat_irqs() copy
On Fri, Dec 11 2020 at 10:13, Tvrtko Ursulin wrote:
> On 10/12/2020 19:25, Thomas Gleixner wrote:
>>
>> Aside of that the count is per interrupt line and therefore takes
>> interrupts from other devices into account which share the interrupt line
>> and are not handled by the graphics driver.
>>
>> Replace it with a pmu private count which only counts interrupts which
>> originate from the graphics card.
>>
>> To avoid atomics or heuristics of some sort make the counter field
>> 'unsigned long'. That limits the count to 4e9 on 32bit which is a lot and
>> postprocessing can easily deal with the occasional wraparound.
>
> After my failed hasty sketch from last night I had a different one which
> was kind of heuristics based (re-reading the upper dword and retrying if
> it changed on 32-bit).

The problem is that there will be two separate modifications for the low
and high word. There are several ways the compiler can translate this, but
the problem is the same for all of them:

     CPU 0                    CPU 1

     load  low
     load  high
     add   low, 1
     addc  high, 0
     store low
                              load high
 --> NMI
                              load low
                              load high and compare
     store high

You can't catch that. If this really becomes an issue you need a sequence
counter around it.

> But you are right - it is okay to at least start
> like this today and if later there is a need we can either do that or
> deal with wrap at PMU read time.

Right.

>> +/*
>> + * Interrupt statistic for PMU. Increments the counter only if the
>> + * interrupt originated from the the GPU so interrupts from a device which
>> + * shares the interrupt line are not accounted.
>> + */
>> +static inline void pmu_irq_stats(struct drm_i915_private *priv,
>
> We never use priv as a local name, it should be either i915 or
> dev_priv.

Sure, will fix.

>> +	/*
>> +	 * A clever compiler translates that into INC. A not so clever one
>> +	 * should at least prevent store tearing.
>> +	 */
>> +	WRITE_ONCE(priv->pmu.irq_count, priv->pmu.irq_count + 1);
>
> Curious, probably more educational for me - given x86_32 and x86_64, and
> the context of it getting called, what is the difference from just doing
> irq_count++?

Several reasons:

  1) The compiler can pretty much do what it wants with cnt++, including
     tearing and whatever.

     https://lwn.net/Articles/816850/ for the full set of insanities.

     Not really a problem here, but

  2) It's annotating the reader and the writer side and documenting that
     this is subject to concurrency

  3) It will prevent KCSAN from complaining about the data race, i.e.
     concurrent modification while reading.

Thanks,

        tglx

>> --- a/drivers/gpu/drm/i915/i915_pmu.c
>> +++ b/drivers/gpu/drm/i915/i915_pmu.c
>> @@ -423,22 +423,6 @@ static enum hrtimer_restart i915_sample(
>>  	return HRTIMER_RESTART;
>>  }
>
> In this file you can also drop the #include line.

Indeed.

Thanks,

        tglx
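To make the three points concrete, a stand-alone sketch (hypothetical names,
not the i915 patch itself) contrasting the plain increment with the annotated
accessors:

#include <linux/compiler.h>

static unsigned long irq_count;		/* written from the irq handler */

static void count_plain(void)
{
	irq_count++;		/* compiler is free to tear or split the store */
}

static void count_annotated(void)
{
	/* One store, documents the concurrent access, keeps KCSAN quiet */
	WRITE_ONCE(irq_count, irq_count + 1);
}

static unsigned long count_read(void)
{
	/* Reader side pairs with the WRITE_ONCE() above */
	return READ_ONCE(irq_count);
}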
Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts
On Fri, Dec 11 2020 at 13:10, Jürgen Groß wrote:
> On 11.12.20 00:20, boris.ostrov...@oracle.com wrote:
>>
>> On 12/10/20 2:26 PM, Thomas Gleixner wrote:
>>> All event channel setups bind the interrupt on CPU0 or the target CPU for
>>> percpu interrupts and overwrite the affinity mask with the corresponding
>>> cpumask. That does not make sense.
>>>
>>> The XEN implementation of irqchip::irq_set_affinity() already picks a
>>> single target CPU out of the affinity mask and the actual target is stored
>>> in the effective CPU mask, so destroying the user chosen affinity mask
>>> which might contain more than one CPU is wrong.
>>>
>>> Change the implementation so that the channel is bound to CPU0 at the XEN
>>> level and leave the affinity mask alone. At startup of the interrupt
>>> affinity will be assigned out of the affinity mask and the XEN binding will
>>> be updated.
>>
>> If that's the case then I wonder whether we need this call at all and
>> instead bind at startup time.
>
> After some discussion with Thomas on IRC and xen-devel archaeology the
> result is: this will be needed especially for systems running on a
> single vcpu (e.g. small guests), as the .irq_set_affinity() callback
> won't be called in this case when starting the irq.

That's right, but not limited to ARM. The same problem exists on x86 UP.

So yes, the call makes sense, but the changelog is not really useful. Let
me add a comment to this.

Thanks,

        tglx
Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts
On Fri, Dec 11 2020 at 07:17, Jürgen Groß wrote:
> On 11.12.20 00:20, boris.ostrov...@oracle.com wrote:
>>
>> On 12/10/20 2:26 PM, Thomas Gleixner wrote:
>>> All event channel setups bind the interrupt on CPU0 or the target CPU for
>>> percpu interrupts and overwrite the affinity mask with the corresponding
>>> cpumask. That does not make sense.
>>>
>>> The XEN implementation of irqchip::irq_set_affinity() already picks a
>>> single target CPU out of the affinity mask and the actual target is stored
>>> in the effective CPU mask, so destroying the user chosen affinity mask
>>> which might contain more than one CPU is wrong.
>>>
>>> Change the implementation so that the channel is bound to CPU0 at the XEN
>>> level and leave the affinity mask alone. At startup of the interrupt
>>> affinity will be assigned out of the affinity mask and the XEN binding will
>>> be updated.
>>
>> If that's the case then I wonder whether we need this call at all and
>> instead bind at startup time.
>
> This binding to cpu0 was introduced with commit 97253eeeb792d61ed2
> and I have no reason to believe the underlying problem has been
> eliminated.

  "The kernel-side VCPU binding was not being correctly set for newly
   allocated or bound interdomain events. In ARM guests where 2-level
   events were used, this would result in no interdomain events being
   handled because the kernel-side VCPU masks would all be clear.

   x86 guests would work because the irq affinity was set during irq setup
   and this would set the correct kernel-side VCPU binding."

I'm not convinced that this is really correctly analyzed because affinity
setting is done at irq startup:

        switch (__irq_startup_managed(desc, aff, force)) {
        case IRQ_STARTUP_NORMAL:
                ret = __irq_startup(desc);
                irq_setup_affinity(desc);
                break;

which is completely architecture agnostic. So why should this magically
work on x86 and not on ARM if both are using the same XEN irqchip with the
same irqchip callbacks?

Thanks,

        tglx
Re: [patch 27/30] xen/events: Only force affinity mask for percpu interrupts
On Thu, Dec 10 2020 at 18:20, boris ostrovsky wrote:
> On 12/10/20 2:26 PM, Thomas Gleixner wrote:
>> All event channel setups bind the interrupt on CPU0 or the target CPU for
>> percpu interrupts and overwrite the affinity mask with the corresponding
>> cpumask. That does not make sense.
>>
>> The XEN implementation of irqchip::irq_set_affinity() already picks a
>> single target CPU out of the affinity mask and the actual target is stored
>> in the effective CPU mask, so destroying the user chosen affinity mask
>> which might contain more than one CPU is wrong.
>>
>> Change the implementation so that the channel is bound to CPU0 at the XEN
>> level and leave the affinity mask alone. At startup of the interrupt
>> affinity will be assigned out of the affinity mask and the XEN binding will
>> be updated.
>
> If that's the case then I wonder whether we need this call at all and
> instead bind at startup time.

I was wondering about that, but my knowledge about the Xen internal
requirements is pretty limited. The current set at least survived basic
testing by Jürgen.

Thanks,

        tglx
Re: [patch 24/30] xen/events: Remove unused bind_evtchn_to_irq_lateeoi()
On Thu, Dec 10 2020 at 18:19, boris ostrovsky wrote:
> On 12/10/20 2:26 PM, Thomas Gleixner wrote:
>> -EXPORT_SYMBOL_GPL(bind_evtchn_to_irq_lateeoi);
>
> include/xen/events.h also needs to be updated (and in the next patch for
> xen_set_affinity_evtchn() as well).

Darn, I lost that.
[patch 01/30] genirq: Move irq_has_action() into core code
This function uses irq_to_desc() and is going to be used by modules to replace the open coded irq_to_desc() (ab)usage. The final goal is to remove the export of irq_to_desc() so driver cannot fiddle with it anymore. Move it into the core code and fixup the usage sites to include the proper header. Signed-off-by: Thomas Gleixner --- arch/alpha/kernel/sys_jensen.c |2 +- arch/x86/kernel/topology.c |1 + include/linux/interrupt.h |1 + include/linux/irqdesc.h|7 +-- kernel/irq/manage.c| 17 + 5 files changed, 21 insertions(+), 7 deletions(-) --- a/arch/alpha/kernel/sys_jensen.c +++ b/arch/alpha/kernel/sys_jensen.c @@ -7,7 +7,7 @@ * * Code supporting the Jensen. */ - +#include #include #include #include --- a/arch/x86/kernel/topology.c +++ b/arch/x86/kernel/topology.c @@ -25,6 +25,7 @@ * * Send feedback to */ +#include #include #include #include --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -232,6 +232,7 @@ extern void devm_free_irq(struct device # define local_irq_enable_in_hardirq() local_irq_enable() #endif +bool irq_has_action(unsigned int irq); extern void disable_irq_nosync(unsigned int irq); extern bool disable_hardirq(unsigned int irq); extern void disable_irq(unsigned int irq); --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -179,12 +179,7 @@ int handle_domain_nmi(struct irq_domain /* Test to see if a driver has successfully requested an irq */ static inline int irq_desc_has_action(struct irq_desc *desc) { - return desc->action != NULL; -} - -static inline int irq_has_action(unsigned int irq) -{ - return irq_desc_has_action(irq_to_desc(irq)); + return desc && desc->action != NULL; } /** --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -2752,3 +2752,20 @@ int irq_set_irqchip_state(unsigned int i return err; } EXPORT_SYMBOL_GPL(irq_set_irqchip_state); + +/** + * irq_has_action - Check whether an interrupt is requested + * @irq: The linux irq number + * + * Returns: A snapshot of the current state + */ +bool irq_has_action(unsigned int irq) +{ + bool res; + + rcu_read_lock(); + res = irq_desc_has_action(irq_to_desc(irq)); + rcu_read_unlock(); + return res; +} +EXPORT_SYMBOL_GPL(irq_has_action);
[patch 03/30] genirq: Move irq_set_lockdep_class() to core
irq_set_lockdep_class() is used from modules and requires irq_to_desc() to be exported. Move it into the core code which lifts another requirement for the export. Signed-off-by: Thomas Gleixner --- include/linux/irqdesc.h | 10 -- kernel/irq/irqdesc.c| 14 ++ 2 files changed, 18 insertions(+), 6 deletions(-) --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -240,16 +240,14 @@ static inline bool irq_is_percpu_devid(u return irq_check_status_bit(irq, IRQ_PER_CPU_DEVID); } +void __irq_set_lockdep_class(unsigned int irq, struct lock_class_key *lock_class, +struct lock_class_key *request_class); static inline void irq_set_lockdep_class(unsigned int irq, struct lock_class_key *lock_class, struct lock_class_key *request_class) { - struct irq_desc *desc = irq_to_desc(irq); - - if (desc) { - lockdep_set_class(&desc->lock, lock_class); - lockdep_set_class(&desc->request_mutex, request_class); - } + if (IS_ENABLED(CONFIG_LOCKDEP)) + __irq_set_lockdep_class(irq, lock_class, request_class); } #endif --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -968,3 +968,17 @@ unsigned int kstat_irqs_usr(unsigned int rcu_read_unlock(); return sum; } + +#ifdef CONFIG_LOCKDEP +void __irq_set_lockdep_class(unsigned int irq, struct lock_class_key *lock_class, +struct lock_class_key *request_class) +{ + struct irq_desc *desc = irq_to_desc(irq); + + if (desc) { + lockdep_set_class(&desc->lock, lock_class); + lockdep_set_class(&desc->request_mutex, request_class); + } +} +EXPORT_SYMBOL_GPL(irq_set_lockdep_class); +#endif
[patch 06/30] parisc/irq: Simplify irq count output for /proc/interrupts
The SMP variant works perfectly fine on UP as well.

Signed-off-by: Thomas Gleixner
Cc: "James E.J. Bottomley"
Cc: Helge Deller
Cc: afzal mohammed
Cc: linux-par...@vger.kernel.org
---
 arch/parisc/kernel/irq.c |    5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

--- a/arch/parisc/kernel/irq.c
+++ b/arch/parisc/kernel/irq.c
@@ -216,12 +216,9 @@ int show_interrupts(struct seq_file *p,
 	if (!action)
 		goto skip;
 	seq_printf(p, "%3d: ", i);
-#ifdef CONFIG_SMP
+
 	for_each_online_cpu(j)
 		seq_printf(p, "%10u ", kstat_irqs_cpu(i, j));
-#else
-	seq_printf(p, "%10u ", kstat_irqs(i));
-#endif
 
 	seq_printf(p, " %14s", irq_desc_get_chip(desc)->name);
 #ifndef PARISC_IRQ_CR16_COUNTS
[patch 07/30] genirq: Make kstat_irqs() static
No more users outside the core code. Signed-off-by: Thomas Gleixner --- include/linux/kernel_stat.h |1 - kernel/irq/irqdesc.c| 19 ++- 2 files changed, 6 insertions(+), 14 deletions(-) --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -67,7 +67,6 @@ static inline unsigned int kstat_softirq /* * Number of interrupts per specific IRQ source, since bootup */ -extern unsigned int kstat_irqs(unsigned int irq); extern unsigned int kstat_irqs_usr(unsigned int irq); /* --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -924,15 +924,7 @@ static bool irq_is_nmi(struct irq_desc * return desc->istate & IRQS_NMI; } -/** - * kstat_irqs - Get the statistics for an interrupt - * @irq: The interrupt number - * - * Returns the sum of interrupt counts on all cpus since boot for - * @irq. The caller must ensure that the interrupt is not removed - * concurrently. - */ -unsigned int kstat_irqs(unsigned int irq) +static unsigned int kstat_irqs(unsigned int irq) { struct irq_desc *desc = irq_to_desc(irq); unsigned int sum = 0; @@ -951,13 +943,14 @@ unsigned int kstat_irqs(unsigned int irq } /** - * kstat_irqs_usr - Get the statistics for an interrupt + * kstat_irqs_usr - Get the statistics for an interrupt from thread context * @irq: The interrupt number * * Returns the sum of interrupt counts on all cpus since boot for @irq. - * Contrary to kstat_irqs() this can be called from any context. - * It uses rcu since a concurrent removal of an interrupt descriptor is - * observing an rcu grace period before delayed_free_desc()/irq_kobj_release(). + * + * It uses rcu to protect the access since a concurrent removal of an + * interrupt descriptor is observing an rcu grace period before + * delayed_free_desc()/irq_kobj_release(). */ unsigned int kstat_irqs_usr(unsigned int irq) {
[patch 02/30] genirq: Move status flag checks to core
These checks are used by modules and prevent the removal of the export of irq_to_desc(). Move the accessor into the core. Signed-off-by: Thomas Gleixner --- include/linux/irqdesc.h | 17 + kernel/irq/manage.c | 17 + 2 files changed, 22 insertions(+), 12 deletions(-) --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -223,28 +223,21 @@ irq_set_chip_handler_name_locked(struct data->chip = chip; } +bool irq_check_status_bit(unsigned int irq, unsigned int bitmask); + static inline bool irq_balancing_disabled(unsigned int irq) { - struct irq_desc *desc; - - desc = irq_to_desc(irq); - return desc->status_use_accessors & IRQ_NO_BALANCING_MASK; + return irq_check_status_bit(irq, IRQ_NO_BALANCING_MASK); } static inline bool irq_is_percpu(unsigned int irq) { - struct irq_desc *desc; - - desc = irq_to_desc(irq); - return desc->status_use_accessors & IRQ_PER_CPU; + return irq_check_status_bit(irq, IRQ_PER_CPU); } static inline bool irq_is_percpu_devid(unsigned int irq) { - struct irq_desc *desc; - - desc = irq_to_desc(irq); - return desc->status_use_accessors & IRQ_PER_CPU_DEVID; + return irq_check_status_bit(irq, IRQ_PER_CPU_DEVID); } static inline void --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -2769,3 +2769,23 @@ bool irq_has_action(unsigned int irq) return res; } EXPORT_SYMBOL_GPL(irq_has_action); + +/** + * irq_check_status_bit - Check whether bits in the irq descriptor status are set + * @irq: The linux irq number + * @bitmask: The bitmask to evaluate + * + * Returns: True if one of the bits in @bitmask is set + */ +bool irq_check_status_bit(unsigned int irq, unsigned int bitmask) +{ + struct irq_desc *desc; + bool res = false; + + rcu_read_lock(); + desc = irq_to_desc(irq); + if (desc) + res = !!(desc->status_use_accessors & bitmask); + rcu_read_unlock(); + return res; +}
[patch 21/30] net/mlx4: Use effective interrupt affinity
Using the interrupt affinity mask for checking locality is not really
working well on architectures which support effective affinity masks.

The affinity mask is either the system wide default or set by user space,
but the architecture can or even must reduce the mask to the effective set,
which means that checking the affinity mask itself does not really tell
about the actual target CPUs.

Signed-off-by: Thomas Gleixner
Cc: Tariq Toukan
Cc: "David S. Miller"
Cc: Jakub Kicinski
Cc: netdev@vger.kernel.org
Cc: linux-r...@vger.kernel.org
---
 drivers/net/ethernet/mellanox/mlx4/en_cq.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c
@@ -117,7 +117,7 @@ int mlx4_en_activate_cq(struct mlx4_en_p
 			assigned_eq = true;
 		}
 		irq = mlx4_eq_get_irq(mdev->dev, cq->vector);
-		cq->aff_mask = irq_get_affinity_mask(irq);
+		cq->aff_mask = irq_get_effective_affinity_mask(irq);
 	} else {
 		/* For TX we use the same irq per
 		ring we assigned for the RX    */
[patch 18/30] PCI: xilinx-nwl: Use irq_data_get_irq_chip_data()
Going through a full irq descriptor lookup instead of just using the proper helper function which provides direct access is suboptimal. In fact it _is_ wrong because the chip callback needs to get the chip data which is relevant for the chip while using the irq descriptor variant returns the irq chip data of the top level chip of a hierarchy. It does not matter in this case because the chip is the top level chip, but that doesn't make it more correct. Signed-off-by: Thomas Gleixner Cc: Lorenzo Pieralisi Cc: Rob Herring Cc: Bjorn Helgaas Cc: Michal Simek Cc: linux-...@vger.kernel.org Cc: linux-arm-ker...@lists.infradead.org --- drivers/pci/controller/pcie-xilinx-nwl.c |8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) --- a/drivers/pci/controller/pcie-xilinx-nwl.c +++ b/drivers/pci/controller/pcie-xilinx-nwl.c @@ -379,13 +379,11 @@ static void nwl_pcie_msi_handler_low(str static void nwl_mask_leg_irq(struct irq_data *data) { - struct irq_desc *desc = irq_to_desc(data->irq); - struct nwl_pcie *pcie; + struct nwl_pcie *pcie = irq_data_get_irq_chip_data(data); unsigned long flags; u32 mask; u32 val; - pcie = irq_desc_get_chip_data(desc); mask = 1 << (data->hwirq - 1); raw_spin_lock_irqsave(&pcie->leg_mask_lock, flags); val = nwl_bridge_readl(pcie, MSGF_LEG_MASK); @@ -395,13 +393,11 @@ static void nwl_mask_leg_irq(struct irq_ static void nwl_unmask_leg_irq(struct irq_data *data) { - struct irq_desc *desc = irq_to_desc(data->irq); - struct nwl_pcie *pcie; + struct nwl_pcie *pcie = irq_data_get_irq_chip_data(data); unsigned long flags; u32 mask; u32 val; - pcie = irq_desc_get_chip_data(desc); mask = 1 << (data->hwirq - 1); raw_spin_lock_irqsave(&pcie->leg_mask_lock, flags); val = nwl_bridge_readl(pcie, MSGF_LEG_MASK);
[patch 25/30] xen/events: Remove disfunct affinity spreading
This function can only ever work when the event channels: - are already established - interrupts assigned to them - the affinity has been set by user space already because any newly set up event channel is forced to be bound to CPU0 and the affinity mask of the interrupt is forced to contain cpumask_of(0). As the CPU0 enforcement was in place _before_ this was implemented it's entirely unclear how that can ever have worked at all. Remove it as preparation for doing it proper. Signed-off-by: Thomas Gleixner Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: xen-de...@lists.xenproject.org --- drivers/xen/events/events_base.c |9 - drivers/xen/evtchn.c | 34 +- 2 files changed, 1 insertion(+), 42 deletions(-) --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -1696,15 +1696,6 @@ static int set_affinity_irq(struct irq_d return ret; } -/* To be called with desc->lock held. */ -int xen_set_affinity_evtchn(struct irq_desc *desc, unsigned int tcpu) -{ - struct irq_data *d = irq_desc_get_irq_data(desc); - - return set_affinity_irq(d, cpumask_of(tcpu), false); -} -EXPORT_SYMBOL_GPL(xen_set_affinity_evtchn); - static void enable_dynirq(struct irq_data *data) { evtchn_port_t evtchn = evtchn_from_irq(data->irq); --- a/drivers/xen/evtchn.c +++ b/drivers/xen/evtchn.c @@ -421,36 +421,6 @@ static void evtchn_unbind_from_user(stru del_evtchn(u, evtchn); } -static DEFINE_PER_CPU(int, bind_last_selected_cpu); - -static void evtchn_bind_interdom_next_vcpu(evtchn_port_t evtchn) -{ - unsigned int selected_cpu, irq; - struct irq_desc *desc; - unsigned long flags; - - irq = irq_from_evtchn(evtchn); - desc = irq_to_desc(irq); - - if (!desc) - return; - - raw_spin_lock_irqsave(&desc->lock, flags); - selected_cpu = this_cpu_read(bind_last_selected_cpu); - selected_cpu = cpumask_next_and(selected_cpu, - desc->irq_common_data.affinity, cpu_online_mask); - - if (unlikely(selected_cpu >= nr_cpu_ids)) - selected_cpu = cpumask_first_and(desc->irq_common_data.affinity, - cpu_online_mask); - - this_cpu_write(bind_last_selected_cpu, selected_cpu); - - /* unmask expects irqs to be disabled */ - xen_set_affinity_evtchn(desc, selected_cpu); - raw_spin_unlock_irqrestore(&desc->lock, flags); -} - static long evtchn_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { @@ -508,10 +478,8 @@ static long evtchn_ioctl(struct file *fi break; rc = evtchn_bind_to_user(u, bind_interdomain.local_port); - if (rc == 0) { + if (rc == 0) rc = bind_interdomain.local_port; - evtchn_bind_interdom_next_vcpu(rc); - } break; }
[patch 26/30] xen/events: Use immediate affinity setting
There is absolutely no reason to mimic the x86 deferred affinity setting. This mechanism is required to handle the hardware induced issues of IO/APIC and MSI and is not in use when the interrupts are remapped. XEN does not need this and can simply change the affinity from the calling context. The core code invokes this with the interrupt descriptor lock held so it is fully serialized against any other operation. Mark the interrupts with IRQ_MOVE_PCNTXT to disable the deferred affinity setting. The conditional mask/unmask operation is already handled in xen_rebind_evtchn_to_cpu(). This makes XEN on x86 use the same mechanics as on e.g. ARM64 where deferred affinity setting is not required and not implemented and the code path in the ack functions is compiled out. Signed-off-by: Thomas Gleixner Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: xen-de...@lists.xenproject.org --- drivers/xen/events/events_base.c | 35 +-- 1 file changed, 9 insertions(+), 26 deletions(-) --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -628,6 +628,11 @@ static void xen_irq_init(unsigned irq) info->refcnt = -1; set_info_for_irq(irq, info); + /* +* Interrupt affinity setting can be immediate. No point +* in delaying it until an interrupt is handled. +*/ + irq_set_status_flags(irq, IRQ_MOVE_PCNTXT); INIT_LIST_HEAD(&info->eoi_list); list_add_tail(&info->list, &xen_irq_list_head); @@ -739,18 +744,7 @@ static void eoi_pirq(struct irq_data *da if (!VALID_EVTCHN(evtchn)) return; - if (unlikely(irqd_is_setaffinity_pending(data)) && - likely(!irqd_irq_disabled(data))) { - int masked = test_and_set_mask(evtchn); - - clear_evtchn(evtchn); - - irq_move_masked_irq(data); - - if (!masked) - unmask_evtchn(evtchn); - } else - clear_evtchn(evtchn); + clear_evtchn(evtchn); if (pirq_needs_eoi(data->irq)) { rc = HYPERVISOR_physdev_op(PHYSDEVOP_eoi, &eoi); @@ -1641,7 +1635,6 @@ void rebind_evtchn_irq(evtchn_port_t evt mutex_unlock(&irq_mapping_update_lock); bind_evtchn_to_cpu(evtchn, info->cpu); - /* This will be deferred until interrupt is processed */ irq_set_affinity(irq, cpumask_of(info->cpu)); /* Unmask the event channel. */ @@ -1688,8 +1681,9 @@ static int set_affinity_irq(struct irq_d bool force) { unsigned tcpu = cpumask_first_and(dest, cpu_online_mask); - int ret = xen_rebind_evtchn_to_cpu(evtchn_from_irq(data->irq), tcpu); + int ret; + ret = xen_rebind_evtchn_to_cpu(evtchn_from_irq(data->irq), tcpu); if (!ret) irq_data_update_effective_affinity(data, cpumask_of(tcpu)); @@ -1719,18 +1713,7 @@ static void ack_dynirq(struct irq_data * if (!VALID_EVTCHN(evtchn)) return; - if (unlikely(irqd_is_setaffinity_pending(data)) && - likely(!irqd_irq_disabled(data))) { - int masked = test_and_set_mask(evtchn); - - clear_evtchn(evtchn); - - irq_move_masked_irq(data); - - if (!masked) - unmask_evtchn(evtchn); - } else - clear_evtchn(evtchn); + clear_evtchn(evtchn); } static void mask_ack_dynirq(struct irq_data *data)
[patch 08/30] genirq: Provide irq_desc_kstat_cpu()
Most users of kstat_irqs_cpu() have the irq descriptor already. No point in calling into the core code and looking it up once more. Use it in per_cpu_count_show() to start with. Signed-off-by: Thomas Gleixner --- include/linux/irqdesc.h |6 ++ kernel/irq/irqdesc.c|4 ++-- 2 files changed, 8 insertions(+), 2 deletions(-) --- a/include/linux/irqdesc.h +++ b/include/linux/irqdesc.h @@ -113,6 +113,12 @@ static inline void irq_unlock_sparse(voi extern struct irq_desc irq_desc[NR_IRQS]; #endif +static inline unsigned int irq_desc_kstat_cpu(struct irq_desc *desc, + unsigned int cpu) +{ + return desc->kstat_irqs ? *per_cpu_ptr(desc->kstat_irqs, cpu) : 0; +} + static inline struct irq_desc *irq_data_to_desc(struct irq_data *data) { return container_of(data->common, struct irq_desc, irq_common_data); --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -147,12 +147,12 @@ static ssize_t per_cpu_count_show(struct struct kobj_attribute *attr, char *buf) { struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj); - int cpu, irq = desc->irq_data.irq; ssize_t ret = 0; char *p = ""; + int cpu; for_each_possible_cpu(cpu) { - unsigned int c = kstat_irqs_cpu(irq, cpu); + unsigned int c = irq_desc_kstat_cpu(desc, cpu); ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%s%u", p, c); p = ",";
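The conversions in the following patches all look the same; as a sketch (hypothetical function, not part of the series), a caller which already holds the descriptor reads the per-CPU counts directly instead of going through kstat_irqs_cpu(), which would look the descriptor up again:

/* Sketch: per-CPU counts for a descriptor the caller already holds */
#include <linux/cpumask.h>
#include <linux/irqdesc.h>
#include <linux/seq_file.h>

static void foo_show_counts(struct seq_file *p, struct irq_desc *desc)
{
	unsigned int cpu;

	for_each_online_cpu(cpu)
		seq_printf(p, "%10u ", irq_desc_kstat_cpu(desc, cpu));
	seq_putc(p, '\n');
}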
[patch 11/30] parisc/irq: Use irq_desc_kstat_cpu() in show_interrupts()
The irq descriptor is already there, no need to look it up again. Signed-off-by: Thomas Gleixner Cc: "James E.J. Bottomley" Cc: Helge Deller Cc: afzal mohammed Cc: linux-par...@vger.kernel.org --- arch/parisc/kernel/irq.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/parisc/kernel/irq.c +++ b/arch/parisc/kernel/irq.c @@ -218,7 +218,7 @@ int show_interrupts(struct seq_file *p, seq_printf(p, "%3d: ", i); for_each_online_cpu(j) - seq_printf(p, "%10u ", kstat_irqs_cpu(i, j)); + seq_printf(p, "%10u ", irq_desc_kstat_cpu(desc, j)); seq_printf(p, " %14s", irq_desc_get_chip(desc)->name); #ifndef PARISC_IRQ_CR16_COUNTS
[patch 10/30] arm64/smp: Use irq_desc_kstat_cpu() in arch_show_interrupts()
The irq descriptor is already there, no need to look it up again. Signed-off-by: Thomas Gleixner Cc: Mark Rutland Cc: Catalin Marinas Cc: Will Deacon Cc: Marc Zyngier Cc: linux-arm-ker...@lists.infradead.org --- arch/arm64/kernel/smp.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -809,7 +809,7 @@ int arch_show_interrupts(struct seq_file seq_printf(p, "%*s%u:%s", prec - 1, "IPI", i, prec >= 4 ? " " : ""); for_each_online_cpu(cpu) - seq_printf(p, "%10u ", kstat_irqs_cpu(irq, cpu)); + seq_printf(p, "%10u ", irq_desc_kstat_cpu(ipi_desc[i], cpu)); seq_printf(p, " %s\n", ipi_types[i]); }
[patch 12/30] s390/irq: Use irq_desc_kstat_cpu() in show_msi_interrupt()
The irq descriptor is already there, no need to look it up again. Signed-off-by: Thomas Gleixner Cc: Christian Borntraeger Cc: Heiko Carstens Cc: linux-s...@vger.kernel.org --- arch/s390/kernel/irq.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/s390/kernel/irq.c +++ b/arch/s390/kernel/irq.c @@ -124,7 +124,7 @@ static void show_msi_interrupt(struct se raw_spin_lock_irqsave(&desc->lock, flags); seq_printf(p, "%3d: ", irq); for_each_online_cpu(cpu) - seq_printf(p, "%10u ", kstat_irqs_cpu(irq, cpu)); + seq_printf(p, "%10u ", irq_desc_kstat_cpu(desc, cpu)); if (desc->irq_data.chip) seq_printf(p, " %8s", desc->irq_data.chip->name);
[patch 13/30] drm/i915/lpe_audio: Remove pointless irq_to_desc() usage
Nothing uses the result and nothing should ever use it in driver code. Signed-off-by: Thomas Gleixner Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: David Airlie Cc: Daniel Vetter Cc: Pankaj Bharadiya Cc: Chris Wilson Cc: Wambui Karuga Cc: intel-...@lists.freedesktop.org Cc: dri-de...@lists.freedesktop.org --- drivers/gpu/drm/i915/display/intel_lpe_audio.c |4 1 file changed, 4 deletions(-) --- a/drivers/gpu/drm/i915/display/intel_lpe_audio.c +++ b/drivers/gpu/drm/i915/display/intel_lpe_audio.c @@ -297,13 +297,9 @@ int intel_lpe_audio_init(struct drm_i915 */ void intel_lpe_audio_teardown(struct drm_i915_private *dev_priv) { - struct irq_desc *desc; - if (!HAS_LPE_AUDIO(dev_priv)) return; - desc = irq_to_desc(dev_priv->lpe_audio.irq); - lpe_audio_platdev_destroy(dev_priv); irq_free_desc(dev_priv->lpe_audio.irq);
[patch 19/30] PCI: mobiveil: Use irq_data_get_irq_chip_data()
Going through a full irq descriptor lookup instead of just using the proper helper function which provides direct access is suboptimal. In fact it _is_ wrong because the chip callback needs to get the chip data which is relevant for the chip while using the irq descriptor variant returns the irq chip data of the top level chip of a hierarchy. It does not matter in this case because the chip is the top level chip, but that doesn't make it more correct. Signed-off-by: Thomas Gleixner Cc: Karthikeyan Mitran Cc: Hou Zhiqiang Cc: Lorenzo Pieralisi Cc: Rob Herring Cc: Bjorn Helgaas Cc: linux-...@vger.kernel.org --- drivers/pci/controller/mobiveil/pcie-mobiveil-host.c |8 ++-- 1 file changed, 2 insertions(+), 6 deletions(-) --- a/drivers/pci/controller/mobiveil/pcie-mobiveil-host.c +++ b/drivers/pci/controller/mobiveil/pcie-mobiveil-host.c @@ -306,13 +306,11 @@ int mobiveil_host_init(struct mobiveil_p static void mobiveil_mask_intx_irq(struct irq_data *data) { - struct irq_desc *desc = irq_to_desc(data->irq); - struct mobiveil_pcie *pcie; + struct mobiveil_pcie *pcie = irq_data_get_irq_chip_data(data); struct mobiveil_root_port *rp; unsigned long flags; u32 mask, shifted_val; - pcie = irq_desc_get_chip_data(desc); rp = &pcie->rp; mask = 1 << ((data->hwirq + PAB_INTX_START) - 1); raw_spin_lock_irqsave(&rp->intx_mask_lock, flags); @@ -324,13 +322,11 @@ static void mobiveil_mask_intx_irq(struc static void mobiveil_unmask_intx_irq(struct irq_data *data) { - struct irq_desc *desc = irq_to_desc(data->irq); - struct mobiveil_pcie *pcie; + struct mobiveil_pcie *pcie = irq_data_get_irq_chip_data(data); struct mobiveil_root_port *rp; unsigned long flags; u32 shifted_val, mask; - pcie = irq_desc_get_chip_data(desc); rp = &pcie->rp; mask = 1 << ((data->hwirq + PAB_INTX_START) - 1); raw_spin_lock_irqsave(&rp->intx_mask_lock, flags);
[patch 16/30] mfd: ab8500-debugfs: Remove the racy fiddling with irq_desc
First of all drivers have absolutely no business to dig into the internals of an irq descriptor. That's core code and subject to change. All of this information is readily available to /proc/interrupts in a safe and race free way. Remove the inspection code which is a blatant violation of subsystem boundaries and racy against concurrent modifications of the interrupt descriptor. Print the irq line instead so the information can be looked up in a sane way in /proc/interrupts. Signed-off-by: Thomas Gleixner Cc: Linus Walleij Cc: Lee Jones Cc: linux-arm-ker...@lists.infradead.org --- drivers/mfd/ab8500-debugfs.c | 16 +++- 1 file changed, 3 insertions(+), 13 deletions(-) --- a/drivers/mfd/ab8500-debugfs.c +++ b/drivers/mfd/ab8500-debugfs.c @@ -1513,24 +1513,14 @@ static int ab8500_interrupts_show(struct { int line; - seq_puts(s, "name: number: number of: wake:\n"); + seq_puts(s, "name: number: irq: number of: wake:\n"); for (line = 0; line < num_interrupt_lines; line++) { - struct irq_desc *desc = irq_to_desc(line + irq_first); - - seq_printf(s, "%3i: %6i %4i", + seq_printf(s, "%3i: %6i %4i %4i\n", line, + line + irq_first, num_interrupts[line], num_wake_interrupts[line]); - - if (desc && desc->name) - seq_printf(s, "-%-8s", desc->name); - if (desc && desc->action) { - struct irqaction *action = desc->action; - - seq_printf(s, " %s", action->name); - while ((action = action->next) != NULL) - seq_printf(s, ", %s", action->name); } seq_putc(s, '\n'); }
[patch 30/30] genirq: Remove export of irq_to_desc()
No more (ab)use in modules finally. Remove the export so there won't come new ones. Signed-off-by: Thomas Gleixner --- kernel/irq/irqdesc.c |1 - 1 file changed, 1 deletion(-) --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -352,7 +352,6 @@ struct irq_desc *irq_to_desc(unsigned in { return radix_tree_lookup(&irq_desc_tree, irq); } -EXPORT_SYMBOL(irq_to_desc); static void delete_irq_desc(unsigned int irq) {
[patch 15/30] pinctrl: nomadik: Use irq_has_action()
Let the core code do the fiddling with irq_desc. Signed-off-by: Thomas Gleixner Cc: Linus Walleij Cc: linux-arm-ker...@lists.infradead.org Cc: linux-g...@vger.kernel.org --- drivers/pinctrl/nomadik/pinctrl-nomadik.c |3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) --- a/drivers/pinctrl/nomadik/pinctrl-nomadik.c +++ b/drivers/pinctrl/nomadik/pinctrl-nomadik.c @@ -948,7 +948,6 @@ static void nmk_gpio_dbg_show_one(struct (mode < 0) ? "unknown" : modes[mode]); } else { int irq = chip->to_irq(chip, offset); - struct irq_desc *desc = irq_to_desc(irq); const int pullidx = pull ? 1 : 0; int val; static const char * const pulls[] = { @@ -969,7 +968,7 @@ static void nmk_gpio_dbg_show_one(struct * This races with request_irq(), set_irq_type(), * and set_irq_wake() ... but those are "rare". */ - if (irq > 0 && desc && desc->action) { + if (irq > 0 && irq_has_action(irq)) { char *trigger; if (nmk_chip->edge_rising & BIT(offset))
[patch 20/30] net/mlx4: Replace irq_to_desc() abuse
No driver has any business with the internals of an interrupt descriptor. Storing a pointer to it just to use yet another helper at the actual usage site to retrieve the affinity mask is creative at best. Just because C does not allow encapsulation does not mean that the kernel has no limits. Retrieve a pointer to the affinity mask itself and use that. It's still using an interface which is usually not for random drivers, but definitely less hideous than the previous hack. Signed-off-by: Thomas Gleixner Cc: Tariq Toukan Cc: "David S. Miller" Cc: Jakub Kicinski Cc: netdev@vger.kernel.org Cc: linux-r...@vger.kernel.org --- drivers/net/ethernet/mellanox/mlx4/en_cq.c |8 +++- drivers/net/ethernet/mellanox/mlx4/en_rx.c |6 +- drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |3 ++- 3 files changed, 6 insertions(+), 11 deletions(-) --- a/drivers/net/ethernet/mellanox/mlx4/en_cq.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_cq.c @@ -90,7 +90,7 @@ int mlx4_en_activate_cq(struct mlx4_en_p int cq_idx) { struct mlx4_en_dev *mdev = priv->mdev; - int err = 0; + int irq, err = 0; int timestamp_en = 0; bool assigned_eq = false; @@ -116,10 +116,8 @@ int mlx4_en_activate_cq(struct mlx4_en_p assigned_eq = true; } - - cq->irq_desc = - irq_to_desc(mlx4_eq_get_irq(mdev->dev, - cq->vector)); + irq = mlx4_eq_get_irq(mdev->dev, cq->vector); + cq->aff_mask = irq_get_affinity_mask(irq); } else { /* For TX we use the same irq per ring we assigned for the RX*/ --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -959,8 +959,6 @@ int mlx4_en_poll_rx_cq(struct napi_struc /* If we used up all the quota - we're probably not done yet... */ if (done == budget || !clean_complete) { - const struct cpumask *aff; - struct irq_data *idata; int cpu_curr; /* in case we got here because of !clean_complete */ @@ -969,10 +967,8 @@ int mlx4_en_poll_rx_cq(struct napi_struc INC_PERF_COUNTER(priv->pstats.napi_quota); cpu_curr = smp_processor_id(); - idata = irq_desc_get_irq_data(cq->irq_desc); - aff = irq_data_get_affinity_mask(idata); - if (likely(cpumask_test_cpu(cpu_curr, aff))) + if (likely(cpumask_test_cpu(cpu_curr, cq->aff_mask))) return budget; /* Current cpu is not according to smp_irq_affinity - --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h @@ -46,6 +46,7 @@ #endif #include #include +#include #include #include @@ -380,7 +381,7 @@ struct mlx4_en_cq { struct mlx4_cqe *buf; #define MLX4_EN_OPCODE_ERROR 0x1e - struct irq_desc *irq_desc; + const struct cpumask *aff_mask; }; struct mlx4_en_port_profile {
[patch 23/30] net/mlx5: Use effective interrupt affinity
Using the interrupt affinity mask for checking locality is not really working well on architectures which support effective affinity masks. The affinity mask is either the system wide default or set by user space, but the architecture can or even must reduce the mask to the effective set, which means that checking the affinity mask itself does not really tell about the actual target CPUs. Signed-off-by: Thomas Gleixner Cc: Saeed Mahameed Cc: Leon Romanovsky Cc: "David S. Miller" Cc: Jakub Kicinski Cc: netdev@vger.kernel.org Cc: linux-r...@vger.kernel.org --- drivers/net/ethernet/mellanox/mlx5/core/en_main.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -1998,7 +1998,7 @@ static int mlx5e_open_channel(struct mlx c->num_tc = params->num_tc; c->xdp = !!params->xdp_prog; c->stats= &priv->channel_stats[ix].ch; - c->aff_mask = irq_get_affinity_mask(irq); + c->aff_mask = irq_get_effective_affinity_mask(irq); c->lag_port = mlx5e_enumerate_lag_port(priv->mdev, ix); netif_napi_add(netdev, &c->napi, mlx5e_napi_poll, 64);
[patch 22/30] net/mlx5: Replace irq_to_desc() abuse
No driver has any business with the internals of an interrupt descriptor. Storing a pointer to it just to use yet another helper at the actual usage site to retrieve the affinity mask is creative at best. Just because C does not allow encapsulation does not mean that the kernel has no limits. Retrieve a pointer to the affinity mask itself and use that. It's still using an interface which is usually not for random drivers, but definitely less hideous than the previous hack. Signed-off-by: Thomas Gleixner --- drivers/net/ethernet/mellanox/mlx5/core/en.h |2 +- drivers/net/ethernet/mellanox/mlx5/core/en_main.c |2 +- drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |6 +- 3 files changed, 3 insertions(+), 7 deletions(-) --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -669,7 +669,7 @@ struct mlx5e_channel { spinlock_t async_icosq_lock; /* data path - accessed per napi poll */ - struct irq_desc *irq_desc; + const struct cpumask *aff_mask; struct mlx5e_ch_stats *stats; /* control */ --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -1998,7 +1998,7 @@ static int mlx5e_open_channel(struct mlx c->num_tc = params->num_tc; c->xdp = !!params->xdp_prog; c->stats= &priv->channel_stats[ix].ch; - c->irq_desc = irq_to_desc(irq); + c->aff_mask = irq_get_affinity_mask(irq); c->lag_port = mlx5e_enumerate_lag_port(priv->mdev, ix); netif_napi_add(netdev, &c->napi, mlx5e_napi_poll, 64); --- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c @@ -40,12 +40,8 @@ static inline bool mlx5e_channel_no_affinity_change(struct mlx5e_channel *c) { int current_cpu = smp_processor_id(); - const struct cpumask *aff; - struct irq_data *idata; - idata = irq_desc_get_irq_data(c->irq_desc); - aff = irq_data_get_affinity_mask(idata); - return cpumask_test_cpu(current_cpu, aff); + return cpumask_test_cpu(current_cpu, c->aff_mask); } static void mlx5e_handle_tx_dim(struct mlx5e_txqsq *sq)
[patch 24/30] xen/events: Remove unused bind_evtchn_to_irq_lateeoi()
Signed-off-by: Thomas Gleixner Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: xen-de...@lists.xenproject.org --- drivers/xen/events/events_base.c |6 -- 1 file changed, 6 deletions(-) --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -1132,12 +1132,6 @@ int bind_evtchn_to_irq(evtchn_port_t evt } EXPORT_SYMBOL_GPL(bind_evtchn_to_irq); -int bind_evtchn_to_irq_lateeoi(evtchn_port_t evtchn) -{ - return bind_evtchn_to_irq_chip(evtchn, &xen_lateeoi_chip); -} -EXPORT_SYMBOL_GPL(bind_evtchn_to_irq_lateeoi); - static int bind_ipi_to_irq(unsigned int ipi, unsigned int cpu) { struct evtchn_bind_ipi bind_ipi;
[patch 17/30] NTB/msi: Use irq_has_action()
Use the proper core function. Signed-off-by: Thomas Gleixner Cc: Jon Mason Cc: Dave Jiang Cc: Allen Hubbe Cc: linux-...@googlegroups.com --- drivers/ntb/msi.c |4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) --- a/drivers/ntb/msi.c +++ b/drivers/ntb/msi.c @@ -282,15 +282,13 @@ int ntbm_msi_request_threaded_irq(struct struct ntb_msi_desc *msi_desc) { struct msi_desc *entry; - struct irq_desc *desc; int ret; if (!ntb->msi) return -EINVAL; for_each_pci_msi_entry(entry, ntb->pdev) { - desc = irq_to_desc(entry->irq); - if (desc->action) + if (irq_has_action(entry->irq)) continue; ret = devm_request_threaded_irq(&ntb->dev, entry->irq, handler,
[patch 28/30] xen/events: Reduce irq_info::spurious_cnt storage size
To prepare for interrupt spreading reduce the storage size of irq_info::spurious_cnt to u8 so the required flag for the spreading logic will not increase the storage size. Protect the usage site against overruns. Signed-off-by: Thomas Gleixner Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: xen-de...@lists.xenproject.org --- drivers/xen/events/events_base.c |8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -95,7 +95,7 @@ struct irq_info { struct list_head list; struct list_head eoi_list; short refcnt; - short spurious_cnt; + u8 spurious_cnt; enum xen_irq_type type; /* type */ unsigned irq; evtchn_port_t evtchn; /* event channel */ @@ -528,8 +528,10 @@ static void xen_irq_lateeoi_locked(struc return; if (spurious) { - if ((1 << info->spurious_cnt) < (HZ << 2)) - info->spurious_cnt++; + if ((1 << info->spurious_cnt) < (HZ << 2)) { + if (info->spurious_cnt != 0xFF) + info->spurious_cnt++; + } if (info->spurious_cnt > 1) { delay = 1 << (info->spurious_cnt - 2); if (delay > HZ)
[patch 29/30] xen/events: Implement irq distribution
Keep track of the assignments of event channels to CPUs and select the online CPU with the least assigned channels in the affinity mask which is handed to irq_chip::irq_set_affinity() from the core code. Signed-off-by: Thomas Gleixner Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: xen-de...@lists.xenproject.org --- drivers/xen/events/events_base.c | 72 ++- 1 file changed, 64 insertions(+), 8 deletions(-) --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -96,6 +96,7 @@ struct irq_info { struct list_head eoi_list; short refcnt; u8 spurious_cnt; + u8 is_accounted; enum xen_irq_type type; /* type */ unsigned irq; evtchn_port_t evtchn; /* event channel */ @@ -161,6 +162,9 @@ static DEFINE_PER_CPU(int [NR_VIRQS], vi /* IRQ <-> IPI mapping */ static DEFINE_PER_CPU(int [XEN_NR_IPIS], ipi_to_irq) = {[0 ... XEN_NR_IPIS-1] = -1}; +/* Event channel distribution data */ +static atomic_t channels_on_cpu[NR_CPUS]; + static int **evtchn_to_irq; #ifdef CONFIG_X86 static unsigned long *pirq_eoi_map; @@ -257,6 +261,32 @@ static void set_info_for_irq(unsigned in irq_set_chip_data(irq, info); } +/* Per CPU channel accounting */ +static void channels_on_cpu_dec(struct irq_info *info) +{ + if (!info->is_accounted) + return; + + info->is_accounted = 0; + + if (WARN_ON_ONCE(info->cpu >= nr_cpu_ids)) + return; + + WARN_ON_ONCE(!atomic_add_unless(&channels_on_cpu[info->cpu], -1 , 0)); +} + +static void channels_on_cpu_inc(struct irq_info *info) +{ + if (WARN_ON_ONCE(info->cpu >= nr_cpu_ids)) + return; + + if (WARN_ON_ONCE(!atomic_add_unless(&channels_on_cpu[info->cpu], 1, + INT_MAX))) + return; + + info->is_accounted = 1; +} + /* Constructors for packed IRQ information. */ static int xen_irq_info_common_setup(struct irq_info *info, unsigned irq, @@ -339,6 +369,7 @@ static void xen_irq_info_cleanup(struct { set_evtchn_to_irq(info->evtchn, -1); info->evtchn = 0; + channels_on_cpu_dec(info); } /* @@ -449,7 +480,9 @@ static void bind_evtchn_to_cpu(evtchn_po xen_evtchn_port_bind_to_cpu(evtchn, cpu, info->cpu); + channels_on_cpu_dec(info); info->cpu = cpu; + channels_on_cpu_inc(info); } /** @@ -622,11 +655,6 @@ static void xen_irq_init(unsigned irq) { struct irq_info *info; -#ifdef CONFIG_SMP - /* By default all event channels notify CPU#0. */ - cpumask_copy(irq_get_affinity_mask(irq), cpumask_of(0)); -#endif - info = kzalloc(sizeof(*info), GFP_KERNEL); if (info == NULL) panic("Unable to allocate metadata for IRQ%d\n", irq); @@ -1691,10 +1719,34 @@ static int xen_rebind_evtchn_to_cpu(evtc return 0; } +/* + * Find the CPU within @dest mask which has the least number of channels + * assigned. This is not precise as the per cpu counts can be modified + * concurrently. 
+ */ +static unsigned int select_target_cpu(const struct cpumask *dest) +{ + unsigned int cpu, best_cpu = UINT_MAX, minch = UINT_MAX; + + for_each_cpu_and(cpu, dest, cpu_online_mask) { + unsigned int curch = atomic_read(&channels_on_cpu[cpu]); + + if (curch < minch) { + minch = curch; + best_cpu = cpu; + } + } + + /* If this happens accounting is screwed up */ + if (WARN_ON_ONCE(best_cpu == UINT_MAX)) + best_cpu = cpumask_first(dest); + return best_cpu; +} + static int set_affinity_irq(struct irq_data *data, const struct cpumask *dest, bool force) { - unsigned tcpu = cpumask_first_and(dest, cpu_online_mask); + unsigned int tcpu = select_target_cpu(dest); int ret; ret = xen_rebind_evtchn_to_cpu(evtchn_from_irq(data->irq), tcpu); @@ -1922,8 +1974,12 @@ void xen_irq_resume(void) xen_evtchn_resume(); /* No IRQ <-> event-channel mappings. */ - list_for_each_entry(info, &xen_irq_list_head, list) - info->evtchn = 0; /* zap event-channel binding */ + list_for_each_entry(info, &xen_irq_list_head, list) { + /* Zap event-channel binding */ + info->evtchn = 0; + /* Adjust accounting */ + channels_on_cpu_dec(info); + } clear_evtchn_to_irq_all();
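The accounting added above leans on atomic_add_unless() so the per-CPU counters can neither wrap at the top nor go below zero; roughly this pattern (sketch with a single hypothetical counter):

/* Sketch: guarded counters which neither wrap nor go negative */
#include <linux/atomic.h>
#include <linux/limits.h>

static atomic_t foo_count;

static void foo_count_inc(void)
{
	/* Saturate at INT_MAX instead of wrapping around */
	atomic_add_unless(&foo_count, 1, INT_MAX);
}

static void foo_count_dec(void)
{
	/* Never drop below zero even if inc/dec got out of sync */
	atomic_add_unless(&foo_count, -1, 0);
}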
[patch 27/30] xen/events: Only force affinity mask for percpu interrupts
All event channel setups bind the interrupt on CPU0 or the target CPU for percpu interrupts and overwrite the affinity mask with the corresponding cpumask. That does not make sense. The XEN implementation of irqchip::irq_set_affinity() already picks a single target CPU out of the affinity mask and the actual target is stored in the effective CPU mask, so destroying the user chosen affinity mask which might contain more than one CPU is wrong. Change the implementation so that the channel is bound to CPU0 at the XEN level and leave the affinity mask alone. At startup of the interrupt affinity will be assigned out of the affinity mask and the XEN binding will be updated. Only keep the enforcement for real percpu interrupts. On resume the overwrite is not required either because info->cpu and the affinity mask are still the same as at the time of suspend. Same for rebind_evtchn_irq(). This also prepares for proper interrupt spreading. Signed-off-by: Thomas Gleixner Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: xen-de...@lists.xenproject.org --- drivers/xen/events/events_base.c | 42 ++- 1 file changed, 28 insertions(+), 14 deletions(-) --- a/drivers/xen/events/events_base.c +++ b/drivers/xen/events/events_base.c @@ -433,15 +433,20 @@ static bool pirq_needs_eoi_flag(unsigned return info->u.pirq.flags & PIRQ_NEEDS_EOI; } -static void bind_evtchn_to_cpu(evtchn_port_t evtchn, unsigned int cpu) +static void bind_evtchn_to_cpu(evtchn_port_t evtchn, unsigned int cpu, + bool force_affinity) { int irq = get_evtchn_to_irq(evtchn); struct irq_info *info = info_for_irq(irq); BUG_ON(irq == -1); -#ifdef CONFIG_SMP - cpumask_copy(irq_get_affinity_mask(irq), cpumask_of(cpu)); -#endif + + if (IS_ENABLED(CONFIG_SMP) && force_affinity) { + cpumask_copy(irq_get_affinity_mask(irq), cpumask_of(cpu)); + cpumask_copy(irq_get_effective_affinity_mask(irq), +cpumask_of(cpu)); + } + xen_evtchn_port_bind_to_cpu(evtchn, cpu, info->cpu); info->cpu = cpu; @@ -788,7 +793,7 @@ static unsigned int __startup_pirq(unsig goto err; info->evtchn = evtchn; - bind_evtchn_to_cpu(evtchn, 0); + bind_evtchn_to_cpu(evtchn, 0, false); rc = xen_evtchn_port_setup(evtchn); if (rc) @@ -1107,8 +1112,8 @@ static int bind_evtchn_to_irq_chip(evtch irq = ret; goto out; } - /* New interdomain events are bound to VCPU 0. */ - bind_evtchn_to_cpu(evtchn, 0); + /* New interdomain events are initially bound to VCPU 0. */ + bind_evtchn_to_cpu(evtchn, 0, false); } else { struct irq_info *info = info_for_irq(irq); WARN_ON(info == NULL || info->type != IRQT_EVTCHN); @@ -1156,7 +1161,11 @@ static int bind_ipi_to_irq(unsigned int irq = ret; goto out; } - bind_evtchn_to_cpu(evtchn, cpu); + /* +* Force the affinity mask to the target CPU so proc shows +* the correct target. +*/ + bind_evtchn_to_cpu(evtchn, cpu, true); } else { struct irq_info *info = info_for_irq(irq); WARN_ON(info == NULL || info->type != IRQT_IPI); @@ -1269,7 +1278,11 @@ int bind_virq_to_irq(unsigned int virq, goto out; } - bind_evtchn_to_cpu(evtchn, cpu); + /* +* Force the affinity mask for percpu interrupts so proc +* shows the correct target. +*/ + bind_evtchn_to_cpu(evtchn, cpu, percpu); } else { struct irq_info *info = info_for_irq(irq); WARN_ON(info == NULL || info->type != IRQT_VIRQ); @@ -1634,8 +1647,7 @@ void rebind_evtchn_irq(evtchn_port_t evt mutex_unlock(&irq_mapping_update_lock); -bind_evtchn_to_cpu(evtchn, info->cpu); - irq_set_affinity(irq, cpumask_of(info->cpu)); + bind_evtchn_to_cpu(evtchn, info->cpu, false); /* Unmask the event channel. 
*/ enable_irq(irq); @@ -1669,7 +1681,7 @@ static int xen_rebind_evtchn_to_cpu(evtc * it, but don't do the xenlinux-level rebind in that case. */ if (HYPERVISOR_event_channel_op(EVTCHNOP_bind_vcpu, &bind_vcpu) >= 0) - bind_evtchn_to_cpu(evtchn, tcpu); + bind_evtchn_to_cpu(evtchn, tcpu, false); if (!masked) unmask_evtchn(evtchn); @@ -1798,7 +1810,8 @@ static void restore_cpu_virqs(unsigned i /* Record the new mapping. */
[patch 09/30] ARM: smp: Use irq_desc_kstat_cpu() in show_ipi_list()
The irq descriptor is already there, no need to look it up again. Signed-off-by: Thomas Gleixner Cc: Marc Zyngier Cc: Russell King Cc: linux-arm-ker...@lists.infradead.org --- arch/arm/kernel/smp.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/arm/kernel/smp.c +++ b/arch/arm/kernel/smp.c @@ -550,7 +550,7 @@ void show_ipi_list(struct seq_file *p, i seq_printf(p, "%*s%u: ", prec - 1, "IPI", i); for_each_online_cpu(cpu) - seq_printf(p, "%10u ", kstat_irqs_cpu(irq, cpu)); + seq_printf(p, "%10u ", irq_desc_kstat_cpu(ipi_desc[i], cpu)); seq_printf(p, " %s\n", ipi_types[i]); }
[patch 14/30] drm/i915/pmu: Replace open coded kstat_irqs() copy
Driver code has no business with the internals of the irq descriptor. Aside of that the count is per interrupt line and therefore takes interrupts from other devices into account which share the interrupt line and are not handled by the graphics driver. Replace it with a pmu private count which only counts interrupts which originate from the graphics card. To avoid atomics or heuristics of some sort make the counter field 'unsigned long'. That limits the count to 4e9 on 32bit which is a lot and postprocessing can easily deal with the occasional wraparound. Signed-off-by: Thomas Gleixner Cc: Tvrtko Ursulin Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: David Airlie Cc: Daniel Vetter Cc: intel-...@lists.freedesktop.org Cc: dri-de...@lists.freedesktop.org --- drivers/gpu/drm/i915/i915_irq.c | 34 ++ drivers/gpu/drm/i915/i915_pmu.c | 18 +- drivers/gpu/drm/i915/i915_pmu.h |8 3 files changed, 43 insertions(+), 17 deletions(-) --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -60,6 +60,24 @@ * and related files, but that will be described in separate chapters. */ +/* + * Interrupt statistic for PMU. Increments the counter only if the + * interrupt originated from the the GPU so interrupts from a device which + * shares the interrupt line are not accounted. + */ +static inline void pmu_irq_stats(struct drm_i915_private *priv, +irqreturn_t res) +{ + if (unlikely(res != IRQ_HANDLED)) + return; + + /* +* A clever compiler translates that into INC. A not so clever one +* should at least prevent store tearing. +*/ + WRITE_ONCE(priv->pmu.irq_count, priv->pmu.irq_count + 1); +} + typedef bool (*long_pulse_detect_func)(enum hpd_pin pin, u32 val); static const u32 hpd_ilk[HPD_NUM_PINS] = { @@ -1599,6 +1617,8 @@ static irqreturn_t valleyview_irq_handle valleyview_pipestat_irq_handler(dev_priv, pipe_stats); } while (0); + pmu_irq_stats(dev_priv, ret); + enable_rpm_wakeref_asserts(&dev_priv->runtime_pm); return ret; @@ -1676,6 +1696,8 @@ static irqreturn_t cherryview_irq_handle valleyview_pipestat_irq_handler(dev_priv, pipe_stats); } while (0); + pmu_irq_stats(dev_priv, ret); + enable_rpm_wakeref_asserts(&dev_priv->runtime_pm); return ret; @@ -2103,6 +2125,8 @@ static irqreturn_t ilk_irq_handler(int i if (sde_ier) raw_reg_write(regs, SDEIER, sde_ier); + pmu_irq_stats(i915, ret); + /* IRQs are synced during runtime_suspend, we don't require a wakeref */ enable_rpm_wakeref_asserts(&i915->runtime_pm); @@ -2419,6 +2443,8 @@ static irqreturn_t gen8_irq_handler(int gen8_master_intr_enable(regs); + pmu_irq_stats(dev_priv, IRQ_HANDLED); + return IRQ_HANDLED; } @@ -2514,6 +2540,8 @@ static __always_inline irqreturn_t gen11_gu_misc_irq_handler(gt, gu_misc_iir); + pmu_irq_stats(i915, IRQ_HANDLED); + return IRQ_HANDLED; } @@ -3688,6 +3716,8 @@ static irqreturn_t i8xx_irq_handler(int i8xx_pipestat_irq_handler(dev_priv, iir, pipe_stats); } while (0); + pmu_irq_stats(dev_priv, ret); + enable_rpm_wakeref_asserts(&dev_priv->runtime_pm); return ret; @@ -3796,6 +3826,8 @@ static irqreturn_t i915_irq_handler(int i915_pipestat_irq_handler(dev_priv, iir, pipe_stats); } while (0); + pmu_irq_stats(dev_priv, ret); + enable_rpm_wakeref_asserts(&dev_priv->runtime_pm); return ret; @@ -3941,6 +3973,8 @@ static irqreturn_t i965_irq_handler(int i965_pipestat_irq_handler(dev_priv, iir, pipe_stats); } while (0); + pmu_irq_stats(dev_priv, IRQ_HANDLED); + enable_rpm_wakeref_asserts(&dev_priv->runtime_pm); return ret; --- a/drivers/gpu/drm/i915/i915_pmu.c +++ b/drivers/gpu/drm/i915/i915_pmu.c @@ -423,22 
+423,6 @@ static enum hrtimer_restart i915_sample( return HRTIMER_RESTART; } -static u64 count_interrupts(struct drm_i915_private *i915) -{ - /* open-coded kstat_irqs() */ - struct irq_desc *desc = irq_to_desc(i915->drm.pdev->irq); - u64 sum = 0; - int cpu; - - if (!desc || !desc->kstat_irqs) - return 0; - - for_each_possible_cpu(cpu) - sum += *per_cpu_ptr(desc->kstat_irqs, cpu); - - return sum; -} - static void i915_pmu_event_destroy(struct perf_event *event) { struct drm_i915_private *i915 = @@ -581,7 +565,7 @@ static u64 __i915_pmu_event_read(struct USEC_PER_SEC /* to MHz */); break; case I915_PMU_INTERRUPTS: - val = cou
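The counting scheme boils down to a plain counter whose single writer and lockless readers are protected against store/load tearing; a minimal sketch with hypothetical names (not the i915 code itself):

/* Sketch: single-writer counter, read locklessly by the PMU side */
#include <linux/compiler.h>
#include <linux/types.h>

struct foo_pmu {
	unsigned long irq_count;
};

static void foo_irq_stats(struct foo_pmu *pmu)
{
	/* Single writer (the interrupt handler): prevent store tearing */
	WRITE_ONCE(pmu->irq_count, pmu->irq_count + 1);
}

static u64 foo_event_read(struct foo_pmu *pmu)
{
	/* Lockless reader: prevent load tearing */
	return READ_ONCE(pmu->irq_count);
}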
[patch 05/30] genirq: Annotate irq stats data races
Both the per cpu stats and the accumulated count are accessed lockless and can be concurrently modified. That's intentional and the stats are a rough estimate anyway. Annotate them with data_race(). Signed-off-by: Thomas Gleixner --- kernel/irq/irqdesc.c |4 ++-- kernel/irq/proc.c|5 +++-- 2 files changed, 5 insertions(+), 4 deletions(-) --- a/kernel/irq/irqdesc.c +++ b/kernel/irq/irqdesc.c @@ -943,10 +943,10 @@ unsigned int kstat_irqs(unsigned int irq if (!irq_settings_is_per_cpu_devid(desc) && !irq_settings_is_per_cpu(desc) && !irq_is_nmi(desc)) - return desc->tot_count; + return data_race(desc->tot_count); for_each_possible_cpu(cpu) - sum += *per_cpu_ptr(desc->kstat_irqs, cpu); + sum += data_race(*per_cpu_ptr(desc->kstat_irqs, cpu)); return sum; } --- a/kernel/irq/proc.c +++ b/kernel/irq/proc.c @@ -488,9 +488,10 @@ int show_interrupts(struct seq_file *p, if (!desc || irq_settings_is_hidden(desc)) goto outsparse; - if (desc->kstat_irqs) + if (desc->kstat_irqs) { for_each_online_cpu(j) - any_count |= *per_cpu_ptr(desc->kstat_irqs, j); + any_count |= data_race(*per_cpu_ptr(desc->kstat_irqs, j)); + } if ((!desc->action || irq_desc_is_chained(desc)) && !any_count) goto outsparse;
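data_race() merely documents that the lockless access is intentional so KCSAN does not flag it; it adds no ordering or atomicity. A minimal sketch with a hypothetical counter:

/* Sketch: intentional lockless read of a rough statistics counter */
#include <linux/compiler.h>

static unsigned int foo_stat;

static unsigned int foo_stat_snapshot(void)
{
	/*
	 * Concurrent updates may be missed in the snapshot. That is fine
	 * for a rough estimate, so tell KCSAN the race is intentional.
	 */
	return data_race(foo_stat);
}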
[patch 04/30] genirq: Provide irq_get_effective_affinity()
Provide an accessor to the effective interrupt affinity mask. Going to be used to replace open coded fiddling with the irq descriptor. Signed-off-by: Thomas Gleixner --- include/linux/irq.h |7 +++ 1 file changed, 7 insertions(+) --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -907,6 +907,13 @@ struct cpumask *irq_data_get_effective_a } #endif +static inline struct cpumask *irq_get_effective_affinity_mask(unsigned int irq) +{ + struct irq_data *d = irq_get_irq_data(irq); + + return d ? irq_data_get_effective_affinity_mask(d) : NULL; +} + unsigned int arch_dynirq_lower_bound(unsigned int from); int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
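A minimal sketch of the intended driver side usage (hypothetical names, not from this series), matching the mlx4/mlx5 conversions above: cache the mask pointer once at setup and test locality in the hot path. On architectures which reduce the mask, the effective variant reflects the real target CPUs:

/* Sketch: cache the mask once, test locality in the NAPI poll path */
#include <linux/cpumask.h>
#include <linux/irq.h>
#include <linux/smp.h>

struct foo_channel {
	const struct cpumask *aff_mask;
};

static void foo_channel_setup(struct foo_channel *c, unsigned int irq)
{
	/* The effective mask reflects the real target CPU(s) */
	c->aff_mask = irq_get_effective_affinity_mask(irq);
}

static bool foo_irq_is_local(struct foo_channel *c)
{
	return cpumask_test_cpu(smp_processor_id(), c->aff_mask);
}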
[patch 00/30] genirq: Treewide hunt for irq descriptor abuse and assorted fixes
A recent request to export kstat_irqs() pointed to a copy of the same in the i915 code, which made me look for further usage of irq descriptors in drivers. The usage in drivers ranges from creative to broken in all colours. irqdesc.h clearly says that this is core functionality and the fact C does not allow full encapsulation is not a justification to fiddle with it just because. It took us a lot of effort to make the core functionality provide what drivers need. If there is a shortcoming, it's not asked too much to talk to the relevant maintainers instead of going off and fiddling with the guts of interrupt descriptors and often enough without understanding lifetime and locking rules. As people insist on not respecting boundaries, this series cleans up the (ab)use and at the end removes the export of irq_to_desc() to make it at least harder. All legitimate users of this are built in. While at it I stumbled over some other oddities related to interrupt counting and cleaned them up as well. The series applies on top of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core and is also available from git: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git genirq Thanks, tglx --- arch/alpha/kernel/sys_jensen.c |2 arch/arm/kernel/smp.c|2 arch/parisc/kernel/irq.c |7 arch/s390/kernel/irq.c |2 arch/x86/kernel/topology.c |1 arch/arm64/kernel/smp.c |2 drivers/gpu/drm/i915/display/intel_lpe_audio.c |4 drivers/gpu/drm/i915/i915_irq.c | 34 +++ drivers/gpu/drm/i915/i915_pmu.c | 18 - drivers/gpu/drm/i915/i915_pmu.h |8 drivers/mfd/ab8500-debugfs.c | 16 - drivers/net/ethernet/mellanox/mlx4/en_cq.c |8 drivers/net/ethernet/mellanox/mlx4/en_rx.c |6 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |3 drivers/net/ethernet/mellanox/mlx5/core/en.h |2 drivers/net/ethernet/mellanox/mlx5/core/en_main.c|2 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c|6 drivers/ntb/msi.c|4 drivers/pci/controller/mobiveil/pcie-mobiveil-host.c |8 drivers/pci/controller/pcie-xilinx-nwl.c |8 drivers/pinctrl/nomadik/pinctrl-nomadik.c|3 drivers/xen/events/events_base.c | 172 +++ drivers/xen/evtchn.c | 34 --- include/linux/interrupt.h|1 include/linux/irq.h |7 include/linux/irqdesc.h | 40 +--- include/linux/kernel_stat.h |1 kernel/irq/irqdesc.c | 42 ++-- kernel/irq/manage.c | 37 kernel/irq/proc.c|5 30 files changed, 263 insertions(+), 222 deletions(-)
Re: [PATCH v5 0/9] "Task_isolation" mode
Pavel, On Sat, Dec 05 2020 at 21:40, Pavel Machek wrote: > So... what kind of guarantees does this aim to provide / what tasks it > is useful for? > > For real time response, we have other approaches. Depends on your requirements. Some problems are actually better solved with busy polling. See below. > If you want to guarantee performnace of the "isolated" task... I don't > see how that works. Other tasks on the system still compete for DRAM > bandwidth, caches, etc... Applications which want to run as undisturbed as possible. There is quite a range of those: - Hardware in the loop simulation is today often done with that crude approach of "offlining" a CPU and then instead of playing dead jumping to a preloaded bare metal executable. That's a horrible hack and impossible to debug, but gives them the results they need to achieve. These applications are well optimized vs. cache and memory foot print, so they don't worry about these things too much and they surely don't run on SMI and BIOS value add inflicted machines. Don't even think about waiting for an interrupt to achieve what these folks are doing. So no, there are problems which a general purpose realtime OS cannot solve ever. - HPC computations on large data sets. While the memory foot print is large the access patterns are cache optimized. The problem there is that any unnecessary IPI, tick interrupt or whatever nuisance is disturbing the carefully optimized cache usage and alone getting rid of the timer interrupt gained them measurable performance. Even very low single digit percentage of runtime saving is valuable for these folks because the compute time on such beasts is expensive. - Realtime guests in KVM. With posted interrupts and a fully populated host side page table there is no point in running host side interrupts or IPIs for random accounting or whatever purposes as they affect the latency in the guest. With all the side effects mitigated and a properly set up guest and host it is possible to get to a zero exit situation after the bootup phase which means pretty much matching bare metal behaviour. Yes, you can do that with e.g. Jailhouse as well, but you lose lots of the fancy things KVM provides. And people care about these not just because they are fancy. They care because their application scenario needs them. There are more reasons why people want to be able to get as much isolation from the OS as possible but at the same time have a sane execution environment, debugging, performance monitoring and the OS provided protection mechanisms instead of horrible hacks. Isolation makes sense for a range of applications and there is no reason why Linux should not support them. Thanks, tglx
Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()
On Mon, Nov 23 2020 at 17:58, Alex Belits wrote: > From: Yuri Norov > > For nohz_full CPUs the desirable behavior is to receive interrupts > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's > obviously not desirable because it breaks isolation. > > This patch adds check for it. git grep 'This patch' Documentation/process/ > */ > void tick_nohz_full_kick_cpu(int cpu) > { > - if (!tick_nohz_full_cpu(cpu)) > + smp_rmb(); Undocumented smp_rmb() ... > + if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu)) > return; I still have to see a convincing argument why task isolation is special and not just a straight forward extension of NOHZ full cpu isolation. It's not special as much as you want it to be special. Thanks, tglx
Re: [PATCH v5 4/9] task_isolation: Add task isolation hooks to arch-independent code
Alex, On Mon, Nov 23 2020 at 17:57, Alex Belits wrote: > Kernel entry and exit functions for task isolation are added to context > tracking and common entry points. Common handling of pending work on exit > to userspace now processes isolation breaking, cleanup and start. Again: You fail to explain the rationale and just explain what the patch is doing. I can see what the patch is doing from the patch itself. > --- > include/linux/hardirq.h | 2 ++ > include/linux/sched.h | 2 ++ > kernel/context_tracking.c | 5 + > kernel/entry/common.c | 10 +- > kernel/irq/irqdesc.c | 5 + At least 3 different subsystems, which means this again failed to be split into separate patches. > extern void synchronize_irq(unsigned int irq); > @@ -115,6 +116,7 @@ extern void rcu_nmi_exit(void); > do {\ > lockdep_off(); \ > arch_nmi_enter(); \ > + task_isolation_kernel_enter(); \ Where is the explanation why this is safe and correct vs. this fragile code path? > @@ -1762,6 +1763,7 @@ extern char *__get_task_comm(char *to, size_t len, > struct task_struct *tsk); > #ifdef CONFIG_SMP > static __always_inline void scheduler_ipi(void) > { > + task_isolation_kernel_enter(); Why is the scheduler_ipi() special? Just because everything else cannot happen at all? Oh well... > #define CREATE_TRACE_POINTS > #include > @@ -100,6 +101,8 @@ void noinstr __context_tracking_enter(enum ctx_state > state) > __this_cpu_write(context_tracking.state, state); > } > context_tracking_recursion_exit(); > + > + task_isolation_exit_to_user_mode(); Why is this here at all and why is it outside of the recursion protection > } > EXPORT_SYMBOL_GPL(__context_tracking_enter); > > @@ -148,6 +151,8 @@ void noinstr __context_tracking_exit(enum ctx_state state) > if (!context_tracking_recursion_enter()) > return; > > + task_isolation_kernel_enter(); while this is inside? And why does the scheduler_ipi() on x86 call this twice? Just because? > if (__this_cpu_read(context_tracking.state) == state) { > if (__this_cpu_read(context_tracking.active)) { > /* > diff --git a/kernel/entry/common.c b/kernel/entry/common.c > static void exit_to_user_mode_prepare(struct pt_regs *regs) > { > - unsigned long ti_work = READ_ONCE(current_thread_info()->flags); > + unsigned long ti_work; > > lockdep_assert_irqs_disabled(); > > + task_isolation_before_pending_work_check(); > + > + ti_work = READ_ONCE(current_thread_info()->flags); > + > if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK)) > ti_work = exit_to_user_mode_loop(regs, ti_work); > > + if (unlikely(ti_work & _TIF_TASK_ISOLATION)) > + task_isolation_start(); Where is the explanation of this change? Aside of that how does anything of this compile on x86 at all? Answer: It does not ... Stop this frenzy right now. It's going nowhere and all you achieve is to make people more grumpy than they are already. Thanks, tglx
Re: [PATCH v5 3/9] task_isolation: userspace hard isolation from kernel
Alex, On Mon, Nov 23 2020 at 17:56, Alex Belits wrote: > .../admin-guide/kernel-parameters.txt | 6 + > drivers/base/cpu.c| 23 + > include/linux/hrtimer.h | 4 + > include/linux/isolation.h | 326 > include/linux/sched.h | 5 + > include/linux/tick.h | 3 + > include/uapi/linux/prctl.h| 6 + > init/Kconfig | 27 + > kernel/Makefile | 2 + > kernel/isolation.c| 714 ++ > kernel/signal.c | 2 + > kernel/sys.c | 6 + > kernel/time/hrtimer.c | 27 + > kernel/time/tick-sched.c | 18 + I asked you before to split this up into bits and pieces and argue and justify each change. Throwing this wholesale over the fence is going nowhere. It's not reviewable at all. Aside of that ignoring review comments is a sure path to make yourself ignored: > +/* > + * Logging > + */ > +int task_isolation_message(int cpu, int level, bool supp, const char *fmt, > ...); > + > +#define pr_task_isol_emerg(cpu, fmt, ...)\ > + task_isolation_message(cpu, LOGLEVEL_EMERG, false, fmt, ##__VA_ARGS__) The comments various people made about that are not going away and none of this is going near anything I'm responsible for unless you provide these independent of the rest and with a reasonable justification why you can't use any other existing mechanism or extend it for your use case. Thanks, tglx
Re: [PATCH v5 2/9] task_isolation: vmstat: add vmstat_idle function
On Mon, Nov 23 2020 at 17:56, Alex Belits wrote: > This function checks to see if a vmstat worker is not running, > and the vmstat diffs don't require an update. The function is > called from the task-isolation code to see if we need to > actually do some work to quiet vmstat. A changelog has to explain _WHY_ this change is necessary and not _WHAT_ the patch is doing. Thanks, tglx
Re: [PATCH v5 1/9] task_isolation: vmstat: add quiet_vmstat_sync function
Alex, On Mon, Nov 23 2020 at 17:56, Alex Belits wrote: why are you insisting on adding 'task_isolation: ' as prefix to every single patch? That's wrong as I explained before. The prefix denotes the affected subsystem and 'task_isolation' is _NOT_ a subsystem. It's the project name you are using but the affected code belongs to the memory management subsystem and if you run git log mm/vmstat.c you might get a hint what the proper prefix is, i.e. 'mm/vmstat: ' > In commit f01f17d3705b ("mm, vmstat: make quiet_vmstat lighter") > the quiet_vmstat() function became asynchronous, in the sense that > the vmstat work was still scheduled to run on the core when the > function returned. For task isolation, we need a synchronous This changelog is useless because how should someone not familiar with the term 'task isolation' figure out what that means? It's not the reviewers job to figure that out. Again: Go read and adhere to Documentation/process/* Aside of that your patches are CR/LF inflicted. Please fix your work flow and tools. Thanks, tglx
Re: [PATCH RFC PKS/PMEM 05/58] kmap: Introduce k[un]map_thread
On Mon, Nov 09 2020 at 20:59, Ira Weiny wrote: > On Tue, Nov 10, 2020 at 02:13:56AM +0100, Thomas Gleixner wrote: > Also, we can convert the new memcpy_*_page() calls to kmap_local() as well. > [For now my patch just uses kmap_atomic().] > > I've not looked at all of the patches in your latest version. Have you > included converting any of the kmap() call sites? I thought you were more > focused on converting the kmap_atomic() to kmap_local()? I did not touch any of those yet, but it's a logical consequence to convert all kmap() instances which are _not_ creating a global mapping over to it. Thanks, tglx
Re: [PATCH RFC PKS/PMEM 05/58] kmap: Introduce k[un]map_thread
Ira, On Fri, Oct 09 2020 at 12:49, ira weiny wrote: > From: Ira Weiny > > To correctly support the semantics of kmap() with Kernel protection keys > (PKS), kmap() may be required to set the protections on multiple > processors (globally). Enabling PKS globally can be very expensive > depending on the requested operation. Furthermore, enabling a domain > globally reduces the protection afforded by PKS. > > Most kmap() (Aprox 209 of 229) callers use the map within a single thread and > have no need for the protection domain to be enabled globally. However, the > remaining callers do not follow this pattern and, as best I can tell, expect > the mapping to be 'global' and available to any thread who may access the > mapping.[1] > > We don't anticipate global mappings to pmem, however in general there is a > danger in changing the semantics of kmap(). Effectively, this would cause an > unresolved page fault with little to no information about why the failure > occurred. > > To resolve this a number of options were considered. > > 1) Attempt to change all the thread local kmap() calls to kmap_atomic()[2] > 2) Introduce a flags parameter to kmap() to indicate if the mapping should be >global or not > 3) Change ~20 call sites to 'kmap_global()' to indicate that they require a >global enablement of the pages. > 4) Change ~209 call sites to 'kmap_thread()' to indicate that the mapping is > to >be used within that thread of execution only > > Option 1 is simply not feasible. Option 2 would require all of the call sites > of kmap() to change. Option 3 seems like a good minimal change but there is a > danger that new code may miss the semantic change of kmap() and not get the > behavior the developer intended. Therefore, #4 was chosen. There is Option #5: Convert the thread local kmap() invocations to the proposed kmap_local() interface which is coming along [1]. That solves a couple of issues: 1) It relieves the current kmap_atomic() usage sites from the implict pagefault/preempt disable semantics which apply even when CONFIG_HIGHMEM is disabled. kmap_local() still can be invoked from atomic context. 2) Due to #1 it allows to replace the conditional usage of kmap() and kmap_atomic() for purely thread local mappings. 3) It puts the burden on the HIGHMEM inflicted systems 4) It is actually more efficient for most of the pure thread local use cases on HIGHMEM inflicted systems because it avoids the overhead of the global lock and the potential kmap slot exhaustion. A potential preemption will be more expensive, but that's not really the case we want to optimize for. 5) It solves the RT issue vs. kmap_atomic() So instead of creating yet another variety of kmap() which is just scratching the particular PKRS itch, can we please consolidate all of that on the wider reaching kmap_local() approach? Thanks, tglx [1] https://lore.kernel.org/lkml/20201103092712.714480...@linutronix.de/
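For reference, a minimal sketch of what such a thread local conversion looks like, assuming the kmap_local_page()/kunmap_local() interface from the referenced series. The mapping is strictly thread local, can nest and does not disable pagefaults or preemption by itself:

/* Sketch: strictly thread local temporary mapping of a page */
#include <linux/highmem.h>
#include <linux/string.h>

static void foo_copy_from_page(void *dst, struct page *page,
			       size_t offset, size_t len)
{
	char *vaddr = kmap_local_page(page);

	memcpy(dst, vaddr + offset, len);
	kunmap_local(vaddr);
}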
Re: [PATCH net] r8169: fix operation under forced interrupt threading
On Thu, Oct 29 2020 at 11:19, Heiner Kallweit wrote: > On 29.10.2020 10:42, Thomas Gleixner wrote: > Correct, just that the legacy PCI interrupt scenario doesn't affect old > systems/devices only. Users may run the system with nomsi for > whatever reason and we need to be prepared. > > We could add handling for (pcidev->msi_enabled || pcidev->msix_enabled), > but this would look somewhat hacky to me. Well, there are quite some drivers which differentiate between MSI and legacy interrupts, most of them because MSI allows them to split handlers. So it's not completely insane. Thanks, tglx
Re: [PATCH net] r8169: fix operation under forced interrupt threading
On Thu, Oct 29 2020 at 09:42, Heiner Kallweit wrote: > On 29.10.2020 00:29, Jakub Kicinski wrote: >> Other handles may take spin_locks, which will sleep on RT. >> >> I guess we may need to switch away from the _irqoff() variant for >> drivers with IRQF_SHARED after all :( >> > Right. Unfortunately that's a large number of drivers, > e.g. pci_request_irq() sets IRQF_SHARED in general. IRQF_SHARED is not the problem. It only becomes a problem when the interrupt is actually shared which is only the case with the legacy PCI interrupt. MSI[X] is not affected at all. > But at least for now there doesn't seem to be a better way to deal > with the challenges imposed by forced threading and shared irqs. We still can do the static key trick, though it's admittedly hacky. Thanks, tglx
Re: [PATCH net] r8169: fix operation under forced interrupt threading
On Wed, Oct 28 2020 at 13:17, Heiner Kallweit wrote: > On 28.10.2020 12:43, Serge Belyshev wrote: >>> For several network drivers it was reported that using >>> __napi_schedule_irqoff() is unsafe with forced threading. One way to >>> fix this is switching back to __napi_schedule, but then we lose the >>> benefit of the irqoff version in general. As stated by Eric it doesn't >>> make sense to make the minimal hard irq handlers in drivers using NAPI >>> a thread. Therefore ensure that the hard irq handler is never >>> thread-ified. >> Hi! This patch actually breaks r8169 with threadirqs on an old box >> where it was working before: >> >> [0.00] DMI: Gigabyte Technology Co., Ltd. >> GA-MA790FX-DQ6/GA-MA790FX-DQ6, BIOS F7g 07/19/2010 >> ... >> [1.072676] r8169 :02:00.0 eth0: RTL8168b/8111b, 00:1a:4d:5d:6b:c3, >> XID 380, IRQ 18 >> ... >> [8.850099] genirq: Flags mismatch irq 18. 00010080 (eth0) vs. 2080 >> (ahci[:05:00.0]) >> >> (error is reported to userspace, interface failed to bring up). >> Reverting the patch fixes the problem. >> > Thanks for the report. On this old chip version MSI is unreliable, > therefore r8169 falls back to a PCI legacy interrupt. On your system > this PCI legacy interrupt seems to be shared between network and > disk. Then the IRQ core tries to threadify the disk interrupt > (setting IRQF_ONESHOT), whilst the network interrupt doesn't have > this flag set. This results in the flag mismatch error. > > Maybe, if one source of a shared interrupt doesn't allow forced > threading, this should be applied to the other sources too. > But this would require a change in the IRQ core, therefore > +Thomas to get his opinion on the issue. It's pretty simple. There is no way to fix that at the core level. Shared interrupts suck and to make them work halfway correctly the sharing devices must have matching and non-competing flags. Especially for the threaded vs. non-threaded case. Shared interrupts are level triggered. So you have a conflict of interests: The threaded handler requires that the interrupt line is masked until the thread has completed, otherwise the system will suffer an interrupt storm. The non-threaded one wants it to be unmasked after it has finished. Thanks, tglx
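In terms of request_irq() that means both users of the shared line have to agree on the threading model. A minimal sketch (hypothetical drivers) of the variant which works for a shared, level triggered line with threaded handlers, i.e. both sides requesting with IRQF_SHARED | IRQF_ONESHOT:

/* Sketch: both sharers of a level triggered line use the same model,
 * i.e. threaded handlers with IRQF_ONESHOT, so the line stays masked
 * until the thread which handled it has finished.
 */
#include <linux/interrupt.h>

static irqreturn_t foo_thread_fn(int irq, void *dev_id)
{
	/* Runs in task context with the interrupt line still masked */
	return IRQ_HANDLED;
}

static int foo_request_shared(unsigned int irq, void *dev_a, void *dev_b)
{
	int ret;

	ret = request_threaded_irq(irq, NULL, foo_thread_fn,
				   IRQF_SHARED | IRQF_ONESHOT, "foo-a", dev_a);
	if (ret)
		return ret;

	/* Same sharing relevant flags, so this request succeeds as well */
	return request_threaded_irq(irq, NULL, foo_thread_fn,
				    IRQF_SHARED | IRQF_ONESHOT, "foo-b", dev_b);
}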
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 16:08, Jacob Keller wrote: > On 10/26/2020 3:49 PM, Thomas Gleixner wrote: >> On Mon, Oct 26 2020 at 18:22, Nitesh Narayan Lal wrote: >>> I don't think there is currently a way to control the >>> enablement/disablement of >>> interrupts from the userspace. >> >> You cannot just disable the interrupt. You need to make sure that the >> associated queue is shutdown or quiesced _before_ the interrupt is shut >> down. > > Could this be handled with a callback to the driver/hw? I know Intel HW > should support this type of quiesce/shutdown. We can't have a callback from the interrupt shutdown code as you have to wait for the queue to drain packets in flight. Something like this:

   mark queue as going down (no more tx queueing)
   tell hardware not to route RX packets to it
   consume pending RX
   wait for already queued TX packets to be sent

Look what the block people did. They have a common multi-instance hotplug state and they register each context (queue) as an instance. The hotplug core invokes the corresponding callbacks when bringing a CPU up or when shutting it down. Thanks, tglx
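Roughly what the block layer does with the multi-instance hotplug state, as a sketch with a hypothetical queue type (not the actual blk-mq code):

/* Sketch: register each queue as a hotplug instance so the core calls
 * back before the CPU (and with it a managed interrupt) goes away.
 */
#include <linux/cpuhotplug.h>
#include <linux/list.h>

struct foo_queue {
	struct hlist_node	cpuhp_node;
	/* queue state ... */
};

static enum cpuhp_state foo_cpuhp_state;

static int foo_queue_offline(unsigned int cpu, struct hlist_node *node)
{
	struct foo_queue *q = hlist_entry(node, struct foo_queue, cpuhp_node);

	/* Stop queueing, drain in flight work, quiesce the queue */
	(void)q;
	return 0;
}

static int foo_cpuhp_init(void)
{
	int ret;

	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "foo/queue:online",
				      NULL, foo_queue_offline);
	if (ret < 0)
		return ret;
	foo_cpuhp_state = ret;
	return 0;
}

static int foo_queue_register(struct foo_queue *q)
{
	return cpuhp_state_add_instance(foo_cpuhp_state, &q->cpuhp_node);
}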
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Tue, Oct 27 2020 at 08:47, Marcelo Tosatti wrote: > On Mon, Oct 26, 2020 at 06:22:29PM -0400, Nitesh Narayan Lal wrote: > However, if per-CPU interrupts are not disabled, then the (for example) > network device is free to include the CPU in its list of destinations. > Which would require one to say, configure RPS (or whatever mechanism > is distributing interrupts). And why is that a problem? If that's possible then you can prevent getting RX interrupts already today. > Hum, it would feel safer (rather than trust the #1 rule to be valid > in all cases) to ask the driver to disable the interrupt (after shutting > down queue) for that particular CPU. > > BTW, Thomas, software is free to configure a particular MSI-X interrupt > to point to any CPU: > > 10.11 MESSAGE SIGNALLED INTERRUPTS I know how MSI works :) > So taking the example where computation happens while isolated and later > stored via block interface, aren't we restricting the usage scenarios > by enforcing the "per-CPU queue has interrupt pointing to owner CPU" > rule? No. For block this is the ideal configuration (think locality) and it prevents vector exhaustion. If you make these interrupts freely routable then you bring back the vector exhaustion problem right away. Now we already established that networking has different requirements, so you have to come up with a different solution for it which allows to work for all use cases. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 18:22, Nitesh Narayan Lal wrote: > On 10/26/20 5:50 PM, Thomas Gleixner wrote: >> But I still think that for curing that isolation stuff we want at least >> some information from the driver. Alternative solution would be to grant >> the allocation of interrupts and queues and have some sysfs knob to shut >> down queues at runtime. If that shutdown results in releasing the queue >> interrupt (via free_irq()) then the vector exhaustion problem goes away. > > I think this is close to what I and Marcelo were discussing earlier today > privately. > > I don't think there is currently a way to control the enablement/disablement > of > interrupts from the userspace. You cannot just disable the interrupt. You need to make sure that the associated queue is shutdown or quiesced _before_ the interrupt is shut down. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 15:13, Jakub Kicinski wrote: > On Mon, 26 Oct 2020 22:50:45 +0100 Thomas Gleixner wrote: >> But I still think that for curing that isolation stuff we want at least >> some information from the driver. Alternative solution would be to grant >> the allocation of interrupts and queues and have some sysfs knob to shut >> down queues at runtime. If that shutdown results in releasing the queue >> interrupt (via free_irq()) then the vector exhaustion problem goes away. >> >> Needs more thought and information (for network oblivious folks like >> me). > > One piece of information that may be useful is that even tho the RX > packets may be spread semi-randomly the user space can still control > which queues are included in the mechanism. There is an indirection > table in the HW which allows to weigh queues differently, or exclude > selected queues from the spreading. Other mechanisms exist to filter > flows onto specific queues. > > IOW just because a core has an queue/interrupt does not mean that > interrupt will ever fire, provided its excluded from RSS. > > Another piece is that by default we suggest drivers allocate 8 RX > queues, and online_cpus TX queues. The number of queues can be > independently controlled via ethtool -L. Drivers which can't support > separate queues will default to online_cpus queue pairs, and let > ethtool -L only set the "combined" parameter. > > There are drivers which always allocate online_cpus interrupts, > and then some of them will go unused if #qs < #cpus. Thanks for the enlightenment. > My unpopular opinion is that for networking devices all the heuristics > we may come up with are going to be a dead end. I agree. Heuristics suck. > We need an explicit API to allow users placing queues on cores, and > use managed IRQs for data queues. (I'm assuming that managed IRQs will > let us reliably map a MSI-X vector to a core :)) Yes, they allow you to do that. That will need some tweaks to the way they work today (coming from the strict block mq semantics). You also need to be aware that managed irqs also have strict semantics vs. CPU hotplug. If the last CPU in the managed affinity set goes down then the interrupt is shut down by the irq core, which means that you need to quiesce the associated queue before that happens. When the first CPU comes online again the interrupt is reenabled, so the queue should be able to handle it or must have ensured that the device does not raise one before it is able to do so. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote: > On 10/26/2020 1:11 PM, Thomas Gleixner wrote: >> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote: >>> Are there drivers which use more than one interrupt per queue? I know >>> drivers have multiple management interrupts.. and I guess some drivers >>> do combined 1 interrupt per pair of Tx/Rx.. It's also plausible to >>> have multiple queues for one interrupt .. I'm not sure how a single >>> queue with multiple interrupts would work though. >> >> For block there is always one interrupt per queue. Some network drivers >> seem to have separate RX and TX interrupts per queue. > That's true when thinking of Tx and Rx as a single queue. Another way to > think about it is "one rx queue" and "one tx queue" each with their own > interrupt... > > Even if there are devices which force there to be exactly queue pairs, > you could still think of them as separate entities? Interesting thought. But as Jakub explained, networking queues are fundamentally different from block queues on the RX side. For block, the request issued on queue X will raise the completion interrupt on queue X. For networking, the TX side will raise the TX interrupt on the queue on which the packet was queued, obviously, or should I say hopefully. :) But incoming packets will be directed to some receive queue based on a hash or whatever crystal ball logic the firmware decided to implement. Which makes this not really suitable for the managed interrupt and spreading approach which is used by block-mq. Hrm... But I still think that for curing that isolation stuff we want at least some information from the driver. Alternative solution would be to grant the allocation of interrupts and queues and have some sysfs knob to shut down queues at runtime. If that shutdown results in releasing the queue interrupt (via free_irq()) then the vector exhaustion problem goes away. Needs more thought and information (for network oblivious folks like me). Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote: > On 10/26/2020 12:00 PM, Thomas Gleixner wrote: >> How does userspace know about the driver internals? Number of management >> interrupts, optimal number of interrupts per queue? > > I guess this is the problem solved in part by the queue management work > that would make queues a thing that userspace is aware of. > > Are there drivers which use more than one interrupt per queue? I know > drivers have multiple management interrupts.. and I guess some drivers > do combined 1 interrupt per pair of Tx/Rx.. It's also plausible to > have multiple queues for one interrupt .. I'm not sure how a single > queue with multiple interrupts would work though. For block there is always one interrupt per queue. Some network drivers seem to have separate RX and TX interrupts per queue. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 14:30, Marcelo Tosatti wrote: > On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote: >> So without information from the driver which tells what the best number >> of interrupts is with a reduced number of CPUs, this cutoff will cause >> more problems than it solves. Regressions guaranteed. > > One might want to move from one interrupt per isolated app core > to zero, or vice versa. It seems that "best number of interrupts > is with reduced number of CPUs" information, is therefore in userspace, > not in driver... How does userspace know about the driver internals? Number of management interrupts, optimal number of interrupts per queue? >> Managed interrupts base their interrupt allocation and spreading on >> information which is handed in by the individual driver and not on crude >> assumptions. They are not imposing restrictions on the use case. >> >> It's perfectly fine for isolated work to save a data set to disk after >> computation has finished and that just works with the per-cpu I/O queue >> which is otherwise completely silent. > > Userspace could only change the mask of interrupts which are not > triggered by requests from the local CPU (admin, error, mgmt, etc), > to avoid the vector exhaustion problem. > > However, there is no explicit way for userspace to know that, as far as > i know. >
> 130:  34845      0      0      0      0      0      0      0  IR-PCI-MSI 33554433-edge  nvme0q1
> 131:      0  27062      0      0      0      0      0      0  IR-PCI-MSI 33554434-edge  nvme0q2
> 132:      0      0  24393      0      0      0      0      0  IR-PCI-MSI 33554435-edge  nvme0q3
> 133:      0      0      0  24313      0      0      0      0  IR-PCI-MSI 33554436-edge  nvme0q4
> 134:      0      0      0      0  20608      0      0      0  IR-PCI-MSI 33554437-edge  nvme0q5
> 135:      0      0      0      0      0  22163      0      0  IR-PCI-MSI 33554438-edge  nvme0q6
> 136:      0      0      0      0      0      0  23020      0  IR-PCI-MSI 33554439-edge  nvme0q7
> 137:      0      0      0      0      0      0      0  24285  IR-PCI-MSI 33554440-edge  nvme0q8
> > Can that be retrieved from PCI-MSI information, or drivers > have to inform this? The driver should use a different name for the admin queues. Thanks, tglx
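To illustrate the naming point: as long as the per-queue vectors carry a recognizable name (nvme0q1 ... nvme0q8 above) and the admin/management vectors get distinct names, user space can already separate the two classes by parsing /proc/interrupts. A rough, runnable sketch; the classification heuristic is purely illustrative:

	/* Rough illustration only: split device vectors by their action name. */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		FILE *f = fopen("/proc/interrupts", "r");
		char line[4096];

		if (!f) {
			perror("/proc/interrupts");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			char *name = strstr(line, "nvme");	/* device prefix to look for */
			unsigned int q;

			if (!name)
				continue;
			/* Heuristic: "<dev><N>q<M>" looks like a per-queue vector. */
			if (sscanf(name, "nvme%*uq%u", &q) == 1)
				printf("queue vector:       %s", line);
			else
				printf("admin/other vector: %s", line);
		}
		fclose(f);
		return 0;
	}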
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Oct 26 2020 at 09:35, Nitesh Narayan Lal wrote: > On 10/23/20 5:00 PM, Thomas Gleixner wrote: >> An isolated setup, which I'm familiar with, has two housekeeping >> CPUs. So far I restricted the number of network queues with a module >> argument to two, which allocates two management interrupts for the >> device and two interrupts (RX/TX) per queue, i.e. a total of six. > > Does it somehow take num_online_cpus() into consideration while deciding > the number of interrupts to create? No, I just tell it to create two queues :) >> So without information from the driver which tells what the best number >> of interrupts is with a reduced number of CPUs, this cutoff will cause >> more problems than it solves. Regressions guaranteed. > > Indeed. > I think one commonality among the drivers at the moment is the usage of > num_online_cpus() to determine the vectors to create. > > So, maybe instead of doing this kind of restrictions in a generic level > API, it will make more sense to do this on a per-device basis by replacing > the number of online CPUs with the housekeeping CPUs? > > This is what I have done in the i40e patch. > But that still sounds hackish and will impact the performance. You want an interface which allows the driver to say: I need N interrupts for general management and ideally M interrupts per queue. This is similar to the way drivers tell the core code about their requirements for managed interrupts for the spreading calculation. >> Managed interrupts base their interrupt allocation and spreading on >> information which is handed in by the individual driver and not on crude >> assumptions. They are not imposing restrictions on the use case. > > Right, FWIU it is irq_do_set_affinity that prevents the spreading of > managed interrupts to isolated CPUs if HK_FLAG_MANAGED_IRQ is enabled, > isn't? No. Spreading takes possible CPUs into account. HK_FLAG_MANAGED_IRQ does not influence spreading at all. It only handles the case where an interrupt is affine to more than one CPU and the resulting affinity mask spans both housekeeping and isolated CPUs. It then steers the interrupt to the housekeeping CPUs (as long as there is one online). Thanks, tglx
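A toy model of that steering behaviour (not the kernel's cpumask code): when the affinity mask spans both sets, the housekeeping subset wins; the full mask is only used if no housekeeping CPU is in it. The real code additionally requires a housekeeping CPU to be online:

	/* Toy model of "steer to housekeeping CPUs when the mask spans both". */
	#include <stdint.h>
	#include <stdio.h>

	static uint64_t pick_effective_mask(uint64_t affinity, uint64_t housekeeping)
	{
		uint64_t hk = affinity & housekeeping;

		/* Use the housekeeping subset if there is one, else the full mask. */
		return hk ? hk : affinity;
	}

	int main(void)
	{
		uint64_t housekeeping = 0x3;	/* CPUs 0-1 are housekeeping      */
		uint64_t affinity = 0xf;	/* interrupt allowed on CPUs 0-3  */

		printf("effective mask: 0x%llx\n",
		       (unsigned long long)pick_effective_mask(affinity, housekeeping));
		/* Prints 0x3: the isolated CPUs 2-3 are not used for this interrupt. */
		return 0;
	}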
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote: > On 10/23/20 4:58 AM, Peter Zijlstra wrote: >> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote: >> So shouldn't we then fix the drivers / interface first, to get rid of >> this inconsistency? >> > Considering we agree that excess vector is a problem that needs to be > solved across all the drivers and that you are comfortable with the other > three patches in the set. If I may suggest the following: > > - We can pick those three patches for now, as that will at least fix a > driver that is currently impacting RT workloads. Is that a fair > expectation? No. Blindly reducing the maximum vectors to the number of housekeeping CPUs is patently wrong. The PCI core _cannot_ just willy-nilly decide what the right number of interrupts for this situation is. Many of these drivers need more than just queue interrupts (admin, error interrupts), and some operate best with separate RX/TX interrupts per queue. They all can "work" with a single PCI interrupt of course, but the price you pay is performance. An isolated setup, which I'm familiar with, has two housekeeping CPUs. So far I restricted the number of network queues with a module argument to two, which allocates two management interrupts for the device and two interrupts (RX/TX) per queue, i.e. a total of six. Now I reduced the number of available interrupts to two according to your hack, which makes it use one queue RX/TX combined and one management interrupt. Guess what happens? Network performance tanks to the point that it breaks a carefully crafted setup. The same applies to a device which is application specific and wants one channel including an interrupt per isolated application core. Today I can isolate 8 out of 12 CPUs and let the device create 8 channels and set one interrupt and channel affine to each isolated CPU. With your hack, I get only 4 interrupts and channels. Fail! You cannot declare that all this is perfectly fine, just because it does not matter for your particular use case. So without information from the driver which tells what the best number of interrupts is with a reduced number of CPUs, this cutoff will cause more problems than it solves. Regressions guaranteed. Managed interrupts base their interrupt allocation and spreading on information which is handed in by the individual driver and not on crude assumptions. They are not imposing restrictions on the use case. It's perfectly fine for isolated work to save a data set to disk after computation has finished and that just works with the per-cpu I/O queue which is otherwise completely silent. All isolated workers can do the same in parallel without trampling on each other's toes by competing for a reduced number of queues which are affine to the housekeeper CPUs. Unfortunately network multi-queue is substantially different from block multi-queue (as I learned in this conversation), so the concept cannot be applied one-to-one to networking as is. But there are certainly parts of it which can be reused. This needs a lot more thought than just these crude hacks. Especially under the aspect that there are talks about making isolation runtime switchable. Are you going to rmmod/insmod the i40e network driver to do so? That's going to work fine if you do that reconfiguration over network... Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Thu, Oct 22 2020 at 09:28, Marcelo Tosatti wrote: > On Wed, Oct 21, 2020 at 10:25:48PM +0200, Thomas Gleixner wrote: >> The right answer to this is to utilize managed interrupts and have >> according logic in your network driver to handle CPU hotplug. When a CPU >> goes down, then the queue which is associated to that CPU is quiesced >> and the interrupt core shuts down the relevant interrupt instead of >> moving it to an online CPU (which causes the whole vector exhaustion >> problem on x86). When the CPU comes online again, then the interrupt is >> reenabled in the core and the driver reactivates the queue. > > Aha... But it would be necessary to do that from userspace (for runtime > isolate/unisolate). For anything which uses managed interrupts this is a non-problem and userspace has absolutely no business with it. Isolation does not shut down queues, at least not the block multi-queue ones which are only active when I/O is issued from that isolated CPU. So transitioning out of isolation requires no action at all. Transitioning in or changing the housekeeping mask needs some trivial tweak to handle the case where there is an overlap in the cpuset of a queue (housekeeping and isolated). This is handled already for setup and affinity changes, but of course not for runtime isolation mask changes, but that's a trivial thing to do. What's more interesting is how to deal with the network problem where there is no guarantee that the "response" ends up on the same queue as the "request" which is what the block people rely on. And that problem is not really an interrupt affinity problem in the first place. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Wed, Oct 21 2020 at 17:02, Jakub Kicinski wrote: > On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote: >> The right answer to this is to utilize managed interrupts and have >> according logic in your network driver to handle CPU hotplug. When a CPU >> goes down, then the queue which is associated to that CPU is quiesced >> and the interrupt core shuts down the relevant interrupt instead of >> moving it to an online CPU (which causes the whole vector exhaustion >> problem on x86). When the CPU comes online again, then the interrupt is >> reenabled in the core and the driver reactivates the queue. > > I think Mellanox folks made some forays into managed irqs, but I don't > remember/can't find the details now. > > For networking the locality / queue per core does not always work, > since the incoming traffic is usually spread based on a hash. Many That makes it problematic and is fundamentally different from block I/O. > applications perform better when network processing is done on a small > subset of CPUs, and application doesn't get interrupted every 100us. > So we do need extra user control here. Ok. > We have a bit of a uAPI problem since people had grown to depend on > IRQ == queue == NAPI to configure their systems. "The right way" out > would be a proper API which allows associating queues with CPUs rather > than IRQs, then we can use managed IRQs and solve many other problems. > > Such new API has been in the works / discussions for a while now. If there is anything which needs to be done/extended on the irq side please let me know. Thanks tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote: > On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote: >> However, IMHO we would still need a logic to prevent the devices from >> creating excess vectors. > > Managed interrupts are preventing exactly that by pinning the interrupts > and queues to one or a set of CPUs, which prevents vector exhaustion on > CPU hotplug. > > Non-managed, yes that is and always was a problem. One of the reasons > why managed interrupts exist. But why is this only a problem for isolation? The very same problem exists vs. CPU hotplug and therefore hibernation. On x86 we have at max. 204 vectors available for device interrupts per CPU. So assuming the only device interrupts in use are for networking, any machine which has more than 204 network interrupts (queues, aux ...) active cannot hibernate. Aside of that it's silly to have multiple queues targeted at a single CPU in case of hotplug. And that's not a theoretical problem. Some power management schemes shut down sockets when the utilization of a system is low enough, e.g. outside of working hours. The whole point of multi-queue is to have locality so that traffic from a CPU goes through the CPU local queue. What's the point of having two or more queues on a CPU in case of hotplug? The right answer to this is to utilize managed interrupts and have according logic in your network driver to handle CPU hotplug. When a CPU goes down, then the queue which is associated to that CPU is quiesced and the interrupt core shuts down the relevant interrupt instead of moving it to an online CPU (which causes the whole vector exhaustion problem on x86). When the CPU comes online again, then the interrupt is reenabled in the core and the driver reactivates the queue. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote: > On 10/20/20 10:16 AM, Thomas Gleixner wrote: >> With the above change this will result >> >>1 general interrupt which is free movable by user space >>1 managed interrupts (possible affinity to all 16 CPUs, but routed >> to housekeeping CPU as long as there is one online) >> >> So the device is now limited to a single queue which also affects the >> housekeeping CPUs because now they have to share a single queue. >> >> With larger machines this gets even worse. > > Yes, the change can impact the performance, however, if we don't do that we > may have a latency impact instead. Specifically, on larger systems where > most of the CPUs are isolated as we will definitely fail in moving all of the > IRQs away from the isolated CPUs to the housekeeping. For non managed interrupts I agree. >> So no. This needs way more thought for managed interrupts and you cannot >> do that at the PCI layer. > > Maybe we should not be doing anything in the case of managed IRQs as they > are anyways pinned to the housekeeping CPUs as long as we have the > 'managed_irq' option included in the kernel cmdline. Exactly. For the PCI side this vector limiting has to be restricted to the non managed case. >> Only the affinity spreading mechanism can do >> the right thing here. > > I can definitely explore this further. > > However, IMHO we would still need a logic to prevent the devices from > creating excess vectors. Managed interrupts are preventing exactly that by pinning the interrupts and queues to one or a set of CPUs, which prevents vector exhaustion on CPU hotplug. Non-managed, yes that is and always was a problem. One of the reasons why managed interrupts exist. Thanks, tglx
Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
On Mon, Sep 28 2020 at 14:35, Nitesh Narayan Lal wrote:
>
> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +	/*
> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> +	 * to the housekeeping CPUs from the userspace by changing their
> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +	 * running out of IRQ vectors.
> +	 */

This is not true for managed interrupts. The interrupt affinity of those cannot be changed by user space.

> +	if (hk_cpus < num_online_cpus()) {
> +		if (hk_cpus < min_vecs)
> +			max_vecs = min_vecs;
> +		else if (hk_cpus < max_vecs)
> +			max_vecs = hk_cpus;
> +	}

So now with that assume a 16 core machine (HT off for simplicity) and 17 requested interrupts (1 general, 16 queues). Managed interrupts will allocate:

  - 1 general interrupt which is freely movable by user space
  - 16 managed interrupts for queues (one per CPU)

This allows the driver to have 16 queues, i.e. one queue per CPU. These interrupts are only used when an application on a CPU issues I/O. With the above change this will result in:

  - 1 general interrupt which is freely movable by user space
  - 1 managed interrupt (possible affinity to all 16 CPUs, but routed to a housekeeping CPU as long as there is one online)

So the device is now limited to a single queue which also affects the housekeeping CPUs because now they have to share a single queue. With larger machines this gets even worse. So no. This needs way more thought for managed interrupts and you cannot do that at the PCI layer. Only the affinity spreading mechanism can do the right thing here. Thanks, tglx
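A tiny runnable model of the quoted clamping logic, fed with the numbers from the example above; min_vecs = 1 and two housekeeping CPUs are assumptions chosen to reproduce the described effect:

	/* Toy model of the quoted max_vecs clamping, not the actual PCI code. */
	#include <stdio.h>

	static unsigned int clamp_vecs(unsigned int min_vecs, unsigned int max_vecs,
				       unsigned int hk_cpus, unsigned int online_cpus)
	{
		if (hk_cpus < online_cpus) {
			if (hk_cpus < min_vecs)
				max_vecs = min_vecs;
			else if (hk_cpus < max_vecs)
				max_vecs = hk_cpus;
		}
		return max_vecs;
	}

	int main(void)
	{
		/* 16 core machine, 17 requested vectors (1 general + 16 queues),
		 * 2 housekeeping CPUs assumed: the request collapses to 2 vectors,
		 * i.e. 1 general + 1 queue interrupt for the whole device.
		 */
		printf("granted max_vecs: %u\n", clamp_vecs(1, 17, 2, 16));
		return 0;
	}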
Re: Remove __napi_schedule_irqoff?
On Sun, Oct 18 2020 at 10:19, Jakub Kicinski wrote: > On Sun, 18 Oct 2020 10:20:41 +0200 Heiner Kallweit wrote: >> >> Otherwise a non-solution could be to make IRQ_FORCED_THREADING >> >> configurable. >> > >> > I have to say I do not understand why we want to defer to a thread the >> > hard IRQ that we use in NAPI model. >> > >> Seems like the current forced threading comes with the big hammer and >> thread-ifies all hard irq's. To avoid this all NAPI network drivers >> would have to request the interrupt with IRQF_NO_THREAD. In a !RT kernel, forced threading (via commandline option) is mostly a debug aid. It's pretty useful when something crashes in hard interrupt context which usually takes the whole machine down. It's rather unlikely to be used on production systems, and if so then the admin surely should know what he's doing. > Right, it'd work for some drivers. Other drivers try to take spin locks > in their IRQ handlers. I checked a few which do and some of these spinlocks just protect register access and are not used for more complex serialization. So these could be converted to raw spinlocks because their scope is short and limited. But yes, you are right that this might be an issue in general. > What gave me a pause was that we have a busy loop in napi_schedule_prep:
>
> bool napi_schedule_prep(struct napi_struct *n)
> {
> 	unsigned long val, new;
>
> 	do {
> 		val = READ_ONCE(n->state);
> 		if (unlikely(val & NAPIF_STATE_DISABLE))
> 			return false;
> 		new = val | NAPIF_STATE_SCHED;
>
> 		/* Sets STATE_MISSED bit if STATE_SCHED was already set
> 		 * This was suggested by Alexander Duyck, as compiler
> 		 * emits better code than :
> 		 * if (val & NAPIF_STATE_SCHED)
> 		 *     new |= NAPIF_STATE_MISSED;
> 		 */
> 		new |= (val & NAPIF_STATE_SCHED) / NAPIF_STATE_SCHED *
> 						   NAPIF_STATE_MISSED;
> 	} while (cmpxchg(&n->state, val, new) != val);
>
> 	return !(val & NAPIF_STATE_SCHED);
> }
>
> Dunno how acceptable this is to run in an IRQ handler on RT.. In theory it's bad, but I don't think it's a big deal in reality. Thanks, tglx
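For the register-access case mentioned above, such a conversion would look roughly like the sketch below. The device structure and register offsets are invented; the point is that raw_spinlock_t stays a spinning lock on PREEMPT_RT, which is only tolerable because the critical section is a couple of register accesses:

	/* Hypothetical sketch of converting a short, register-only lock. */
	#include <linux/spinlock.h>
	#include <linux/io.h>

	struct mydev {
		raw_spinlock_t reg_lock;	/* was: spinlock_t reg_lock; */
		void __iomem *base;
	};

	static u32 mydev_read_clear_status(struct mydev *d)
	{
		unsigned long flags;
		u32 status;

		/* was: spin_lock_irqsave(&d->reg_lock, flags); */
		raw_spin_lock_irqsave(&d->reg_lock, flags);
		status = readl(d->base + 0x10);		/* 0x10: made-up status register */
		writel(status, d->base + 0x10);		/* write-1-to-clear, illustrative */
		raw_spin_unlock_irqrestore(&d->reg_lock, flags);

		return status;
	}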
Re: Remove __napi_schedule_irqoff?
Jakub, On Sat, Oct 17 2020 at 16:29, Jakub Kicinski wrote: > On Sat, 17 Oct 2020 15:45:57 +0200 Heiner Kallweit wrote: >> It turned out that this most of the time isn't safe in certain >> configurations: >> - if CONFIG_PREEMPT_RT is set >> - if command line parameter threadirqs is set >> >> Having said that drivers are being switched back to __napi_schedule(), >> see e.g. patch in [0] and related discussion. I thought about a >> __napi_schedule version checking dynamically whether interrupts are >> disabled. But checking e.g. variable force_irqthreads also comes at >> a cost, so that we may not see a benefit compared to calling >> local_irq_save/local_irq_restore. This does not have to be a variable check. It's trivial enough to make it a static key. >> If more or less all users have to switch back, then the question is >> whether we should remove __napi_schedule_irqoff. >> Instead of touching all users we could make __napi_schedule_irqoff >> an alias for __napi_schedule for now. >> >> [0] https://lkml.org/lkml/2020/10/8/706 > > We're effectively calling raise_softirq_irqoff() from IRQ handlers, > with force_irqthreads == true that's no longer legal. Hrmpf, indeed. When force threading was introduced that did not exist. The forced threaded handler is always invoked with bottom halfs disabled and bottom half processing either happens when the handler returns and the thread wrapper invokes local_bh_enable() or from ksoftirq. As everything runs in thread context CPU local serialization through local_bh_disable() is sufficient. > Thomas - is the expectation that IRQ handlers never assume they have > IRQs disabled going forward? We don't have any performance numbers > but if I'm reading Agner's tables right POPF is 18 cycles on Broadwell. > Is PUSHF/POPF too cheap to bother? It's not only PUSHF/POPF it's PUSHF,CLI -> POPF, but yeah it's pretty cheap nowadays. But doing the static key change might still be a good thing. Completely untested patch below. Quoting Eric: > I have to say I do not understand why we want to defer to a thread the > hard IRQ that we use in NAPI model. > > Whole point of NAPI was to keep hard irq handler very short. Right. In case the interrupt handler is doing not much more than scheduling NAPI then you can request it with IRQF_NO_THREAD, which will prevent it from being threaded even on RT. > We should focus on transferring the NAPI work (potentially disrupting > ) to a thread context, instead of the very minor hard irq trigger. Read about that. I only looked briefly at the patches and wondered why this has it's own threading mechanism and is not using the irq thread mechanics. I'll have a closer look in the next days. 
Thanks, tglx --- --- a/drivers/ide/ide-iops.c +++ b/drivers/ide/ide-iops.c @@ -109,7 +109,6 @@ int __ide_wait_stat(ide_drive_t *drive, ide_hwif_t *hwif = drive->hwif; const struct ide_tp_ops *tp_ops = hwif->tp_ops; unsigned long flags; - bool irqs_threaded = force_irqthreads; int i; u8 stat; @@ -117,7 +116,7 @@ int __ide_wait_stat(ide_drive_t *drive, stat = tp_ops->read_status(hwif); if (stat & ATA_BUSY) { - if (!irqs_threaded) { + if (!force_irqthreads_active()) { local_save_flags(flags); local_irq_enable_in_hardirq(); } @@ -133,13 +132,13 @@ int __ide_wait_stat(ide_drive_t *drive, if ((stat & ATA_BUSY) == 0) break; - if (!irqs_threaded) + if (!force_irqthreads_active()) local_irq_restore(flags); *rstat = stat; return -EBUSY; } } - if (!irqs_threaded) + if (!force_irqthreads_active()) local_irq_restore(flags); } /* --- a/drivers/ide/ide-taskfile.c +++ b/drivers/ide/ide-taskfile.c @@ -406,7 +406,8 @@ static ide_startstop_t pre_task_out_intr return startstop; } - if (!force_irqthreads && (drive->dev_flags & IDE_DFLAG_UNMASK) == 0) + if (!force_irqthreads_active() && + (drive->dev_flags & IDE_DFLAG_UNMASK) == 0) local_irq_disable(); ide_set_handler(drive, &task_pio_intr, WAIT_WORSTCASE); --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -489,12 +489,16 @@ extern int irq_set_irqchip_state(unsigne #ifdef CONFIG_IRQ_FORCED_THREADING # ifdef CONFIG_PREEMPT_RT -# define force_irqthreads (true) +static inline bool force_irqthreads_active(void) { return true; } # else -extern bool force_irqthreads; +extern struct static_key_false force_irqthreads_key; +static inline bool force_irqthreads_active(void) +{ + return static_branch_unlikely(&force_irqthreads_ke
Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
On Sat, Oct 17 2020 at 16:15, Alex Belits wrote: > On Sat, 2020-10-17 at 18:08 +0200, Thomas Gleixner wrote: >> On Sat, Oct 17 2020 at 01:08, Alex Belits wrote: >> > I think that the goal of "finding source of disturbance" interface >> > is >> > different from what can be accomplished by tracing in two ways: >> > >> > 1. "Source of disturbance" should provide some useful information >> > about >> > category of event and it cause as opposed to determining all >> > precise >> > details about things being called that resulted or could result in >> > disturbance. It should not depend on the user's knowledge about >> > details >> >> Tracepoints already give you selectively useful information. > > Carefully placed tracepoints also can give the user information about > failures of open(), write(), execve() or mmap(). However syscalls still > provide an error code instead of returning generic failure and letting > user debug the cause. I have absolutely no idea what you are trying to tell me. Thanks, tglx
Re: [EXT] Re: [PATCH v4 03/13] task_isolation: userspace hard isolation from kernel
On Sat, Oct 17 2020 at 01:08, Alex Belits wrote: > On Mon, 2020-10-05 at 14:52 -0400, Nitesh Narayan Lal wrote: >> On 10/4/20 7:14 PM, Frederic Weisbecker wrote: > I think that the goal of "finding source of disturbance" interface is > different from what can be accomplished by tracing in two ways: > > 1. "Source of disturbance" should provide some useful information about > category of event and it cause as opposed to determining all precise > details about things being called that resulted or could result in > disturbance. It should not depend on the user's knowledge about > details Tracepoints already give you selectively useful information. > of implementations, it should provide some definite answer of what > happened (with whatever amount of details can be given in a generic > mechanism) even if the user has no idea how those things happen and > what part of kernel is responsible for either causing or processing > them. Then if the user needs further details, they can be obtained with > tracing. It's just a matter of defining the tracepoint at the right place. > 2. It should be usable as a runtime error handling mechanism, so the > information it provides should be suitable for application use and > logging. It should be usable when applications are running on a system > in production, and no specific tracing or monitoring mechanism can be > in use. That's a strawman really. There is absolutely no reason why a specific set of tracepoints cannot be enabled on a production system. Your tracker is a monitoring mechanism, just a different flavour. By your logic above it cannot be enabled on a production system either. Also you can enable tracepoints from a control application, consume, log and act upon them. It's not any different from opening some magic isolation tracker interface. There are even multiple ways to do that including libraries. > If, say, thousands of devices are controlling neutrino detectors on an > ocean floor, and in a month of work one of them got one isolation > breaking event, it should be able to report that isolation was broken > by an interrupt from a network interface, so the users will be able to > track it down to some userspace application reconfiguring those > interrupts. Tracing can do that and it can do it selectively on the isolated CPUs. It's just a matter of proper configuration and usage. > It will be a good idea to make such mechanism optional and suitable for > tracking things on conditions other than "always enabled" and "enabled > with task isolation". Tracing already provides that. Tracepoints are individually controlled and filtered. > However in my opinion, there should be something in kernel entry > procedure that, if enabled, prepared something to be filled by the > cause data, and we know at least one such situation when this kernel > entry procedure should be triggered -- when task isolation is on. A tracepoint will gather that information for you. task isolation is not special, it's just yet another way to configure and use a system and tracepoints provide everything you need with the bonus that you can gather more correlated information when you need it. In fact tracing and tracepoints have replaced all specialized trackers which were in the kernel before tracing was available. We're not going to add a new one just because. If there is anything which you find that tracing and tracepoints cannot provide then the obvious solution is to extend that infrastructure so it can serve your usecase. Thanks, tglx
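As a concrete illustration of enabling tracing selectively for isolated CPUs: the sketch below limits tracing to CPU 3 and switches on the IRQ entry tracepoint through tracefs. The mount point and the CPU choice are assumptions for the example; it needs root and error handling is minimal:

	/* Illustrative only: trace interrupts hitting CPU 3 via tracefs. */
	#include <stdio.h>

	static int write_str(const char *path, const char *val)
	{
		FILE *f = fopen(path, "w");

		if (!f) {
			perror(path);
			return -1;
		}
		fputs(val, f);
		fclose(f);
		return 0;
	}

	int main(void)
	{
		/* Assumes tracefs here; older setups use /sys/kernel/debug/tracing. */
		const char *tr = "/sys/kernel/tracing";
		char path[256];

		snprintf(path, sizeof(path), "%s/tracing_cpumask", tr);
		write_str(path, "8\n");		/* hex cpumask: CPU 3 only */

		snprintf(path, sizeof(path), "%s/events/irq/irq_handler_entry/enable", tr);
		write_str(path, "1\n");

		printf("read %s/trace_pipe to see interrupts on CPU 3\n", tr);
		return 0;
	}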
Re: [PATCH 0/7] TC-ETF support PTP clocks series
Andreas, On Wed, Oct 14 2020 at 09:12, Andreas Meisinger wrote: > Sorry about the wrong format/style of my last mail, hope I did get it > right this time. No problem. Yes this looks better. The only thing which could be improved is that your mail client fails to add In-Reply-To: References: ... headers and instead has the MS lookout specific Thread-Topic: [PATCH 0/7] TC-ETF support PTP clocks series Thread-Index: AdaiB8a+x+RhfhtwSZ+NKfvRdyiJkw== headers. If you look at the lore archive you see the effect: https://lore.kernel.org/r/am0pr10mb3073f9694ecad4f612a86fa7fa...@am0pr10mb3073.eurprd10.prod.outlook.com and that happens to all mail clients which use threading based on the standard headers. There is a config knob in lookout to enable them. > Let me first point at an important thing because we did have > discussions here about it too. As of the manpages Linux CLOCK_TAI > seems to be defined as an nonsettable clock which does have the same > behaviour as of international atomic time TAI. Yet if it's nonsettable > it can't be linked or synchronized to the value of International > Atomic Time? > > On the other hand there seems to be code in place to change the > CLOCK_TAI offset thus making it basically sort of "setable"? It obviously needs to be modifiable in some way, otherwise synchronization to a master clock via PTP would not work at all. But it cannot be set in the way of settimeofday()/clock_settime() like CLOCK_REALTIME. >> The user space daemon which does the correlation between these PTP >> domains and TAI is required in any case, so the magic clock >> TAI_PRIVATE is not having any advantage. > I think a userspace daemon handling the translation information > between different clocks would be fine. I think it's not really that > relevant who exactly does apply the translation. Kernel level might be > a little bit more precise but I guess it'd depend on other factors > too. Not really. The kernel still provides the timestamp pairs/triplets in the best way the underlying hardware provides it. Some hardware can even provide hardware-assisted pairs of ART and PTP clock. > Yet all translation based approaches have in common, setting a clock, > renders translations done in past invalid. It would be required to fix > old translated values along with setting the clock. I assume we > couldn't distinguish between "translated" values and genuine values > after translation, so fixing them might not be possible at all. CLOCK_TAI is not really set after the initial synchronization. It's frequency corrected without causing jumps. The PTP daemon uses a PLL-based algorithm for that. Of course this adjustment has side effects for translation. > In our usecase we do have a clock for network operation which can't be > set. We can guarantee this because we are able to define the network > not being operational when the defined time is not available 😉. > Having this defined the remaining option would be the target clock to > be set. As of your suggestion that would be CLOCK_TAI. So at this > point "setable" or "nonsettable" would become important. Here > "setable" would introduce a dependency between the clocks. Being > independent from clocks outside of our control was exactly the reason > to introduce the separate network clock. To me this means if CLOCK_TAI > could be changed by anything it couldn't be the base clock for our > usecase if it can't it might be a solution. It's under your control as system designer how you operate CLOCK_TAI. 
If you use the PTP daemon then it will sync CLOCK_TAI to the PTP clock of your choice. If you don't have PTP at all then you can use NTP to sync to a time server, which is less accurate. You can use PPS or just use nothing. The kernel does not care which non-standard base time or frequency you choose. Applications which communicate over network might care if the other side uses a different time universe. Log files which start at 1971 might be interesting to analyse against the log file of your other system which starts in 2020. >> Depending on the frequency drift between CLOCK_TAI and clock PTP/$N >> the timer expiry might be slightly inaccurate, but surely not more >> inaccurate than if that conversion is done purely in user space. >> >> The self rearming posix timers would work too, but the self rearming >> is based on CLOCK_TAI, so rounding errors and drift would be >> accumulative. So I'd rather stay away from them. > > As of the time ranges typically used in tsn networks the drift error > for single shot timers most likely isn't a big deal. But as you > suggest I'd stay away from long running timers as well rearming ones > too, I guess they'd be mostly unusable. Depends. It's a matter of hardware properties, system requirements, and application/system designers' decisions. So what you consider unusable for your system might be perfectly fine for others. If we add support for this type of correlation then of course these things need to be documented. > Rig
Re: [PATCH 3/3] timekeeping: remove arch_gettimeoffset
On Thu, Oct 08 2020 at 17:46, Arnd Bergmann wrote: > With Arm EBSA110 gone, nothing uses it any more, so the corresponding > code and the Kconfig option can be removed. Yay! It only took 11+ years to get rid of that. Feel free to route it through your tree. Acked-by: Thomas Gleixner
Re: AW: [PATCH 0/7] TC-ETF support PTP clocks series
Andreas, On Fri, Oct 09 2020 at 11:17, Andreas Meisinger wrote: please do not top-post and trim your replies. > Yet we do already have usecases where this can't be done. Additionally > a lot of discussions at this topic are ongoing in 60802 profile > creation too. Some of our usecases do require a network which does > not depend on any external timesource. This might be due to the > network not being connected (to the internet) or just because the > network may not be able to rely on or trust an external > timesource. Some reasons for this might be safety, security, > availability or legal implications ( e.g. if a machine builder has to > guarantee operation of a machine which depends on an internal tsn > network). I'm aware of the reasons for these kinds of setups. > About your question if an application needs to be able to sync to > multiple timescales. A small count of usecases even would require > multiple independent timesources to be used. At the moment they all > seem to be located in the area of extreme high availability. There's > ongoing evaluation about this issues and we're not sure if there's a > way to do this without special hardware so we didn't address it here. Reading several raw PTP clocks is always possible through the existing interfaces and if the coordination between real TAI and the raw PTP clocks is available, then these interfaces could be extended to provide time normalized to real TAI. But that does not allow utilizing the magic clocks for arming timers, so these have to be based on some other clock and the application needs to do the conversion back and forth. Now I said that we could abuse time name spaces for providing access to _one_ magic TAI clock which lets the kernel do that work, but thinking more about it, it should be possible to do so for all of them even without name spaces. The user space daemon which does the correlation between these PTP domains and TAI is required in any case, so the magic clock TAI_PRIVATE does not have any advantage. If that correlation exists then at least clock_nanosleep() should be doable. So clock_nanosleep(clock PTP/$N) would convert the sleep time to TAI and queue a timer internally on the CLOCK_TAI base. Depending on the frequency drift between CLOCK_TAI and clock PTP/$N the timer expiry might be slightly inaccurate, but surely not more inaccurate than if that conversion is done purely in user space. The self rearming posix timers would work too, but the self rearming is based on CLOCK_TAI, so rounding errors and drift would be accumulative. So I'd rather stay away from them. If there is no daemon which manages the correlation then the syscall would fail. If such a coordination exists, then the whole problem in the TSN stack is gone. The core can always operate on TAI and the network device which runs in a different time universe would use the same conversion data e.g. to queue a packet for HW based time triggered transmission. Again subject to slight inaccuracy, but it does not come with all the problems of dynamic clocks, locking issues etc. As the frequency drift between PTP domains is neither fast changing nor randomly jumping around the inaccuracy might even be a mostly academic problem. Thoughts? Thanks, tglx
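For reference, this is roughly what the existing dynamic POSIX clock interface looks like from user space: the PHC is read through a clockid derived from its open character device, and a correlation daemon would sample pairs like this next to CLOCK_TAI. The /dev/ptp0 path is an assumption and error handling is trimmed:

	/* Read a PHC via its dynamic posix clockid next to CLOCK_TAI. */
	#include <fcntl.h>
	#include <stdio.h>
	#include <time.h>
	#include <unistd.h>

	/* Dynamic posix clocks encode the open file descriptor (CLOCKFD = 3). */
	#define FD_TO_CLOCKID(fd)	((~(clockid_t)(fd) << 3) | 3)

	#ifndef CLOCK_TAI
	#define CLOCK_TAI 11		/* uapi value; some libcs don't expose it */
	#endif

	int main(void)
	{
		int fd = open("/dev/ptp0", O_RDONLY);	/* assumed PHC device node */
		struct timespec tai, phc;

		if (fd < 0) {
			perror("/dev/ptp0");
			return 1;
		}
		clock_gettime(CLOCK_TAI, &tai);
		clock_gettime(FD_TO_CLOCKID(fd), &phc);

		printf("TAI: %lld.%09ld  PHC: %lld.%09ld\n",
		       (long long)tai.tv_sec, tai.tv_nsec,
		       (long long)phc.tv_sec, phc.tv_nsec);
		close(fd);
		return 0;
	}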
Re: [PATCH 0/7] TC-ETF support PTP clocks series
Vinicius, On Fri, Oct 02 2020 at 12:01, Vinicius Costa Gomes wrote: > I think that there's an underlying problem/limitation that is the cause > of the issue (or at least a step in the right direction) you are trying > to solve: the issue is that PTP clocks can't be used as hrtimers. That's only an issue if PTP time != CLOCK_TAI, which is insane to begin with. As I know, these insanities exist in real world setups, e.g. grand clock masters which start at the epoch, which causes complete disaster when any of the slave devices booted earlier. Obviously people came up with system designs which are even more insane. > I didn't spend a lot of time thinking about how to solve this (the only > thing that comes to mind is having a timecounter, or similar, "software > view" over the PHC clock). There are two aspects:

1) What's the overall time coordination especially for applications?

   PTP is for a reason based on TAI which allows a universal representation of time. Strict monotonic, no time zones, no leap seconds, no bells and whistles. Using TAI in distributed systems solved a gazillion hard problems in one go. TSN depends on PTP and that obviously makes CLOCK_TAI _the_ clock of choice for schedules and whatever is needed. It just solves the problem nicely and we spent a great amount of time to make application development for TSN reasonable and hardware agnostic. Now industry comes along and decides to introduce independent time universes. The result is a black hole for programmers because they now have to waste effort - again - on solving the incredibly hard problems of warping space and time. The amount of money saved by not having properly coordinated time bases in such systems is definitely marginal compared to the amount of non-sensical work required to fix it in software.

2) How can an OS provide halfway usable interfaces to handle this trainwreck?

   Access to the various time universes is already available through the dynamic POSIX clocks. But these interfaces have been designed for the performance insensitive work of PTP daemons and not for the performance critical work of applications dealing with real-time requirements of all sorts. As these raw PTP clocks are hardware dependent and only known at boot / device discovery time they cannot be exposed to the kernel internally in any sane way. Also the user space interface has to be dynamic which rules out the ability to assign fixed CLOCK_* ids. As a consequence these clocks cannot provide timers like the regular CLOCK_* variants do, which makes it insanely hard to develop sane and portable applications.

What comes to my mind (without spending much thought on it) is:

1) Utilize and extend the existing PTP mechanisms to calculate the time relationship between the system wide CLOCK_TAI and the uncoordinated time universe. As offset is a constant and frequency drift is not a high speed problem this can be done with a userspace daemon of some sorts.

2) Provide CLOCK_TAI_PRIVATE which defaults to CLOCK_TAI, i.e. offset = 0 and frequency ratio = 1 : 1

3) (Ab)use the existing time namespace to provide a mechanism to adjust the offset and frequency ratio of CLOCK_TAI_PRIVATE which is calculated by #1

This is the really tricky part and comes with severe limitations:

  - We can't walk the task list to find tasks which have their CLOCK_TAI_PRIVATE associated with a particular incarnation of PHC/PTP universe, so some sane referencing of the underlying parameters to convert TAI to TAI_PRIVATE and vice versa has to be found. 
Lifetime problems are going to be interesting to deal with.

  - An application cannot coordinate multiple PHC/PTP domains and has to restrict itself to pick ONE disjunct time universe. Whether that's a reasonable limitation I don't know, simply because the information provided in this patch series is close to zero.

  - Preventing early timer expiration caused by frequency drift is not trivial either.

TBH, just thinking about all of that makes me shudder and my knee jerk reaction is: NO WAY! Why the heck can't hardware people and system designers finally understand that time is not something they can define at their own peril? The "Let's solve it in software so I don't have to think about it" design approach strikes again. This caused headaches for the past five decades, but people obviously never learn. That said, I'm open for solutions which are at least in the proximity of sane, but that needs a lot more information about the use cases and the implications and not just some hand
Re: [PATCH 7/7] TC-ETF support PTP clocks
On Thu, Oct 01 2020 at 22:51, Erez Geva wrote: > - Add support for using a POSIX dynamic clock with > Traffic control Earliest TxTime First (ETF) Qdisc. > --- a/include/uapi/linux/net_tstamp.h > +++ b/include/uapi/linux/net_tstamp.h > @@ -167,6 +167,11 @@ enum txtime_flags { > SOF_TXTIME_FLAGS_MASK = (SOF_TXTIME_FLAGS_LAST - 1) | >SOF_TXTIME_FLAGS_LAST > }; > +/* > + * Clock ID to use with POSIX clocks > + * The ID must be u8 to fit in (struct sock)->sk_clockid > + */ > +#define SOF_TXTIME_POSIX_CLOCK_ID (0x77) Random number with a random name. > struct sock_txtime { > __kernel_clockid_t clockid;/* reference clockid */ > diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c > index c0de4c6f9299..8e3e0a61fa58 100644 > --- a/net/sched/sch_etf.c > +++ b/net/sched/sch_etf.c > @@ -15,6 +15,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -40,19 +41,40 @@ struct etf_sched_data { > struct rb_root_cached head; > struct qdisc_watchdog watchdog; > ktime_t (*get_time)(void); > +#ifdef CONFIG_POSIX_TIMERS > + struct posix_clock *pclock; /* pointer to a posix clock */ Tail comments suck because they disturb the reading flow and this comment has absolute zero value. Comments are required to explain things which are not obvious... > +#endif /* CONFIG_POSIX_TIMERS */ Also this #ifdeffery is bonkers. How is TSN supposed to work without POSIX_TIMERS in the first place? > static const struct nla_policy etf_policy[TCA_ETF_MAX + 1] = { > [TCA_ETF_PARMS] = { .len = sizeof(struct tc_etf_qopt) }, > }; > > +static inline ktime_t get_now(struct Qdisc *sch, struct etf_sched_data *q) > +{ > +#ifdef CONFIG_POSIX_TIMERS > + if (IS_ERR_OR_NULL(q->get_time)) { > + struct timespec64 ts; > + int err = posix_clock_gettime(q->pclock, &ts); > + > + if (err) { > + pr_warn("Clock is disabled (%d) for queue %d\n", > + err, q->queue); > + return 0; That's really useful error handling. > + } > + return timespec64_to_ktime(ts); > + } > +#endif /* CONFIG_POSIX_TIMERS */ > + return q->get_time(); > +} > + > static inline int validate_input_params(struct tc_etf_qopt *qopt, > struct netlink_ext_ack *extack) > { > /* Check if params comply to the following rules: >* * Clockid and delta must be valid. >* > - * * Dynamic clockids are not supported. > + * * Dynamic CPU clockids are not supported. >* >* * Delta must be a positive or zero integer. >* > @@ -60,11 +82,22 @@ static inline int validate_input_params(struct > tc_etf_qopt *qopt, >* expect that system clocks have been synchronized to PHC. >*/ > if (qopt->clockid < 0) { > +#ifdef CONFIG_POSIX_TIMERS > + /** > + * Use of PTP clock through a posix clock. > + * The TC application must open the posix clock device file > + * and use the dynamic clockid from the file description. What? How is the code which calls into this guaranteed to have a valid file descriptor open for a particular dynamic posix clock? > + */ > + if (!is_clockid_fd_clock(qopt->clockid)) { > + NL_SET_ERR_MSG(extack, > +"Dynamic CPU clockids are not > supported"); > + return -EOPNOTSUPP; > + } > +#else /* CONFIG_POSIX_TIMERS */ > NL_SET_ERR_MSG(extack, "Dynamic clockids are not supported"); > return -ENOTSUPP; > - } > - > - if (qopt->clockid != CLOCK_TAI) { > +#endif /* CONFIG_POSIX_TIMERS */ > + } else if (qopt->clockid != CLOCK_TAI) { > NL_SET_ERR_MSG(extack, "Invalid clockid. 
CLOCK_TAI must be > used"); > return -EINVAL; > } > @@ -103,7 +136,7 @@ static bool is_packet_valid(struct Qdisc *sch, struct > etf_sched_data *q, > return false; > > skip: > - now = q->get_time(); > + now = get_now(sch, q); Yuck. is_packet_valid() is invoked via:

  __dev_queue_xmit()
    __dev_xmit_skb()
      etf_enqueue_timesortedlist()
        is_packet_valid()

__dev_queue_xmit() does rcu_read_lock_bh(); and your get_now() does:

  posix_clock_gettime()
    down_read(&clk->rwsem);   <-- FAIL

down_read() might sleep and cannot be called from a BH disabled region. This clearly has never been tested with any mandatory debug option enabled. Why am I not surprised? Aside of accessing the PHC clock being slow as hell, this cannot ever work and there is no way to make it work in any consistent form. If you have several NICs on several PHC domains then all of these domains should have one thing in common: CLOCK_TAI and the frequency. If that'
Re: [PATCH 2/7] Function to retrieve main clock state
On Fri, Oct 02 2020 at 00:05, Thomas Gleixner wrote: > On Thu, Oct 01 2020 at 22:51, Erez Geva wrote: > > same comments as for patch 1 apply. > >> Add kernel function to retrieve main clock oscillator state. > > The function you are adding is named adjtimex(). adjtimex(2) is a well > known user space interface and naming a read only kernel interface the > same way is misleading. Aside of that there is no user for this function in this series. We're not adding interfaces just because we can. Thanks, tglx
Re: [PATCH 3/7] Functions to fetch POSIX dynamic clock object
On Thu, Oct 01 2020 at 22:51, Erez Geva wrote: > Add kernel functions to fetch a pointer to a POSIX dynamic clock > using a user file description dynamic clock ID. And how is that supposed to work. What are the lifetime rules? > +struct posix_clock *posix_clock_get_clock(clockid_t id) > +{ > + int err; > + struct posix_clock_desc cd; The core code uses reverse fir tree ordering of variable declaration based on the length: struct posix_clock_desc cd; int err; > + /* Verify we use posix clock ID */ > + if (!is_clockid_fd_clock(id)) > + return ERR_PTR(-EINVAL); > + > + err = get_clock_desc(id, &cd); So this is a kernel interface and get_clock_desc() does: struct file *fp = fget(clockid_to_fd(id)); How is that file descriptor valid in random kernel context? > + if (err) > + return ERR_PTR(err); > + > + get_device(cd.clk->dev); The purpose of this is? Comments are overrated... > + put_clock_desc(&cd); > + > + return cd.clk; > +} > +EXPORT_SYMBOL_GPL(posix_clock_get_clock); > + > +int posix_clock_put_clock(struct posix_clock *clk) > +{ > + if (IS_ERR_OR_NULL(clk)) > + return -EINVAL; > + put_device(clk->dev); > + return 0; > +} > +EXPORT_SYMBOL_GPL(posix_clock_put_clock); > + > +int posix_clock_gettime(struct posix_clock *clk, struct timespec64 *ts) > +{ > + int err; > + > + if (IS_ERR_OR_NULL(clk)) > + return -EINVAL; > + > + down_read(&clk->rwsem); Open coding the logic of get_posix_clock() and having a copy here and in the next function is really useful. Thanks, tglx
Re: [PATCH 5/7] Traffic control using high-resolution timer issue
On Thu, Oct 01 2020 at 22:51, Erez Geva wrote: Issue? You again fail to decribe the problem. > - Add new schedule function for Qdisc watchdog. > The function avoids reprogram if watchdog already expire > before new expire. Why can't the existing function not be changed to do that? > - Use new schedule function in ETF. > > - Add ETF range value to kernel configuration. > as the value is characteristic to Hardware. No. That's completely wrong. Hardware properties need to be established at boot/runtime otherwise you can't build a kernel which runs on different platforms. > +void qdisc_watchdog_schedule_soon_ns(struct qdisc_watchdog *wd, u64 expires, > + u64 delta_ns) schedule soon? That sounds like schedule it sooner than later, but I don't care. > +{ > + if (test_bit(__QDISC_STATE_DEACTIVATED, > + &qdisc_root_sleeping(wd->qdisc)->state)) > + return; > + > + if (wd->last_expires == expires) > + return; How is this supposed to be valid without checking whether the timer is queued in the first place? Maybe the function name should be schedule_soon_or_not() > + /** Can you please use proper comment style? This is neither network comment style nor the regular sane kernel comment style. It's kerneldoc comment style which is reserved for function and struct documentation. > + * If expires is in [0, now + delta_ns], > + * do not program it. This range info is just confusing. Just use plain english. > + */ > + if (expires <= ktime_to_ns(hrtimer_cb_get_time(&wd->timer)) + delta_ns) > + return; So if the watchdog is NOT queued and expiry < now + delta_ns then this returns and nothing starts the timer ever. That might all be correct, but without any useful explanation it looks completely bogus. > + /** > + * If timer is already set in [0, expires + delta_ns], > + * do not reprogram it. > + */ > + if (hrtimer_is_queued(&wd->timer) && > + wd->last_expires <= expires + delta_ns) > + return; > + > + wd->last_expires = expires; > + hrtimer_start_range_ns(&wd->timer, > +ns_to_ktime(expires), > +delta_ns, > +HRTIMER_MODE_ABS_PINNED); > +} > +EXPORT_SYMBOL(qdisc_watchdog_schedule_soon_ns); > + > void qdisc_watchdog_cancel(struct qdisc_watchdog *wd) > { > hrtimer_cancel(&wd->timer); > diff --git a/net/sched/sch_etf.c b/net/sched/sch_etf.c > index c48f91075b5c..48b2868c4672 100644 > --- a/net/sched/sch_etf.c > +++ b/net/sched/sch_etf.c > @@ -20,6 +20,11 @@ > #include > #include > > +#ifdef CONFIG_NET_SCH_ETF_TIMER_RANGE > +#define NET_SCH_ETF_TIMER_RANGE CONFIG_NET_SCH_ETF_TIMER_RANGE > +#else > +#define NET_SCH_ETF_TIMER_RANGE (5 * NSEC_PER_USEC) > +#endif > #define DEADLINE_MODE_IS_ON(x) ((x)->flags & TC_ETF_DEADLINE_MODE_ON) > #define OFFLOAD_IS_ON(x) ((x)->flags & TC_ETF_OFFLOAD_ON) > #define SKIP_SOCK_CHECK_IS_SET(x) ((x)->flags & TC_ETF_SKIP_SOCK_CHECK) > @@ -128,8 +133,9 @@ static void reset_watchdog(struct Qdisc *sch) > return; > } > > - next = ktime_sub_ns(skb->tstamp, q->delta); > - qdisc_watchdog_schedule_ns(&q->watchdog, ktime_to_ns(next)); > + next = ktime_sub_ns(skb->tstamp, q->delta + NET_SCH_ETF_TIMER_RANGE); > + qdisc_watchdog_schedule_soon_ns(&q->watchdog, ktime_to_ns(next), > + NET_SCH_ETF_TIMER_RANGE); This is changing 5 things at once. That's just wrong. patch 1: Add the new function and explain why it's required patch 2: Make reset_watchdog() use it patch 3: Add a mechanism to retrieve the magic hardware range from the underlying hardware driver and add that value to q->delta or set q->delta to it. Whatever makes sense. The current solution makes no sense at all Thanks, tglx
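To make the concern concrete, here is a toy model of the two quoted checks (not the kernel code). With the timer not queued and the expiry closer than delta_ns, the function returns without ever arming anything:

	/* Toy model of the quoted early-return logic, to illustrate the concern. */
	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	static bool timer_queued;
	static uint64_t last_expires;

	static void schedule_soon(uint64_t expires, uint64_t now, uint64_t delta_ns)
	{
		if (last_expires == expires)
			return;
		if (expires <= now + delta_ns)		/* "too close, don't program" */
			return;
		if (timer_queued && last_expires <= expires + delta_ns)
			return;
		last_expires = expires;
		timer_queued = true;
		printf("armed timer for %llu\n", (unsigned long long)expires);
	}

	int main(void)
	{
		/* Timer not queued, expiry only 2us away, delta_ns = 5us:
		 * the function returns without arming anything at all.
		 */
		schedule_soon(/*expires*/ 1002000, /*now*/ 1000000, /*delta_ns*/ 5000);
		printf("timer_queued = %d\n", timer_queued);	/* prints 0 */
		return 0;
	}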
Re: [PATCH 4/7] Fix qdisc_watchdog_schedule_range_ns range check
On Thu, Oct 01 2020 at 22:51, Erez Geva wrote: Fixes should be at the beginning of a patch series and not be hidden somewhere in the middle. >- As all parameters are unsigned. This is not a sentence and this list style does not make that changelog more readable. >- If 'expires' is bigger than 'last_expires' then the left expression > overflows. This would be the most important information and should be clearly spelled out as problem description at the very beginning of the change log. >- It is better to use addition and check both ends of the range. Is better? Either your change is correcting the problem or not. Just better but incorrect does not cut it. But let's look at the problem itself. The check is about: B <= A <= B + C A, B, C are all unsigned. So if B > A then the result is false. Now lets look at the implementation: if (A - B <= C) return; which works correctly due the wonders of unsigned math. For B <= A the check is obviously correct. If B > A then the result of the unsigned subtraction A - B is a very large positive number which is pretty much guaranteed to be larger than C, i.e. the result is false. So while not immediately obvious, it's still correct. Thanks, tglx
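A quick runnable check of the unsigned-arithmetic argument above (the numbers are arbitrary):

	/* Shows that "A - B <= C" mirrors "B <= A <= B + C" for unsigned values. */
	#include <stdint.h>
	#include <stdio.h>

	static int in_range(uint64_t a, uint64_t b, uint64_t c)
	{
		return a - b <= c;	/* the check under discussion */
	}

	int main(void)
	{
		printf("%d\n", in_range(105, 100, 10));	/* B <= A <= B + C: prints 1 */
		printf("%d\n", in_range(200, 100, 10));	/* A > B + C:       prints 0 */
		printf("%d\n", in_range(90, 100, 10));	/* A < B: wraps to a huge value, prints 0 */
		return 0;
	}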