* Linus Torvalds <torva...@linux-foundation.org> wrote: > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR > bit clear" seems to be a real issue.
It's interesting in particular when it happens with an edge-triggered interrupt source: it's much harder to miss level triggered IRQs, which stay around until actively handled. Edge triggered irqs are more fragile to loss of event processing. > > Anyway, maybe this sheds some more light on this issue. I can > > reproduce this at will, so let me know of other experiments to do. Btw., could you please describe (again) what your current best method for reproduction is? It's been a long discussion ... > Somebody else who knows the apic needs to also take a look, but I'd > love to hear what the actual hardware interrupt is (from that > "vector_irq[vector]" thing above. > > I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they > are 0xf0-0xff), so it sounds like an external one. But that then > requires the whole mapping table thing. > > Ingo/Peter/Jiang - is there anything else useful we could print out? > I worry about the irq movement code. Can we add printk's to when an > irq is chasing from one CPU to another and doing that > "move_in_progress" thing? I've always been scared of that code. 1) So the irq_cfg::move_in_progress field originates from: 610142927b5b ("[PATCH] x86_64 irq: Safely cleanup an irq after moving it.") and I agree that it looks fragile, so it would be nice to see how much (if at all?) it gets used, by sticking a few pr_info()s in it. Debug patch for the vector movement state machine attached below. Untested. 2) But ... having taken a quick look at the vector movement handling, I am scared more than ever,and I cannot convince myself that it's race free. It's possibly harmless, but still, famous last words ... For example, most ->move_in_progress transitions happen with the vector_lock held - with the exception of send_cleanup_vector(): there we are only holding the desc->lock, but that's not enough! For example, this codepath: static void __irq_complete_move(struct irq_cfg *cfg, unsigned vector) { unsigned me; if (likely(!cfg->move_in_progress)) return; me = smp_processor_id(); if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain)) send_cleanup_vector(cfg); ... void send_cleanup_vector(struct irq_cfg *cfg) { ... cfg->move_in_progress = 0; } is blatantly racy, unless I missed something obvious? 2b) Initially I thought this only affects slowpaths like driver unregister or CPU hotplug, but I think this would be relevant for the fastpath of edge triggered irqs as well: void apic_ack_edge(struct irq_data *data) { irq_complete_move(irqd_cfg(data)); irq_move_irq(data); ack_APIC_irq(); } and irq_complete_move() checks and clears cfg->move_in_progress as listed above. So in most scenarios this is probably harmless, because it can in the worst case result in the missing of a ->move_in_progress flag setting. But it does not look harmless in the apic_set_affinity() code path: there we call into assign_irq_vector() without the desc->lock held, due to this gem in the IRQ core: int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m) { unsigned long flags; struct irq_desc *desc = irq_get_desc_lock(irq, &flags, IRQ_GET_DESC_CHECK_GLOBAL); if (!desc) return -EINVAL; desc->affinity_hint = m; irq_put_desc_unlock(desc, flags); /* set the initial affinity to prevent every interrupt being on CPU0 */ if (m) __irq_set_affinity(irq, m, false); return 0; } argh! Now if this ever crosses path with an assign_irq_vector() call on another CPU, then I really cannot see what would save us from deep trouble: we use cfg->vector while that is possibly being modified, right? So I'd suggest to put a printk into irq_set_affinity_hint() as well, to make sure it's not used during this test. In particular: drivers/virtio/virtio_pci_common.c: irq_set_affinity_hint(irq, NULL); drivers/virtio/virtio_pci_common.c: irq_set_affinity_hint(irq, mask); might be triggering it in the virtualization code... The second patch below adds the relevant pr_info(). (untested) Now this too might be unrelated, because the affinity hint was added ages ago, in: e7a297b0d7d6 ("genirq: Add CPU mask affinity hint") and the potentially racy nature of calling into set_affinity without the desc lock held was probably not realized back then. VirtIO's use of the affinity-hint was added a while ago as well, four years ago, in: 75a0a52be3c2 ("virtio: introduce an API to set affinity for a virtqueue") 2c) The other thing that worries me here is that we apparently send an IRQ-move IPI while having an un-acked local APIC (!). I have vague memories that this was problematic and fragile before. To ease my worries in this area I've attached a third patch below that changes the order around. This too is ancient behavior - but might be more prominent on certain hardware (and virtualization) variants, so it has a chance to be relevant. 3) Furthermore, I also noticed that the whole vector_lock()/unlock() business is actively misleading IMHO, the vector_lock looks local to vector.c, while we have these two calls that essentially export it ... Some of that damage was inflicted in this ancient commit: d388e5fdc461 ("x86: Restore proper vector locking during cpu hotplug") the proper approach would be to protect against hotplug events in assign_irq_vector(), not the other way around! Probably not a functional problem, yet somewhat disgusting. ------------------------------- So ... all in one, these various problems might be related to the regression or not, but they sure need addressing. Thanks, Ingo diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index 6cedd7914581..1dd1de9dd690 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -143,6 +143,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask) cpumask_andnot(cfg->old_domain, cfg->domain, tmp_mask); cfg->move_in_progress = cpumask_intersects(cfg->old_domain, cpu_online_mask); + if (cfg->move_in_progress) + pr_info("apic: vector %02x, same-domain move in progress\n", cfg->vector); cpumask_and(cfg->domain, cfg->domain, tmp_mask); break; } @@ -178,6 +180,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask) cpumask_copy(cfg->old_domain, cfg->domain); cfg->move_in_progress = cpumask_intersects(cfg->old_domain, cpu_online_mask); + if (cfg->move_in_progress) + pr_info("apic: vector %02x, new-domain move in progress\n", cfg->vector); } for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask) per_cpu(vector_irq, new_cpu)[vector] = irq; @@ -232,6 +236,7 @@ void clear_irq_vector(int irq, struct irq_cfg *cfg) } } cfg->move_in_progress = 0; + pr_info("apic: vector %02x, vector cleared, move completed\n", cfg->vector); raw_spin_unlock_irqrestore(&vector_lock, flags); } @@ -391,6 +396,7 @@ void send_cleanup_vector(struct irq_cfg *cfg) free_cpumask_var(cleanup_mask); } cfg->move_in_progress = 0; + pr_info("apic: vector %02x, sent cleanup vector, move completed\n", cfg->vector); } asmlinkage __visible void smp_irq_move_cleanup_interrupt(void) diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index c0a1100d911f..18300a7b8ed2 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -250,8 +250,10 @@ int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m) desc->affinity_hint = m; irq_put_desc_unlock(desc, flags); /* set the initial affinity to prevent every interrupt being on CPU0 */ - if (m) + if (m) { + pr_info("irq/core: Called irq_set_affinity_hint(%d)\n", irq); __irq_set_affinity(irq, m, false); + } return 0; } EXPORT_SYMBOL_GPL(irq_set_affinity_hint); diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index 6cedd7914581..833a981c5420 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -335,9 +340,11 @@ int apic_retrigger_irq(struct irq_data *data) void apic_ack_edge(struct irq_data *data) { + ack_APIC_irq(); + + /* Might generate IPIs, so do this after having ACKed the APIC: */ irq_complete_move(irqd_cfg(data)); irq_move_irq(data); - ack_APIC_irq(); } /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/