On 09/26/2012 10:36 PM, Suresh Siddha wrote:
> On Wed, 2012-09-26 at 21:33 +0530, Srivatsa S. Bhat wrote:
>> I have some fundamental questions here:
>> 1. Why was the CPU never removed from the affinity masks in the original
>> code? I find it hard to believe that it was just an oversight, because the
>> whole point of fixup_irqs() is to affine the interrupts to other CPUs, IIUC.
>> So, is that really a bug or is the existing code correct for some reason
>> which I don't know of?
>
> I am not aware of the history but my guess is that the affinity mask
> which is coming from the user-space wants to be preserved. And
> fixup_irqs() is fixing the underlying interrupt routing when the cpu
> goes down
and the code that corresponds to that is: irq_force_complete_move(irq); is it?

> with a hope that things will be corrected when the cpu comes
> back online. But as Liu noted, we are not correcting the underlying
> routing when the cpu comes back online. I think we should fix that
> rather than modifying the user-specified affinity.
>

Hmm, I didn't entirely get your suggestion. Are you saying that we should change
data->affinity (by calling ->irq_set_affinity()) during offline but maintain a
copy of the original affinity mask somewhere, so that we can try to match it
when possible (i.e., when the CPU comes back online)?

>> 2. In case this is indeed a bug, why are the warnings ratelimited when the
>> interrupts can't be affined to other CPUs? Are they not serious enough to
>> report? Put more strongly, why do we even silently return with a warning
>> instead of reporting that the CPU offline operation failed?? Is that because
>> we have come way too far in the hotplug sequence and we can't easily roll
>> back? Or are we still actually OK in that situation?
>
> Are you referring to the "cannot set affinity for irq" messages?

Yes

> That happens only if the irq chip doesn't have the irq_set_affinity() setup.

That is my other point of concern: setting irq affinity can fail even if
we have ->irq_set_affinity() (if __ioapic_set_affinity() fails, for example).
Why don't we complain in that case? I think we should... and if it's serious
enough, abort the hotplug operation or at least indicate that offline failed.

> But that is not common.
>
>>
>> Suresh, I'd be grateful if you could kindly throw some light on these
>> issues... I'm actually debugging an issue where an offline CPU gets apic
>> timer interrupts (and in one case, I even saw a device interrupt), which I
>> have reported in another thread at: https://lkml.org/lkml/2012/9/26/119
>> But this issue in fixup_irqs() that Liu brought to light looks even more
>> surprising to me..
>
> These issues look different to me, will look into that.
>

Ok, thanks a lot!

Regards,
Srivatsa S. Bhat
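
For illustration only, the idea raised in the thread (keep the user-requested affinity untouched, and recompute the routing actually in use whenever a CPU goes offline or comes back) can be modelled in plain userspace C. This is a rough sketch under stated assumptions, not kernel code: `struct irq_model`, `irq_model_retarget()` and the 64-bit masks are hypothetical names invented here, and the fallback to "any online CPU" is an assumed policy, not something the thread settles on.

```c
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct irq_model {
	uint64_t requested;	/* mask the user asked for; hotplug never changes it */
	uint64_t effective;	/* mask currently used for routing */
};

/*
 * Recompute the effective routing from the preserved request and the
 * set of online CPUs (bit N set means CPU N is online).
 *
 * Returns 0 on success, or -1 if no CPU is online at all, so that the
 * caller can report the failure instead of ignoring it silently.
 */
static int irq_model_retarget(struct irq_model *irq, uint64_t online)
{
	uint64_t target = irq->requested & online;

	if (!target)
		target = online;	/* assumed fallback: any online CPU */
	if (!target)
		return -1;		/* nothing to route to; leave state as-is */

	irq->effective = target;
	return 0;
}
```

Calling `irq_model_retarget()` once when a CPU goes offline and once when it comes back would move the interrupt away and then restore the original routing, because `requested` is never overwritten. The explicit return value mirrors the concern above that a failed affinity change should be visible to the hotplug path.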