Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline
On 5/21/24 5:00 AM, Thomas Gleixner wrote:
> On Wed, May 15 2024 at 12:51, Dongli Zhang wrote:
>> On 5/13/24 3:46 PM, Thomas Gleixner wrote:
>>> So yes, moving the invocation of irq_force_complete_move() before the
>>> irq_needs_fixup() call makes sense, but it wants this to actually work
>>> correctly:
>>>
>>> @@ -1097,10 +1098,11 @@ void irq_force_complete_move(struct irq_
>>>  		goto unlock;
>>>
>>>  	/*
>>> -	 * If prev_vector is empty, no action required.
>>> +	 * If prev_vector is empty or the descriptor was previously
>>> +	 * not on the outgoing CPU no action required.
>>>  	 */
>>>  	vector = apicd->prev_vector;
>>> -	if (!vector)
>>> +	if (!vector || apicd->prev_cpu != smp_processor_id())
>>>  		goto unlock;
>>>
>>
>> The above may not work. migrate_one_irq() relies on
>> irq_force_complete_move() to always reclaim the apicd->prev_vector.
>> Otherwise, the call of irq_do_set_affinity() later may return -EBUSY.
>
> You're right. But that still can be handled in irq_force_complete_move()
> with a single unconditional invocation in migrate_one_irq():
>
>     cpu = smp_processor_id();
>
>     if (!vector || (apicd->cur_cpu != cpu && apicd->prev_cpu != cpu))
>         goto unlock;

The current affinity target is apicd->cpu :)

Thank you very much for the suggestion!

>
> because there are only two cases when a cleanup is required:
>
>  1) The outgoing CPU is the current target
>
>  2) The outgoing CPU was the previous target
>
> No?

I agree with this statement. My only concern is: while we use "apicd->cpu",
irq_needs_fixup() decides differently. It uses d->common->effective_affinity
or d->common->affinity to decide whether to move forward and migrate the
interrupt.

I have spent some time reading the discussion that happened in the year
2017 (link below). According to my understanding,
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK always relies on CONFIG_SMP, and we
do not have the chance to encounter the issue for x86.

https://lore.kernel.org/all/alpine.DEB.2.20.1710042208400.2406@nanos/T/#u

I have tested the new patch for a while and never encountered any issue.
Therefore, I will send v2.

Thank you very much for all the suggestions!

Dongli Zhang
Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline
On Wed, May 15 2024 at 12:51, Dongli Zhang wrote:
> On 5/13/24 3:46 PM, Thomas Gleixner wrote:
>> So yes, moving the invocation of irq_force_complete_move() before the
>> irq_needs_fixup() call makes sense, but it wants this to actually work
>> correctly:
>>
>> @@ -1097,10 +1098,11 @@ void irq_force_complete_move(struct irq_
>>  		goto unlock;
>>
>>  	/*
>> -	 * If prev_vector is empty, no action required.
>> +	 * If prev_vector is empty or the descriptor was previously
>> +	 * not on the outgoing CPU no action required.
>>  	 */
>>  	vector = apicd->prev_vector;
>> -	if (!vector)
>> +	if (!vector || apicd->prev_cpu != smp_processor_id())
>>  		goto unlock;
>>
>
> The above may not work. migrate_one_irq() relies on
> irq_force_complete_move() to always reclaim the apicd->prev_vector.
> Otherwise, the call of irq_do_set_affinity() later may return -EBUSY.

You're right. But that still can be handled in irq_force_complete_move()
with a single unconditional invocation in migrate_one_irq():

    cpu = smp_processor_id();

    if (!vector || (apicd->cur_cpu != cpu && apicd->prev_cpu != cpu))
        goto unlock;

because there are only two cases when a cleanup is required:

 1) The outgoing CPU is the current target

 2) The outgoing CPU was the previous target

No?

Thanks,

        tglx
Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline
On 5/13/24 3:46 PM, Thomas Gleixner wrote:
> On Mon, May 13 2024 at 10:43, Dongli Zhang wrote:
>> On 5/13/24 5:44 AM, Thomas Gleixner wrote:
>>> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>>> Any interrupt which is affine to an outgoing CPU is migrated and
>>> eventually pending moves are enforced:
>>>
>>>     cpu_down()
>>>       ...
>>>         cpu_disable_common()
>>>           fixup_irqs()
>>>             irq_migrate_all_off_this_cpu()
>>>               migrate_one_irq()
>>>                 irq_force_complete_move()
>>>                   free_moved_vector();
>>>
>>> No?
>>
>> I noticed this and finally abandoned the solution to fix at
>> migrate_one_irq(), because:
>>
>> 1. The objective of migrate_one_irq()-->irq_force_complete_move() looks
>> to be cleanup before irq_do_set_affinity().
>>
>> 2. The irq_needs_fixup() may return false so that
>> irq_force_complete_move() does not get the chance to trigger.
>>
>> 3. Even if irq_force_complete_move() is triggered, it exits early if
>> apicd->prev_vector==0.
>
> But that's not the case, really.
>
>> The apicd->prev_vector can be cleared by __vector_schedule_cleanup()
>> because cpu_disable_common() releases the vector_lock after the CPU is
>> flagged offline.
>
> Nothing can schedule vector cleanup at that point because _all_ other
> CPUs spin in stop_machine() with interrupts disabled and therefore
> cannot handle interrupts which might invoke it.

Thank you very much for the explanation! Now I see why dropping the
vector_lock is not an issue.

[snip]

>
>> diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
>> index 1ed2b1739363..5ecd072a34fe 100644
>> --- a/kernel/irq/cpuhotplug.c
>> +++ b/kernel/irq/cpuhotplug.c
>> @@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
>>  		return false;
>>  	}
>>
>> +	/*
>> +	 * Complete an eventually pending irq move cleanup. If this
>> +	 * interrupt was moved in hard irq context, then the vectors need
>> +	 * to be cleaned up. It can't wait until this interrupt actually
>> +	 * happens and this CPU was involved.
>> +	 */
>> +	irq_force_complete_move(desc);
>
> You cannot do that here because it is only valid when the interrupt is
> affine to the outgoing CPU.
>
> In the problem case the interrupt was affine to the outgoing CPU, but
> the core code does not know that it has not been cleaned up yet. It does
> not even know that the interrupt was affine to the outgoing CPU before.
>
> So in principle we could just go and do:
>
>  	} else {
> -		apicd->prev_vector = 0;
> +		free_moved_vector(apicd);
>  	}
>  	raw_spin_unlock(&vector_lock);
>
> but that would not give enough information and would depend on the
> interrupt to actually arrive.
>
> irq_force_complete_move() already describes this case, but obviously it
> does not work because the core code does not invoke it in that
> situation.
>
> So yes, moving the invocation of irq_force_complete_move() before the
> irq_needs_fixup() call makes sense, but it wants this to actually work
> correctly:
>
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -1036,7 +1036,8 @@ static void __vector_schedule_cleanup(st
>  			add_timer_on(&cl->timer, cpu);
>  		}
>  	} else {
> -		apicd->prev_vector = 0;
> +		pr_warn("IRQ %u schedule cleanup for offline CPU %u\n", apicd->irq, cpu);
> +		free_moved_vector(apicd);
>  	}
>  	raw_spin_unlock(&vector_lock);
>  }
> @@ -1097,10 +1098,11 @@ void irq_force_complete_move(struct irq_
>  		goto unlock;
>
>  	/*
> -	 * If prev_vector is empty, no action required.
> +	 * If prev_vector is empty or the descriptor was previously
> +	 * not on the outgoing CPU no action required.
>  	 */
>  	vector = apicd->prev_vector;
> -	if (!vector)
> +	if (!vector || apicd->prev_cpu != smp_processor_id())
>  		goto unlock;
>

The above may not work. migrate_one_irq() relies on
irq_force_complete_move() to always reclaim the apicd->prev_vector.
Otherwise, the call of irq_do_set_affinity() later may return -EBUSY.

801 	/*
802 	 * Careful here. @apicd might either have move_in_progress set or
803 	 * be enqueued for cleanup. Assigning a new vector would either
804 	 * leave a stale vector on some CPU around or in case of a pending
805 	 * cleanup corrupt the hlist.
806 	 */
807 	if (apicd->move_in_progress || !hlist_unhashed(&apicd->clist))
808 		return -EBUSY;

Or maybe add a flag to control whether to enforce a cleanup in any
condition?

-void irq_force_complete_move(struct irq_desc *desc)
+void irq_force_complete_move(struct irq_desc *desc, bool always_force)
 {
 	struct apic_chip_data *apicd;
 	struct irq_data *irqd;
@@ -1102,6 +1103,9 @@ void irq_force_complete_move(struct irq_desc *desc)
 	if (!vector)
 		goto unlock;
Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline
On Mon, May 13 2024 at 10:43, Dongli Zhang wrote:
> On 5/13/24 5:44 AM, Thomas Gleixner wrote:
>> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>> Any interrupt which is affine to an outgoing CPU is migrated and
>> eventually pending moves are enforced:
>>
>>     cpu_down()
>>       ...
>>         cpu_disable_common()
>>           fixup_irqs()
>>             irq_migrate_all_off_this_cpu()
>>               migrate_one_irq()
>>                 irq_force_complete_move()
>>                   free_moved_vector();
>>
>> No?
>
> I noticed this and finally abandoned the solution to fix at
> migrate_one_irq(), because:
>
> 1. The objective of migrate_one_irq()-->irq_force_complete_move() looks
> to be cleanup before irq_do_set_affinity().
>
> 2. The irq_needs_fixup() may return false so that
> irq_force_complete_move() does not get the chance to trigger.
>
> 3. Even if irq_force_complete_move() is triggered, it exits early if
> apicd->prev_vector==0.

But that's not the case, really.

> The apicd->prev_vector can be cleared by __vector_schedule_cleanup()
> because cpu_disable_common() releases the vector_lock after the CPU is
> flagged offline.

Nothing can schedule vector cleanup at that point because _all_ other
CPUs spin in stop_machine() with interrupts disabled and therefore
cannot handle interrupts which might invoke it. So it does not matter
whether the vector lock is dropped or not in cpu_disable_common().

> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -1035,8 +1035,6 @@ static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
>  			cl->timer.expires = jiffies + 1;
>  			add_timer_on(&cl->timer, cpu);
>  		}
> -	} else {
> -		apicd->prev_vector = 0; // or print a warning

This really wants to be a warning.

>> In fact irq_complete_move() should never see apicd->move_in_progress
>> with apicd->prev_cpu pointing to an offline CPU.
>
> I think it is possible. The fact that a CPU is offline doesn't indicate
> fixup_irqs() has already been triggered. The vector_lock is released
> after the CPU is flagged offline.

No.

    stop_machine()
      _ALL_ CPUs rendezvous and spin with interrupts disabled

    Outgoing CPU invokes cpu_disable_common()

So it does not matter at all whether vector lock is dropped before
fixup_irqs() is invoked. The new target CPU _cannot_ handle the
interrupt at that point and invoke irq_complete_move().

>> If you can trigger that case, then there is something fundamentally
>> wrong with the CPU hotplug interrupt migration code and that needs to
>> be investigated and fixed.
>
> I can easily reproduce the issue.

Good.

> I will fix in the interrupt migration code.

You need a proper explanation for the problem first otherwise you can't
fix it. I understand the failure mode by now. What happens is:

 1) Interrupt is affine to CPU11

 2) Affinity is set to CPU10

 3) Interrupt is raised and handled on CPU11

      irq_move_masked_irq()
        irq_do_set_affinity()
          apicd->prev_cpu = 11;
          apicd->move_in_progress = true;

 4) CPU11 goes offline

    irq_needs_fixup() returns false because effective affinity points
    already to CPU 10, so irq_force_complete_move() is not invoked.

 5) Interrupt is raised and handled on CPU10

      irq_complete_move()
        __vector_schedule_cleanup()
          if (cpu_online(apicd->prev_cpu))  <- observes offline

See? So this has nothing to do with vector lock being dropped.

> diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
> index 1ed2b1739363..5ecd072a34fe 100644
> --- a/kernel/irq/cpuhotplug.c
> +++ b/kernel/irq/cpuhotplug.c
> @@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
>  		return false;
>  	}
>
> +	/*
> +	 * Complete an eventually pending irq move cleanup. If this
> +	 * interrupt was moved in hard irq context, then the vectors need
> +	 * to be cleaned up. It can't wait until this interrupt actually
> +	 * happens and this CPU was involved.
> +	 */
> +	irq_force_complete_move(desc);

You cannot do that here because it is only valid when the interrupt is
affine to the outgoing CPU.

In the problem case the interrupt was affine to the outgoing CPU, but
the core code does not know that it has not been cleaned up yet. It does
not even know that the interrupt was affine to the outgoing CPU before.

So in principle we could just go and do:

 	} else {
-		apicd->prev_vector = 0;
+		free_moved_vector(apicd);
 	}
 	raw_spin_unlock(&vector_lock);

but that would not give enough information and would depend on the
interrupt to actually arrive.

irq_force_complete_move() already describes this case, but obviously it
does not work because the core code does not invoke it in that
situation.

So yes, moving the invocation of irq_force_complete_move() before the
irq_needs_fixup() call makes sense, but it wants this to actually work
correctly:

---
Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline
Hi Thomas,

On 5/13/24 5:44 AM, Thomas Gleixner wrote:
> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>> The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
>> interrupt affinity reconfiguration via procfs. Instead, the change is
>> deferred until the next instance of the interrupt being triggered on the
>> original CPU.
>>
>> When the interrupt next triggers on the original CPU, the new affinity is
>> enforced within __irq_move_irq(). A vector is allocated from the new CPU,
>> but if the old vector on the original CPU remains online, it is not
>> immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the
>> reclaiming process is delayed until the next trigger of the interrupt on
>> the new CPU.
>>
>> Upon the subsequent triggering of the interrupt on the new CPU,
>> irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
>> remains online. Subsequently, the timer on the old CPU iterates over its
>> vector_cleanup list, reclaiming vectors.
>>
>> However, if the old CPU is offline before the interrupt triggers again on
>> the new CPU, irq_complete_move() simply resets both apicd->move_in_progress
>> and apicd->prev_vector to 0. Consequently, the vector remains unreclaimed
>> in vector_matrix, resulting in a CPU vector leak.
>
> I doubt that.
>
> Any interrupt which is affine to an outgoing CPU is migrated and
> eventually pending moves are enforced:
>
>     cpu_down()
>       ...
>         cpu_disable_common()
>           fixup_irqs()
>             irq_migrate_all_off_this_cpu()
>               migrate_one_irq()
>                 irq_force_complete_move()
>                   free_moved_vector();
>
> No?

I noticed this and finally abandoned the solution to fix at
migrate_one_irq(), because:

1. The objective of migrate_one_irq()-->irq_force_complete_move() looks
to be cleanup before irq_do_set_affinity().

2. The irq_needs_fixup() may return false so that
irq_force_complete_move() does not get the chance to trigger.

3. Even if irq_force_complete_move() is triggered, it exits early if
apicd->prev_vector==0.

The apicd->prev_vector can be cleared by __vector_schedule_cleanup()
because cpu_disable_common() releases the vector_lock after the CPU is
flagged offline.

void cpu_disable_common(void)
{
	int cpu = smp_processor_id();

	remove_siblinginfo(cpu);

	/* It's now safe to remove this processor from the online map */
	lock_vector_lock();
	remove_cpu_from_maps(cpu);   ---> CPU is flagged offline
	unlock_vector_lock();        ---> release the vector_lock here!
	fixup_irqs();
	lapic_offline();
}

Therefore, the bugfix may become something like (just to demo the idea):

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 185738c72766..247a53fe9ada 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -1035,8 +1035,6 @@ static void __vector_schedule_cleanup(struct apic_chip_data *apicd)
 			cl->timer.expires = jiffies + 1;
 			add_timer_on(&cl->timer, cpu);
 		}
-	} else {
-		apicd->prev_vector = 0; // or print a warning
 	}
 	raw_spin_unlock(&vector_lock);
 }
diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 1ed2b1739363..5ecd072a34fe 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
 		return false;
 	}

+	/*
+	 * Complete an eventually pending irq move cleanup. If this
+	 * interrupt was moved in hard irq context, then the vectors need
+	 * to be cleaned up. It can't wait until this interrupt actually
+	 * happens and this CPU was involved.
+	 */
+	irq_force_complete_move(desc);
+
 	/*
 	 * No move required, if:
 	 * - Interrupt is per cpu
@@ -87,14 +95,6 @@ static bool migrate_one_irq(struct irq_desc *desc)
 		return false;
 	}

-	/*
-	 * Complete an eventually pending irq move cleanup. If this
-	 * interrupt was moved in hard irq context, then the vectors need
-	 * to be cleaned up. It can't wait until this interrupt actually
-	 * happens and this CPU was involved.
-	 */
-	irq_force_complete_move(desc);
-
 	/*
 	 * If there is a setaffinity pending, then try to reuse the pending
 	 * mask, so the last change of the affinity does not get lost. If

That's why I modified only the __vector_schedule_cleanup() as it looked
simple.

I will fix in the CPU hotplug interrupt migration code.

>
> In fact irq_complete_move() should never see apicd->move_in_progress
> with apicd->prev_cpu pointing to an offline CPU.

I think it is possible. The fact that a CPU is offline doesn't indicate
fixup_irqs() has already been triggered. The vector_lock is released
after the CPU is flagged offline.

>
> The CPU offline case in __vector_schedule_cleanup() should not even
> exist or at least just emit a warning.
>
Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline
On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
> The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
> interrupt affinity reconfiguration via procfs. Instead, the change is
> deferred until the next instance of the interrupt being triggered on the
> original CPU.
>
> When the interrupt next triggers on the original CPU, the new affinity is
> enforced within __irq_move_irq(). A vector is allocated from the new CPU,
> but if the old vector on the original CPU remains online, it is not
> immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the
> reclaiming process is delayed until the next trigger of the interrupt on
> the new CPU.
>
> Upon the subsequent triggering of the interrupt on the new CPU,
> irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
> remains online. Subsequently, the timer on the old CPU iterates over its
> vector_cleanup list, reclaiming vectors.
>
> However, if the old CPU is offline before the interrupt triggers again on
> the new CPU, irq_complete_move() simply resets both apicd->move_in_progress
> and apicd->prev_vector to 0. Consequently, the vector remains unreclaimed
> in vector_matrix, resulting in a CPU vector leak.

I doubt that.

Any interrupt which is affine to an outgoing CPU is migrated and
eventually pending moves are enforced:

    cpu_down()
      ...
        cpu_disable_common()
          fixup_irqs()
            irq_migrate_all_off_this_cpu()
              migrate_one_irq()
                irq_force_complete_move()
                  free_moved_vector();

No?

In fact irq_complete_move() should never see apicd->move_in_progress
with apicd->prev_cpu pointing to an offline CPU.

The CPU offline case in __vector_schedule_cleanup() should not even
exist or at least just emit a warning.

If you can trigger that case, then there is something fundamentally
wrong with the CPU hotplug interrupt migration code and that needs to be
investigated and fixed.

Thanks,

        tglx
Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline
On 5/10/24 12:06, Dongli Zhang wrote:
>  	} else {
> +		/*
> +		 * This call borrows from the comments and implementation
> +		 * of apic_update_vector(): "If the target CPU is offline
> +		 * then the regular release mechanism via the cleanup
> +		 * vector is not possible and the vector can be immediately
> +		 * freed in the underlying matrix allocator.".
> +		 */
> +		irq_matrix_free(vector_matrix, apicd->prev_cpu,
> +				apicd->prev_vector, apicd->is_managed);
>  		apicd->prev_vector = 0;
>  	}

I know it's just two sites, but I'd much rather spend the space on a
helper function than a copy-and-pasted comment. Wouldn't something like
this make it like stupidly obvious what's going on:

	if (cpu_online(cpu)) {
		...
	} else {
		irq_matrix_free_offline(apicd->prev_cpu,
					apicd->prev_vector,
					apicd->is_managed);
		apicd->prev_vector = 0;
	}

/* Free a vector when the target CPU is offline */
static void irq_matrix_free_offline(...)
{
	lockdep_assert_held(&vector_lock);
	WARN_ON_ONCE(!cpu_offline(apicd->prev_cpu));

	/*
	 * The regular release mechanism via the cleanup vector is not
	 * possible. Immediately free the vector in the underlying
	 * matrix allocator.
	 */
	irq_matrix_free(vector_matrix, cpu, vector, managed);
}

It would also be rather hard to screw up even if someone called it on an
online CPU because you'd get a nice warning.