Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline

2024-05-22 Thread Dongli Zhang



On 5/21/24 5:00 AM, Thomas Gleixner wrote:
> On Wed, May 15 2024 at 12:51, Dongli Zhang wrote:
>> On 5/13/24 3:46 PM, Thomas Gleixner wrote:
>>> So yes, moving the invocation of irq_force_complete_move() before the
>>> irq_needs_fixup() call makes sense, but it wants this to actually work
>>> correctly:
>>> @@ -1097,10 +1098,11 @@ void irq_force_complete_move(struct irq_
>>> goto unlock;
>>>  
>>> /*
>>> -* If prev_vector is empty, no action required.
>>> +* If prev_vector is empty or the descriptor was previously
>>> +* not on the outgoing CPU no action required.
>>>  */
>>> vector = apicd->prev_vector;
>>> -   if (!vector)
>>> +   if (!vector || apicd->prev_cpu != smp_processor_id())
>>> goto unlock;
>>>  
>>
>> The above may not work. migrate_one_irq() relies on 
>> irq_force_complete_move() to
>> always reclaim the apicd->prev_vector. Otherwise, the call of
>> irq_do_set_affinity() later may return -EBUSY.
> 
> You're right. But that still can be handled in irq_force_complete_move()
> with a single unconditional invocation in migrate_one_irq():
> 
>   cpu = smp_processor_id();
>   if (!vector || (apicd->cur_cpu != cpu && apicd->prev_cpu != cpu))
>   goto unlock;

The current affine is apicd->cpu :)

Thank you very much for the suggestion!

> 
> because there are only two cases when a cleanup is required:
> 
>1) The outgoing CPU is the current target
> 
>2) The outgoing CPU was the previous target
> 
> No?

I agree with this statement.

My only concern is: while we use "apicd->cpu", the irq_needs_fixup() uses a
different way. It uses d->common->effective_affinity or d->common->affinity to
decide whether to move forward to migrate the interrupt.

I have spent some time reading about the discussion that happened in the year
2017 (below link). According to my understanding,
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK always relies on CONFIG_SMP, and we do not
have the chance to encounter the issue for x86.

https://lore.kernel.org/all/alpine.DEB.2.20.1710042208400.2406@nanos/T/#u

I have tested the new patch for a while and never encountered any issue.

Therefore, I will send v2.

Thank you very much for all suggestions!

Dongli Zhang



Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline

2024-05-21 Thread Thomas Gleixner
On Wed, May 15 2024 at 12:51, Dongli Zhang wrote:
> On 5/13/24 3:46 PM, Thomas Gleixner wrote:
>> So yes, moving the invocation of irq_force_complete_move() before the
>> irq_needs_fixup() call makes sense, but it wants this to actually work
>> correctly:
>> @@ -1097,10 +1098,11 @@ void irq_force_complete_move(struct irq_
>>  goto unlock;
>>  
>>  /*
>> - * If prev_vector is empty, no action required.
>> + * If prev_vector is empty or the descriptor was previously
>> + * not on the outgoing CPU no action required.
>>   */
>>  vector = apicd->prev_vector;
>> -if (!vector)
>> +if (!vector || apicd->prev_cpu != smp_processor_id())
>>  goto unlock;
>>  
>
> The above may not work. migrate_one_irq() relies on irq_force_complete_move() 
> to
> always reclaim the apicd->prev_vector. Otherwise, the call of
> irq_do_set_affinity() later may return -EBUSY.

You're right. But that still can be handled in irq_force_complete_move()
with a single unconditional invocation in migrate_one_irq():

cpu = smp_processor_id();
if (!vector || (apicd->cur_cpu != cpu && apicd->prev_cpu != cpu))
goto unlock;

because there are only two cases when a cleanup is required:

   1) The outgoing CPU is the current target

   2) The outgoing CPU was the previous target

No?

Thanks,

tglx



Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline

2024-05-15 Thread Dongli Zhang



On 5/13/24 3:46 PM, Thomas Gleixner wrote:
> On Mon, May 13 2024 at 10:43, Dongli Zhang wrote:
>> On 5/13/24 5:44 AM, Thomas Gleixner wrote:
>>> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>>> Any interrupt which is affine to an outgoing CPU is migrated and
>>> eventually pending moves are enforced:
>>>
>>> cpu_down()
>>>   ...
>>>   cpu_disable_common()
>>> fixup_irqs()
>>>   irq_migrate_all_off_this_cpu()
>>> migrate_one_irq()
>>>   irq_force_complete_move()
>>> free_moved_vector();
>>>
>>> No?
>>
>> I noticed this and finally abandoned the solution to fix at 
>> migrate_one_irq(),
>> because:
>>
>> 1. The objective of migrate_one_irq()-->irq_force_complete_move() looks to
>> cleanup before irq_do_set_affinity().
>>
>> 2. The irq_needs_fixup() may return false so that irq_force_complete_move() 
>> does
>> not get the chance to trigger.
>>
>> 3. Even irq_force_complete_move() is triggered, it exits early if
>> apicd->prev_vector==0.
> 
> But that's not the case, really.
> 
>> The apicd->prev_vector can be cleared by __vector_schedule_cleanup() because
>> cpu_disable_common() releases the vector_lock after CPU is flagged offline.
> 
> Nothing can schedule vector cleanup at that point because _all_ other
> CPUs spin in stop_machine() with interrupts disabled and therefore
> cannot handle interrupts which might invoke it.

Thank you very much for the explanation! Now I see why to drop the vector_lock
is not an issue.


[snip]


> 
>> diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
>> index 1ed2b1739363..5ecd072a34fe 100644
>> --- a/kernel/irq/cpuhotplug.c
>> +++ b/kernel/irq/cpuhotplug.c
>> @@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
>> return false;
>> }
>>
>> +   /*
>> +* Complete an eventually pending irq move cleanup. If this
>> +* interrupt was moved in hard irq context, then the vectors need
>> +* to be cleaned up. It can't wait until this interrupt actually
>> +* happens and this CPU was involved.
>> +*/
>> +   irq_force_complete_move(desc);
> 
> You cannot do that here because it is only valid when the interrupt is
> affine to the outgoing CPU.
> 
> In the problem case the interrupt was affine to the outgoing CPU, but
> the core code does not know that it has not been cleaned up yet. It does
> not even know that the interrupt was affine to the outgoing CPU before.
> 
> So in principle we could just go and do:
> 
>   } else {
> - apicd->prev_vector = 0;
> + free_moved_vector(apicd);
>   }
>   raw_spin_unlock(_lock);
> 
> but that would not give enough information and would depend on the
> interrupt to actually arrive.
> 
> irq_force_complete_move() already describes this case, but obviously it
> does not work because the core code does not invoke it in that
> situation.
> 
> So yes, moving the invocation of irq_force_complete_move() before the
> irq_needs_fixup() call makes sense, but it wants this to actually work
> correctly:
> 
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -1036,7 +1036,8 @@ static void __vector_schedule_cleanup(st
>   add_timer_on(>timer, cpu);
>   }
>   } else {
> - apicd->prev_vector = 0;
> + pr_warn("IRQ %u schedule cleanup for offline CPU %u\n", 
> apicd->irq, cpu);
> + free_moved_vector(apicd);
>   }
>   raw_spin_unlock(_lock);
>  }
> @@ -1097,10 +1098,11 @@ void irq_force_complete_move(struct irq_
>   goto unlock;
>  
>   /*
> -  * If prev_vector is empty, no action required.
> +  * If prev_vector is empty or the descriptor was previously
> +  * not on the outgoing CPU no action required.
>*/
>   vector = apicd->prev_vector;
> - if (!vector)
> + if (!vector || apicd->prev_cpu != smp_processor_id())
>   goto unlock;
>  

The above may not work. migrate_one_irq() relies on irq_force_complete_move() to
always reclaim the apicd->prev_vector. Otherwise, the call of
irq_do_set_affinity() later may return -EBUSY.


 801 /*
 802  * Careful here. @apicd might either have move_in_progress set or
 803  * be enqueued for cleanup. Assigning a new vector would either
 804  * leave a stale vector on some CPU around or in case of a pending
 805  * cleanup corrupt the hlist.
 806  */
 807 if (apicd->move_in_progress || !hlist_unhashed(>clist))
 808 return -EBUSY;


Or maybe add a flag to control whether to enforce a cleanup in any conditions?

-void irq_force_complete_move(struct irq_desc *desc)
+void irq_force_complete_move(struct irq_desc *desc, bool always_force)
 {
struct apic_chip_data *apicd;
struct irq_data *irqd;
@@ -1102,6 +1103,9 @@ void irq_force_complete_move(struct irq_desc *desc)
if (!vector)
goto unlock;

Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline

2024-05-13 Thread Thomas Gleixner
On Mon, May 13 2024 at 10:43, Dongli Zhang wrote:
> On 5/13/24 5:44 AM, Thomas Gleixner wrote:
>> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>> Any interrupt which is affine to an outgoing CPU is migrated and
>> eventually pending moves are enforced:
>> 
>> cpu_down()
>>   ...
>>   cpu_disable_common()
>> fixup_irqs()
>>   irq_migrate_all_off_this_cpu()
>> migrate_one_irq()
>>   irq_force_complete_move()
>> free_moved_vector();
>> 
>> No?
>
> I noticed this and finally abandoned the solution to fix at migrate_one_irq(),
> because:
>
> 1. The objective of migrate_one_irq()-->irq_force_complete_move() looks to
> cleanup before irq_do_set_affinity().
>
> 2. The irq_needs_fixup() may return false so that irq_force_complete_move() 
> does
> not get the chance to trigger.
>
> 3. Even irq_force_complete_move() is triggered, it exits early if
> apicd->prev_vector==0.

But that's not the case, really.

> The apicd->prev_vector can be cleared by __vector_schedule_cleanup() because
> cpu_disable_common() releases the vector_lock after CPU is flagged offline.

Nothing can schedule vector cleanup at that point because _all_ other
CPUs spin in stop_machine() with interrupts disabled and therefore
cannot handle interrupts which might invoke it.

So it does not matter whether the vector lock is dropped or not in
cpu_disable_common().

> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -1035,8 +1035,6 @@ static void __vector_schedule_cleanup(struct
> apic_chip_data *apicd)
> cl->timer.expires = jiffies + 1;
> add_timer_on(>timer, cpu);
> }
> -   } else {
> -   apicd->prev_vector = 0; // or print a warning

This really wants to be a warning.

>> In fact irq_complete_move() should never see apicd->move_in_progress
>> with apicd->prev_cpu pointing to an offline CPU.
>
> I think it is possible. The fact that a CPU is offline doesn't indicate
> fixup_irqs() has already been triggered. The vector_lock is released after CPU
> is flagged offline.

No.

stop_machine()
_ALL_ CPUs rendevouz and spin with interrupts disabled

Outgoing CPU invokes cpu_disable_common()

So it does not matter at all whether vector lock is dropped before
fixup_irqs() is invoked. The new target CPU _cannot_ handle the
interrupt at that point and invoke irq_complete_move().

>> If you can trigger that case, then there is something fundamentally
>> wrong with the CPU hotplug interrupt migration code and that needs to be
>> investigated and fixed.
>> 
>
> I can easily reproduce the issue.

Good.

> I will fix in the interrupt migration code.

You need a proper explanation for the problem first otherwise you can't
fix it.

I understand the failure mode by now. What happens is:

1) Interrupt is affine to CPU11

2) Affinity is set to CPU10

3) Interrupt is raised and handled on CPU11

   irq_move_masked_irq()
 irq_do_set_affinity()
   apicd->prev_cpu = 11;
   apicd->move_in_progress = true;

4) CPU11 goes offline

   irq_needs_fixup() returns false because effective affinity points
   already to CPU 10, so irq_force_complete_move() is not invoked.

5) Interrupt is raised and handled on CPU10

   irq_complete_move()
__vector_schedule_cleanup()
   if (cpu_online(apicd->prev_cpu))<- observes offline

See? So this has nothing to do with vector lock being dropped.

> diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
> index 1ed2b1739363..5ecd072a34fe 100644
> --- a/kernel/irq/cpuhotplug.c
> +++ b/kernel/irq/cpuhotplug.c
> @@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
> return false;
> }
>
> +   /*
> +* Complete an eventually pending irq move cleanup. If this
> +* interrupt was moved in hard irq context, then the vectors need
> +* to be cleaned up. It can't wait until this interrupt actually
> +* happens and this CPU was involved.
> +*/
> +   irq_force_complete_move(desc);

You cannot do that here because it is only valid when the interrupt is
affine to the outgoing CPU.

In the problem case the interrupt was affine to the outgoing CPU, but
the core code does not know that it has not been cleaned up yet. It does
not even know that the interrupt was affine to the outgoing CPU before.

So in principle we could just go and do:

} else {
-   apicd->prev_vector = 0;
+   free_moved_vector(apicd);
}
raw_spin_unlock(_lock);

but that would not give enough information and would depend on the
interrupt to actually arrive.

irq_force_complete_move() already describes this case, but obviously it
does not work because the core code does not invoke it in that
situation.

So yes, moving the invocation of irq_force_complete_move() before the
irq_needs_fixup() call makes sense, but it wants this to actually work
correctly:

--- 

Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline

2024-05-13 Thread Dongli Zhang
Hi Thomas,

On 5/13/24 5:44 AM, Thomas Gleixner wrote:
> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>> The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
>> interrupt affinity reconfiguration via procfs. Instead, the change is
>> deferred until the next instance of the interrupt being triggered on the
>> original CPU.
>>
>> When the interrupt next triggers on the original CPU, the new affinity is
>> enforced within __irq_move_irq(). A vector is allocated from the new CPU,
>> but if the old vector on the original CPU remains online, it is not
>> immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the
>> reclaiming process is delayed until the next trigger of the interrupt on
>> the new CPU.
>>
>> Upon the subsequent triggering of the interrupt on the new CPU,
>> irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
>> remains online. Subsequently, the timer on the old CPU iterates over its
>> vector_cleanup list, reclaiming vectors.
>>
>> However, if the old CPU is offline before the interrupt triggers again on
>> the new CPU, irq_complete_move() simply resets both apicd->move_in_progress
>> and apicd->prev_vector to 0. Consequently, the vector remains unreclaimed
>> in vector_matrix, resulting in a CPU vector leak.
> 
> I doubt that.
> 
> Any interrupt which is affine to an outgoing CPU is migrated and
> eventually pending moves are enforced:
> 
> cpu_down()
>   ...
>   cpu_disable_common()
> fixup_irqs()
>   irq_migrate_all_off_this_cpu()
> migrate_one_irq()
>   irq_force_complete_move()
> free_moved_vector();
> 
> No?

I noticed this and finally abandoned the solution to fix at migrate_one_irq(),
because:

1. The objective of migrate_one_irq()-->irq_force_complete_move() looks to
cleanup before irq_do_set_affinity().

2. The irq_needs_fixup() may return false so that irq_force_complete_move() does
not get the chance to trigger.

3. Even irq_force_complete_move() is triggered, it exits early if
apicd->prev_vector==0.

The apicd->prev_vector can be cleared by __vector_schedule_cleanup() because
cpu_disable_common() releases the vector_lock after CPU is flagged offline.

void cpu_disable_common(void)
{
int cpu = smp_processor_id();

remove_siblinginfo(cpu);

/* It's now safe to remove this processor from the online map */
lock_vector_lock();
remove_cpu_from_maps(cpu); ---> CPU is flagged offline
unlock_vector_lock();  ---> release the vector_lock here!
fixup_irqs();
lapic_offline();
}


Therefore, the bugfix may become something like (just to demo the idea):

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 185738c72766..247a53fe9ada 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -1035,8 +1035,6 @@ static void __vector_schedule_cleanup(struct
apic_chip_data *apicd)
cl->timer.expires = jiffies + 1;
add_timer_on(>timer, cpu);
}
-   } else {
-   apicd->prev_vector = 0; // or print a warning
}
raw_spin_unlock(_lock);
 }
diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 1ed2b1739363..5ecd072a34fe 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
return false;
}

+   /*
+* Complete an eventually pending irq move cleanup. If this
+* interrupt was moved in hard irq context, then the vectors need
+* to be cleaned up. It can't wait until this interrupt actually
+* happens and this CPU was involved.
+*/
+   irq_force_complete_move(desc);
+
/*
 * No move required, if:
 * - Interrupt is per cpu
@@ -87,14 +95,6 @@ static bool migrate_one_irq(struct irq_desc *desc)
return false;
}

-   /*
-* Complete an eventually pending irq move cleanup. If this
-* interrupt was moved in hard irq context, then the vectors need
-* to be cleaned up. It can't wait until this interrupt actually
-* happens and this CPU was involved.
-*/
-   irq_force_complete_move(desc);
-
/*
 * If there is a setaffinity pending, then try to reuse the pending
 * mask, so the last change of the affinity does not get lost. If



That's why I modify only the __vector_schedule_cleanup() as it looked simple.

I will fix in the CPU hotplug interrupt migration code.

> 
> In fact irq_complete_move() should never see apicd->move_in_progress
> with apicd->prev_cpu pointing to an offline CPU.

I think it is possible. The fact that a CPU is offline doesn't indicate
fixup_irqs() has already been triggered. The vector_lock is released after CPU
is flagged offline.

> 
> The CPU offline case in __vector_schedule_cleanup() should not even
> exist or at least just emit a warning.
> 
> 

Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline

2024-05-13 Thread Thomas Gleixner
On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
> The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
> interrupt affinity reconfiguration via procfs. Instead, the change is
> deferred until the next instance of the interrupt being triggered on the
> original CPU.
>
> When the interrupt next triggers on the original CPU, the new affinity is
> enforced within __irq_move_irq(). A vector is allocated from the new CPU,
> but if the old vector on the original CPU remains online, it is not
> immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the
> reclaiming process is delayed until the next trigger of the interrupt on
> the new CPU.
>
> Upon the subsequent triggering of the interrupt on the new CPU,
> irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
> remains online. Subsequently, the timer on the old CPU iterates over its
> vector_cleanup list, reclaiming vectors.
>
> However, if the old CPU is offline before the interrupt triggers again on
> the new CPU, irq_complete_move() simply resets both apicd->move_in_progress
> and apicd->prev_vector to 0. Consequently, the vector remains unreclaimed
> in vector_matrix, resulting in a CPU vector leak.

I doubt that.

Any interrupt which is affine to an outgoing CPU is migrated and
eventually pending moves are enforced:

cpu_down()
  ...
  cpu_disable_common()
fixup_irqs()
  irq_migrate_all_off_this_cpu()
migrate_one_irq()
  irq_force_complete_move()
free_moved_vector();

No?

In fact irq_complete_move() should never see apicd->move_in_progress
with apicd->prev_cpu pointing to an offline CPU.

The CPU offline case in __vector_schedule_cleanup() should not even
exist or at least just emit a warning.

If you can trigger that case, then there is something fundamentally
wrong with the CPU hotplug interrupt migration code and that needs to be
investigated and fixed.

Thanks,

tglx




Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline

2024-05-10 Thread Dave Hansen
On 5/10/24 12:06, Dongli Zhang wrote:
>   } else {
> + /*
> +  * This call borrows from the comments and implementation
> +  * of apic_update_vector(): "If the target CPU is offline
> +  * then the regular release mechanism via the cleanup
> +  * vector is not possible and the vector can be immediately
> +  * freed in the underlying matrix allocator.".
> +  */
> + irq_matrix_free(vector_matrix, apicd->prev_cpu,
> + apicd->prev_vector, apicd->is_managed);
>   apicd->prev_vector = 0;
>   }

I know it's just two sites, but I'd much rather spend the space on a
helper function than a copy-and-pasted comment.  Wouldn't something like
this make it like stupidly obvious what's going on:

if (cpu_online(cpu)) {
...
} else {
irq_matrix_free_offline(apicd->prev_cpu,
apicd->prev_vector,
apicd->is_managed);
apicd->prev_vector = 0;
}

/* Free a vector when the target CPU is offline */
static void irq_matrix_free_offline(...)
{
lockdep_assert_held(_lock);
WARN_ON_ONCE(!cpu_offline(apicd->prev_cpu));

/*
 * The regular release mechanism via the cleanup vector is not
 * possible.  Immediately free the vector in the underlying
 * matrix allocator.
 */
irq_matrix_free(, cpu, vector, managed);
}

It would also be rather hard to screw up even if someone called it on an
online CPU because you'd get a nice warning.