On 19.06.2024 10:32, Roger Pau Monné wrote:
> On Wed, Jun 19, 2024 at 09:24:41AM +0200, Jan Beulich wrote:
>> On 19.06.2024 09:05, Roger Pau Monné wrote:
>>> On Tue, Jun 18, 2024 at 06:30:22PM +0200, Jan Beulich wrote:
>>>> On 18.06.2024 16:50, Roger Pau Monné wrote:
>>>>> On Tue, Jun 18, 2024 at 04:34:50PM +0200, Jan Beulich wrote:
>>>>>> On 18.06.2024 13:30, Roger Pau Monné wrote:
>>>>>>> On Mon, Jun 17, 2024 at 03:41:12PM +0200, Jan Beulich wrote:
>>>>>>>> On 13.06.2024 18:56, Roger Pau Monne wrote:
>>>>>>>>> @@ -2686,11 +2705,27 @@ void fixup_irqs(const cpumask_t *mask, bool 
>>>>>>>>> verbose)
>>>>>>>>>          if ( desc->handler->disable )
>>>>>>>>>              desc->handler->disable(desc);
>>>>>>>>>  
>>>>>>>>> +        /*
>>>>>>>>> +         * If the current CPU is going offline and is (one of) the 
>>>>>>>>> target(s) of
>>>>>>>>> +         * the interrupt, signal to check whether there are any 
>>>>>>>>> pending vectors
>>>>>>>>> +         * to be handled in the local APIC after the interrupt has 
>>>>>>>>> been moved.
>>>>>>>>> +         */
>>>>>>>>> +        if ( !cpu_online(cpu) && cpumask_test_cpu(cpu, 
>>>>>>>>> desc->arch.cpu_mask) )
>>>>>>>>> +            check_irr = true;
>>>>>>>>> +
>>>>>>>>>          if ( desc->handler->set_affinity )
>>>>>>>>>              desc->handler->set_affinity(desc, affinity);
>>>>>>>>>          else if ( !(warned++) )
>>>>>>>>>              set_affinity = false;
>>>>>>>>>  
>>>>>>>>> +        if ( check_irr && apic_irr_read(vector) )
>>>>>>>>> +            /*
>>>>>>>>> +             * Forward pending interrupt to the new destination, 
>>>>>>>>> this CPU is
>>>>>>>>> +             * going offline and otherwise the interrupt would be 
>>>>>>>>> lost.
>>>>>>>>> +             */
>>>>>>>>> +            
>>>>>>>>> send_IPI_mask(cpumask_of(cpumask_any(desc->arch.cpu_mask)),
>>>>>>>>> +                          desc->arch.vector);
>>>>>>>>
>>>>>>>> Hmm, IRR may become set right after the IRR read (unlike in the other 
>>>>>>>> cases,
>>>>>>>> where new IRQs ought to be surfacing only at the new destination). 
>>>>>>>> Doesn't
>>>>>>>> this want moving ...
>>>>>>>>
>>>>>>>>>          if ( desc->handler->enable )
>>>>>>>>>              desc->handler->enable(desc);
>>>>>>>>
>>>>>>>> ... past the actual affinity change?
>>>>>>>
>>>>>>> Hm, but the ->enable() hook is just unmasking the interrupt, the
>>>>>>> actual affinity change is done in ->set_affinity(), and hence after
>>>>>>> the call to ->set_affinity() no further interrupts should be delivered
>>>>>>> to the CPU regardless of whether the source is masked?
>>>>>>>
>>>>>>> Or is it possible for the device/interrupt controller to not switch to
>>>>>>> use the new destination until the interrupt is unmasked, and hence
>>>>>>> could have pending masked interrupts still using the old destination?
>>>>>>> IIRC For MSI-X it's required that the device updates the destination
>>>>>>> target once the entry is unmasked.
>>>>>>
>>>>>> That's all not relevant here, I think. IRR gets set when an interrupt is
>>>>>> signaled, no matter whether it's masked.
>>>>>
>>>>> I'm kind of lost here, what does signaling mean in this context?
>>>>>
>>>>> I would expect the interrupt vector to not get set in IRR if the MSI-X
>>>>> entry is masked, as at that point the state of the address/data fields
>>>>> might not be consistent (that's the whole point of masking it right?)
>>>>>
>>>>>> It's its handling which the
>>>>>> masking would prevent, i.e. the "moving" of the set bit from IRR to ISR.
>>>>>
>>>>> My understanding was that the masking would prevent the message write to
>>>>> the APIC from happening, and hence no vector should get set in IRR.
>>>>
>>>> Hmm, yes, looks like I was confused. The masking is at the source side
>>>> (IO-APIC RTE, MSI-X entry, or - if supported - in the MSI capability).
>>>> So the sole case to worry about is MSI without mask-bit support then.
>>>
>>> Yeah, and for MSI without masking bit support we don't care doing the
>>> IRR check before or after the ->enable() hook, as that's a no-op in
>>> that case.  The write to the MSI address/data fields has already been
>>> done, and hence the issue would be exclusively with draining any
>>> in-flight writes to the APIC doorbell (what you mention below).
>>
>> Except that both here ...
>>
>>>>>> Plus we run with IRQs off here anyway if I'm not mistaken, so no
>>>>>> interrupt can be delivered to the local CPU. IOW whatever IRR bits it
>>>>>> has set (including ones becoming set between the IRR read and the actual
>>>>>> vector change), those would never be serviced. Hence the reading of the
>>>>>> bit ought to occur after the vector change: It's only then that we know
>>>>>> the IRR bit corresponding to the old vector can't become set anymore.
>>>>>
>>>>> Right, and the vector change happens in ->set_affinity(), not
>>>>> ->enable().  See for example set_msi_affinity() and the
>>>>> write_msi_msg(), that's where the vector gets changed.
>>>>>
>>>>>> And even then we're assuming that no interrupt signals might still be
>>>>>> "on their way" from the IO-APIC or a posted MSI message write by a
>>>>>> device to the LAPIC (I have no idea how to properly fence that, or
>>>>>> whether there are guarantees for this to never occur).
>>>>>
>>>>> Yeah, those I expect would be completed in the window between the
>>>>> write of the new vector/destination and the reading of IRR.
>>>>
>>>> Except we have no idea on the latencies.
>>>
>>> There isn't much else we can do? Even the current case where we add
>>> the 1ms window at the end of the shuffling could still suffer from
>>> this issue because we don't know the latencies.  IOW: I don't think
>>> this is any worse than the current approach.
>>
>> ... and here, the later we read IRR, the better the chances we don't miss
>> anything. Even the no-op ->enable() isn't a no-op execution-wise. In fact
>> it (quite pointlessly[1]) is an indirect call to irq_enable_none(). I'm
>> actually inclined to suggest that we try to even further delay the IRR
>> read, certainly past the cpumask_copy(), maybe even past the spin_unlock()
>> (latching CPU and vector into local variables, along with the latching of
>> ->affinity that's already there).
> 
> Moving past cpumask_copy() would be OK.  Moving past the spin_unlock()
> I'm not so sure.  Isn't it possible that once the desc is unlocked the
> interrupt starts another movement, and hence by the time the forwarded
> vector is injected the irq desc has already moved to yet a different
> CPU?
> 
> I don't think this is realistic given the window between the
> spin_unlock() and the injection of the vector, but could be
> possible.

Hmm, yes, in principle this is possible, if we ignore the fact that bringing
down a CPU is done in stop-machine context (i.e. IRQs are off everywhere and
initiating further movement being impossible). But perhaps indeed better not
to bake in dependencies on that.

> If you are fine with moving past the cpumask_copy() but before the
> spin_unlock() I can post an updated version with that.

Yes, please do.

Jan

Reply via email to