Re: [PATCH] x86_64 irq: check remote IRR bit before migrating level triggered irq (v2)

2007-05-31 Thread Siddha, Suresh B
On Thu, May 31, 2007 at 07:29:43AM -0600, Eric W. Biederman wrote:
> Andi Kleen <[EMAIL PROTECTED]> writes:
> 
> > On Thursday 31 May 2007 13:34:21 Eric W. Biederman wrote:
> >> 
> >> On x86_64 kernel, level triggered irq migration gets initiated in the 
> >> context
> >> of that interrupt(after executing the irq handler) and following steps are
> >> followed to do the irq migration.
> >
> > What's the confidence level in the patch? Is it a .22 candidate?
> 
> Yes.
> 
> I just ran it for a little while and nothing fell over, and level
> triggered irqs continued to migrate.  So it's not horribly broken.
> 
> Outside of implementation errors the worst case that can happen with
> this code is the current behavior, with a small slow down.
> 
> So since I can not reproduce the original problem I would appreciate
> an eyeball or two more just to make certain I reimplemented this
> correctly and didn't do anything stupid.   But I don't expect anything
> to be found.  The code is pretty straight forward.

Yes. We want this to get included in .22. We did extensive testing with
our patch before.  We will also do more tests with this version, however.

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64 irq: check remote IRR bit before migrating level triggered irq (v3)

2007-05-31 Thread Siddha, Suresh B
On Thu, May 31, 2007 at 07:50:58AM -0600, Eric W. Biederman wrote:
> 
> On x86_64 kernel, level triggered irq migration gets initiated in the context
> of that interrupt(after executing the irq handler) and following steps are
> followed to do the irq migration.
> 
> 1. mask IOAPIC RTE entry; // write to IOAPIC RTE
> 2. EOI;   // processor EOI write
> 3. reprogram IOAPIC RTE entry // write to IOAPIC RTE with new destination and
>   // and interrupt vector due to per cpu vector
>   // allocation.
> 4. unmask IOAPIC RTE entry;   // write to IOAPIC RTE
> 
> Because of the per cpu vector allocation in x86_64 kernels, when the irq
> migrates to a different cpu, new vector(corresponding to the new cpu) will
> get allocated.
> 
> An EOI write to local APIC has a side effect of generating an EOI write
> for level trigger interrupts (normally this is a broadcast to all IOAPICs).
> The EOI broadcast generated as a side effect of EOI write to processor may
> be delayed while the other IOAPIC writes (step 3 and 4) can go through.
> 
> Normally, the EOI generated by local APIC for level trigger interrupt
> contains vector number. The IOAPIC will take this vector number and
> search the IOAPIC RTE entries for an entry with matching vector number and
> clear the remote IRR bit (indicate EOI). However, if the vector number is
> changed (as in step 3) the IOAPIC will not find the RTE entry when the EOI
> is received later. This will cause the remote IRR to get stuck causing the
> interrupt hang (no more interrupt from this RTE).
> 
> Current x86_64 kernel assumes that remote IRR bit is cleared by the time
> IOAPIC RTE is reprogrammed. Fix this assumption by checking for remote IRR
> bit and if it still set, delay the irq migration to the next interrupt
> arrival event(hopefully, next time remote IRR bit will get cleared
> before the IOAPIC RTE is reprogrammed).
> 
> Initial analysis and patch from Nanhai.
> 
> Clean up patch from Suresh.
> 
> Rewritten to be less intrusive, and to contain a big fat comment by Eric.

Acked-by: Suresh Siddha <[EMAIL PROTECTED]>

Thanks Eric.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64 irq: check remote IRR bit before migrating level triggered irq (v3)

2007-05-31 Thread Ingo Molnar

* Eric W. Biederman <[EMAIL PROTECTED]> wrote:

> Initial analysis and patch from Nanhai.
> 
> Clean up patch from Suresh.
> 
> Rewritten to be less intrusive, and to contain a big fat comment by 
> Eric.

thanks Erik!

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64 irq: check remote IRR bit before migrating level triggered irq (v2)

2007-05-31 Thread Ingo Molnar

looks good to me:

  Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

with a few minor style nits:

> +static int io_apic_level_ack_pending(unsigned int irq)
> +{
> + struct irq_pin_list *entry;
> + unsigned long flags;
> + int pending = 0;
> + spin_lock_irqsave(&ioapic_lock, flags);

newline after variable sections please.

> + entry = irq_2_pin + irq;
> + for (;;) {
> + unsigned int reg;
> + int pin;
> + pin = entry->pin;

ditto.

> + if (pin == -1)
> + break;
> + reg = io_apic_read(entry->apic, 0x10 + pin*2);
> + /* Is the remote IRR bit set? */
> + pending |= (reg >> 14) & 1;
> + if (!entry->next)
> + break;
> + entry = irq_2_pin + entry->next;
> + }
> + spin_unlock_irqrestore(&ioapic_lock, flags);
> + return pending;

optional: it looks a bit better with a newline before the 'return' 
statement.

> +}
> +
> +

and here there's one too much newline :-)

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64 irq: check remote IRR bit before migrating level triggered irq (v2)

2007-05-31 Thread Eric W. Biederman
Andi Kleen <[EMAIL PROTECTED]> writes:

> On Thursday 31 May 2007 13:34:21 Eric W. Biederman wrote:
>> 
>> On x86_64 kernel, level triggered irq migration gets initiated in the context
>> of that interrupt(after executing the irq handler) and following steps are
>> followed to do the irq migration.
>
> What's the confidence level in the patch? Is it a .22 candidate?

Yes.

I just ran it for a little while and nothing fell over, and level
triggered irqs continued to migrate.  So it's not horribly broken.

Outside of implementation errors the worst case that can happen with
this code is the current behavior, with a small slow down.

So since I can not reproduce the original problem I would appreciate
an eyeball or two more just to make certain I reimplemented this
correctly and didn't do anything stupid.   But I don't expect anything
to be found.  The code is pretty straight forward.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86_64 irq: check remote IRR bit before migrating level triggered irq (v2)

2007-05-31 Thread Andi Kleen
On Thursday 31 May 2007 13:34:21 Eric W. Biederman wrote:
> 
> On x86_64 kernel, level triggered irq migration gets initiated in the context
> of that interrupt(after executing the irq handler) and following steps are
> followed to do the irq migration.

What's the confidence level in the patch? Is it a .22 candidate?

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-30 Thread Eric W. Biederman
"Siddha, Suresh B" <[EMAIL PROTECTED]> writes:

> Eric,
>
> On Fri, May 18, 2007 at 07:40:53AM -0700, Eric W. Biederman wrote:
>> Still in any of those I don't see a problem with switching to edge
>> triggered mode and then back again.  Either Remote IRR will keep
>> it's current state or it will be cleared.   Remote IRR should not
>> get set (when it was clear) by such a procedure because that
>> would mess up the normal interrupt enable sequence that happens
>> on boot.  So I'm pretty certain toggling the edge bit is harmless
>> and it may actually clear Remote IRR for us.
> ...
>> 
>> I think going more the way that this code has gone on arch/i386 with
>> real functions is preferable.
>
> There is an issue with this suggestion. We have an inflight
> EOI broadcast msg to this IOAPIC (that got delayed but still alive) and
> that can cause problem if we use edge and back to level to reset remote IRR 
> bit.
>
> If the vector number stays same during irq migration and if we reset remote
> IRR bit using the above method(edge and then back to level) during
> irq migration, then we have a problem. A new interrupt arriving on a new
> cpu will set the remote IRR bit and now the old inflight EOI broadcast
> reaches IOAPIC RTE(resetting the remote IRR bit, because the vector in the
> broadcast msg is same), while the kernel code still assumes that the remote
> IRR bit is still set. This will lead to more problems and issues.

Ok.  I will agree that if the level->edge->level switch does not reset
Remote_IRR and the EOI arrives in that window.  We have another condition
in which Remote_IRR can get stuck on.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-18 Thread Siddha, Suresh B
On Fri, May 18, 2007 at 12:02:16PM -0700, Eric W. Biederman wrote:
> I will look closer but I do believe that from the ioapic to the cpu the 2.6.21
> code should be fairly robust with respect to inflight messages from the ioapic
> to the local apics and the cpus.  What I failed to consider were inflight
> messages in the other direction arriving out of order.

Yes. Inflight messages from cpu to ioapic was what I was referring to.

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-18 Thread Eric W. Biederman
"Siddha, Suresh B" <[EMAIL PROTECTED]> writes:

> On Fri, May 18, 2007 at 11:28:25AM -0700, Yinghai Lu wrote:
>> On 5/18/07, Siddha, Suresh B <[EMAIL PROTECTED]> wrote:
>> >
>> > If the vector number stays same during irq migration and if we reset remote
>> > IRR bit using the above method(edge and then back to level) during
>> > irq migration, then we have a problem. A new interrupt arriving on a new
>> > cpu will set the remote IRR bit and now the old inflight EOI broadcast
>> > reaches IOAPIC RTE(resetting the remote IRR bit, because the vector in the
>> > broadcast msg is same), while the kernel code still assumes that the remote
>> > IRR bit is still set. This will lead to more problems and issues.
>> 
>> coud add some line __assign_irq_vector. to make sure old_vector!=vector.
>
> hmm..
> what happens when there is second(and very quick) irq migration which brings 
> the
> irq back to old cpu(or to a third cpu) with old vector.
>
> Point is, we are not taking care of the inflight messages(which can perhaps,
> theoretically, can get delayed for long time)

I will look closer but I do believe that from the ioapic to the cpu the 2.6.21
code should be fairly robust with respect to inflight messages from the ioapic
to the local apics and the cpus.  What I failed to consider were inflight
messages in the other direction arriving out of order.

Part of the problem in the general case is the only way you can tell an inflight
message was transmitted is that a message that you can prove followed the first
message arrives somewhere.

I'm in the middle of tracking two other problems so I can't review this in
detail until later today at the earliest.
 
Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-18 Thread Siddha, Suresh B
On Fri, May 18, 2007 at 11:28:25AM -0700, Yinghai Lu wrote:
> On 5/18/07, Siddha, Suresh B <[EMAIL PROTECTED]> wrote:
> >
> > If the vector number stays same during irq migration and if we reset remote
> > IRR bit using the above method(edge and then back to level) during
> > irq migration, then we have a problem. A new interrupt arriving on a new
> > cpu will set the remote IRR bit and now the old inflight EOI broadcast
> > reaches IOAPIC RTE(resetting the remote IRR bit, because the vector in the
> > broadcast msg is same), while the kernel code still assumes that the remote
> > IRR bit is still set. This will lead to more problems and issues.
> 
> coud add some line __assign_irq_vector. to make sure old_vector!=vector.

hmm..
what happens when there is second(and very quick) irq migration which brings the
irq back to old cpu(or to a third cpu) with old vector.

Point is, we are not taking care of the inflight messages(which can perhaps,
theoretically, can get delayed for long time)

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-18 Thread Yinghai Lu

On 5/18/07, Siddha, Suresh B <[EMAIL PROTECTED]> wrote:


If the vector number stays same during irq migration and if we reset remote
IRR bit using the above method(edge and then back to level) during
irq migration, then we have a problem. A new interrupt arriving on a new
cpu will set the remote IRR bit and now the old inflight EOI broadcast
reaches IOAPIC RTE(resetting the remote IRR bit, because the vector in the
broadcast msg is same), while the kernel code still assumes that the remote
IRR bit is still set. This will lead to more problems and issues.


coud add some line __assign_irq_vector. to make sure old_vector!=vector.

YH
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-18 Thread Eric W. Biederman
"Yinghai Lu" <[EMAIL PROTECTED]> writes:

> On 5/18/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
>> We can solve the problem without doing that, and keeping the same
>> vector number during migration keeps x86 from scaling.
>
> I mean ioapic level irq couls be limited. new device could use MSI or
> HT irq directly and less irq routing problem.

Possibly.  It really doesn't buy us anything until most irqs are MSI
which they are not yet.

>> Personally I would prefer to disallow irq migration.
> ? typo?
> For amd platform with different hypertransport chain on different
> nodes, irq migration is needed.

irqs not on cpu0 are needed.  irq migration is less necessary, and I 
periodically
think we are insane for supporting it.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-18 Thread Siddha, Suresh B
Eric,

On Fri, May 18, 2007 at 07:40:53AM -0700, Eric W. Biederman wrote:
> Still in any of those I don't see a problem with switching to edge
> triggered mode and then back again.  Either Remote IRR will keep
> it's current state or it will be cleared.   Remote IRR should not
> get set (when it was clear) by such a procedure because that
> would mess up the normal interrupt enable sequence that happens
> on boot.  So I'm pretty certain toggling the edge bit is harmless
> and it may actually clear Remote IRR for us.
...
> 
> I think going more the way that this code has gone on arch/i386 with
> real functions is preferable.

There is an issue with this suggestion. We have an inflight
EOI broadcast msg to this IOAPIC (that got delayed but still alive) and
that can cause problem if we use edge and back to level to reset remote IRR bit.

If the vector number stays same during irq migration and if we reset remote
IRR bit using the above method(edge and then back to level) during
irq migration, then we have a problem. A new interrupt arriving on a new
cpu will set the remote IRR bit and now the old inflight EOI broadcast
reaches IOAPIC RTE(resetting the remote IRR bit, because the vector in the
broadcast msg is same), while the kernel code still assumes that the remote
IRR bit is still set. This will lead to more problems and issues.

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-18 Thread Yinghai Lu

On 5/18/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:

We can solve the problem without doing that, and keeping the same
vector number during migration keeps x86 from scaling.


I mean ioapic level irq couls be limited. new device could use MSI or
HT irq directly and less irq routing problem.


Personally I would prefer to disallow irq migration.

? typo?
For amd platform with different hypertransport chain on different
nodes, irq migration is needed.

YH
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-18 Thread Eric W. Biederman
"Yinghai Lu" <[EMAIL PROTECTED]> writes:

> Eric,
>
> ioapic_level irq is limited, So if we keep vector number not changed
> when imgration to other cpus.  It that could help.

We can solve the problem without doing that, and keeping the same
vector number during migration keeps x86 from scaling.  Personally
I would prefer to disallow irq migration.

> it will need modify a little with assign_irq_vector and
> irq_complete_move/smp_irq_move_cleanup_interrupt. because it assume
> vector must be changed.

Yes it does.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-18 Thread Yinghai Lu

Eric,

ioapic_level irq is limited, So if we keep vector number not changed
when imgration to other cpus.  It that could help.

it will need modify a little with assign_irq_vector and
irq_complete_move/smp_irq_move_cleanup_interrupt. because it assume
vector must be changed.

YH
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-18 Thread Eric W. Biederman
"Siddha, Suresh B" <[EMAIL PROTECTED]> writes:

> On Thu, May 17, 2007 at 05:30:13PM -0700, Eric W. Biederman wrote:
>> So why does any of this matter?
>>
>> My memory says that the ioapic state for sending irqs gets reset when we
>> unmask the irq.
>
> No. Atleast not on the platform we have tested.

Ok.  It simply have the test for an external interrupt getting reset
not the delivery state machine.  Otherwise you would not have a
problem in the first place.

>> If not I expect we can use the mask and edge, unmask and level
>> work-around from arch/i386/ioapic.c.  That looks like it would
>> be less code, less error prone and easier to implement then the
>> current work around.
>
> hmm. We have to check and see if it can help. But ioapic spec doesn't
> say that the Remote IRR gets reset when the trigger mode changes. Not sure
> if we can apply the logic widely.

Worst case it is a noop, so wide deployment should be safe.

The original definition of the Remote IRR is:

> 14 Remote IRR--RO. This bit is used for level triggered
>interrupts. Its meaning is undefined for edge triggered
>interrupts. For level triggered interrupts, this bit is set to 1
>when local APIC(s) accept the level interrupt sent by the
>IOAPIC. The Remote IRR bit is set to 0 when an EOI message with a
>matching interrupt vector is received from a local APIC.

We know it is not used to throttle edge triggered interrupts.

The question is how does the hardware implementation make that come
about?   By ignoring Remote IRR when transmitting edge triggered
interrupts?  By insisting Remote IRR is clear when doing a mode
transition?  By clearing Remote IRR when doing a mode transition?

At least historically there is some evidence that Intel's ioapics
did the latter.  I don't know if there is a defacto standard
for how ioapic mode transitions happen.

Still in any of those I don't see a problem with switching to edge
triggered mode and then back again.  Either Remote IRR will keep
it's current state or it will be cleared.   Remote IRR should not
get set (when it was clear) by such a procedure because that
would mess up the normal interrupt enable sequence that happens
on boot.  So I'm pretty certain toggling the edge bit is harmless
and it may actually clear Remote IRR for us.



This does seem to be a general applicable problem (at least in theory)
because we have no useful ordering rules we can take advantage of
so something that works across the board does make sense.

With a little care I think we can safely send an ack to the new
vector just in case this race triggers.  Before we send an
ipi to ourselves we need to ensure in the vector allocator that
the vector is either just for level or just for edge triggered
interrupts.  Then we can just send an ipi to our current cpu
ack it and we ensure the interrupt it acked.

"Siddha, Suresh B" <[EMAIL PROTECTED]> writes:

> On Thu, May 17, 2007 at 04:58:02PM -0700, Eric W. Biederman wrote:
>> In essence this makes sense, and it may be the best work around for
>> buggy hardware available.   However I am not convinced that the remote
>> IRR on ioapics works reliably enough to be used for anything.   I
>> tested this earlier and I could not successfully poll the remote irr
>> bit to see if an ioapic had received an irq acknowledgement.  Instead
>> I locked up irq controllers.
>> 
>> If remote IRR worked reliably I could have pulled the irq migration
>> out of irq context.  So this fix looks dubious to me.
>
> We are just reading IOAPIC RTE's. So not sure why this can lead to lock
> up's. This patch is not reading any new HW registers. It just
> checks for certain bit values from the register that the code reads
> anyhow.
>
> Remote IRR should work reliably otherwise we will see lot more issues,
> I guess.

The test I performed was outside of interrupt context to mask an
interrupt, poll Remote IRR until it was clear, and them migrate
the interrupt.  That does not work correctly.  If that could
be made to work correctly we would not need to interrupt migration
in interrupt context.

So while that is not conclusive evidence that Remote IRR is not a
reliable indication of when it is safe to do things it is a strong
suggestion that it is not a good indicator.  Currently we don't
look at Remote IRR so we have no evidence that it is good for anything.

>> Why is the EOI delayed?  Can we work around that?
>
> Under some stress conditions, EOI can get retried because of which
> it is getting delayed.

Ok.  So that is why EOI is delayed.

>> It would be nice if ioapics and local apics actually obeyed the pci
>> ordering rules where a read would flush all traffic.  And we do have a
>> read in there.
>> 
>> I'm assuming the symptom you are seeing on current kernels is that
>> occasionally
>> the irq gets stuck and never fires again?
>
> yes.
>
>> I'm not certain I like the patch either, but I need to look more closely.
>> You are mixing changes to generic and arch specific code together.
>> 

Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-18 Thread Eric W. Biederman
Andi Kleen <[EMAIL PROTECTED]> writes:

> On Friday 18 May 2007 01:03, Siddha, Suresh B wrote:
>
>> Normally, the EOI generated by local APIC for level trigger interrupt
>> contains vector number. The IOAPIC will take this vector number and
>> search the IOAPIC RTE entries for an entry with matching vector number and
>> clear the remote IRR bit (indicate EOI). However, if the vector number is
>> changed (as in step 3) the IOAPIC will not find the RTE entry when the EOI
>> is received later. This will cause the remote IRR to get stuck causing the
>> interrupt hang (no more interrupt from this RTE).
>
> Does this happen often or did you only see it in some extreme or obscure
> case?

It is obscure.  Someone may have a case that makes it easy to
reproduce but it won't happen frequently.

So we should have enough time to review and fix this carefully,
instead of needing to rush a fix out.

>> +/*
>> + * If the EOI still didn't reach the RTE corresponding to the
>> + * level triggered irq, postpone the irq migration to the next
>> + * irq arrival event.
>> + */
>> +if (pending_eoi(irq)) {
>> +irq_desc[irq].status |= IRQ_MOVE_PENDING;
>> +return;
>
> Other code seems to have similar problems, but we don't have any lock
> protecting that bitmap against parallel updates outside the irq itself, don't 
> we? Perhaps it needs to be all set_bit()

Huh?  The irq lock is sufficient for this.

This is not a code software problem, this is a software hardware
interaction problem.  This is a problem in that there is no way to
guarantee that ioapics see a series of manipulations in the same
order that the cpu issues those manipulations.

In particular the ioapic is seeing the ack after we reprogram
the ioapic, which causes the ioapic to not accept the ack (because
it's vector number changed) and then the ioapic waits forever
for an ack that isn't coming.

The other half of the problem is pending_eoi() is testing a bit in the
hardware that in my experience is not sufficient to guarantee that
the hardware is in a state where action may be taken.

This is a very delicate problem because there is little if any guaranteed
ordering between cpu operations and ioapic operations, especially because
there are two channels of communication the ``apic bus'' and the memory
mapped ioapic registers.  So what happens on one channel has no ordering
requirements of any kind with what happens on the other channel.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-18 Thread Andi Kleen
On Friday 18 May 2007 01:03, Siddha, Suresh B wrote:

> Normally, the EOI generated by local APIC for level trigger interrupt
> contains vector number. The IOAPIC will take this vector number and
> search the IOAPIC RTE entries for an entry with matching vector number and
> clear the remote IRR bit (indicate EOI). However, if the vector number is
> changed (as in step 3) the IOAPIC will not find the RTE entry when the EOI
> is received later. This will cause the remote IRR to get stuck causing the
> interrupt hang (no more interrupt from this RTE).

Does this happen often or did you only see it in some extreme or obscure
case?

> + /*
> +  * If the EOI still didn't reach the RTE corresponding to the
> +  * level triggered irq, postpone the irq migration to the next
> +  * irq arrival event.
> +  */
> + if (pending_eoi(irq)) {
> + irq_desc[irq].status |= IRQ_MOVE_PENDING;
> + return;

Other code seems to have similar problems, but we don't have any lock
protecting that bitmap against parallel updates outside the irq itself, don't 
we? Perhaps it needs to be all set_bit()

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-17 Thread Siddha, Suresh B
On Thu, May 17, 2007 at 05:30:13PM -0700, Eric W. Biederman wrote:
> So why does any of this matter?
>
> My memory says that the ioapic state for sending irqs gets reset when we
> unmask the irq.

No. Atleast not on the platform we have tested.

> If not I expect we can use the mask and edge, unmask and level
> work-around from arch/i386/ioapic.c.  That looks like it would
> be less code, less error prone and easier to implement then the
> current work around.

hmm. We have to check and see if it can help. But ioapic spec doesn't
say that the Remote IRR gets reset when the trigger mode changes. Not sure
if we can apply the logic widely.

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-17 Thread Siddha, Suresh B
On Thu, May 17, 2007 at 04:58:02PM -0700, Eric W. Biederman wrote:
> In essence this makes sense, and it may be the best work around for
> buggy hardware available.   However I am not convinced that the remote
> IRR on ioapics works reliably enough to be used for anything.   I
> tested this earlier and I could not successfully poll the remote irr
> bit to see if an ioapic had received an irq acknowledgement.  Instead
> I locked up irq controllers.
> 
> If remote IRR worked reliably I could have pulled the irq migration
> out of irq context.  So this fix looks dubious to me.

We are just reading IOAPIC RTE's. So not sure why this can lead to lock
up's. This patch is not reading any new HW registers. It just
checks for certain bit values from the register that the code reads
anyhow.

Remote IRR should work reliably otherwise we will see lot more issues,
I guess.

> 
> Why is the EOI delayed?  Can we work around that?

Under some stress conditions, EOI can get retried because of which
it is getting delayed.

> 
> It would be nice if ioapics and local apics actually obeyed the pci
> ordering rules where a read would flush all traffic.  And we do have a
> read in there.
> 
> I'm assuming the symptom you are seeing on current kernels is that
> occasionally
> the irq gets stuck and never fires again?

yes.

> I'm not certain I like the patch either, but I need to look more closely.
> You are mixing changes to generic and arch specific code together.
> 
> I think pending_eoi should not try to reuse __DO_ACTION as that
> helper bit of code does not seem appropriate.

Wanted to minimize the code duplicacy and hence overloaded existing
macros.

> 
> It would probably be best if the pending_eoi check was in
> ack_apic_level with the rest of the weird logic we are working around.

Just wanted to make sure that we give some time for the EOI to actually
reach the IO-APIC.

> 
> Could we please have more detail on this hardware behavior.  Why is
> the EIO write write delayed?   Why can't we just issue reads in
> appropriate places to flush the write?

Because the EOI to IOAPIC gets generated as a side effect of processor
EOI.

thanks,
suresh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-17 Thread Eric W. Biederman
"Siddha, Suresh B" <[EMAIL PROTECTED]> writes:

> on x86_64 kernel, level triggered irq migration gets initiated in the context
> of that interrupt(after executing the irq handler) and following steps are
> followed to do the irq migration.
>
> 1. mask IOAPIC RTE entry; // write to IOAPIC RTE
> 2. EOI;   // processor EOI write
> 3. reprogram IOAPIC RTE entry // write to IOAPIC RTE with new destination and
>   // and interrupt vector due to per cpu vector
>   // allocation.
> 4. unmask IOAPIC RTE entry;   // write to IOAPIC RTE
>
> Because of the per cpu vector allocation in x86_64 kernels, when the irq
> migrates to a different cpu, new vector(corresponding to the new cpu) will
> get allocated.
>
> An EOI write to local APIC has a side effect of generating an EOI write 
> for level trigger interrupts (normally this is a broadcast to all IOAPICs). 
> The EOI broadcast generated as a side effect of EOI write to processor may 
> be delayed while the other IOAPIC writes (step 3 and 4) can go through.
>
> Normally, the EOI generated by local APIC for level trigger interrupt
> contains vector number. The IOAPIC will take this vector number and
> search the IOAPIC RTE entries for an entry with matching vector number and
> clear the remote IRR bit (indicate EOI). However, if the vector number is
> changed (as in step 3) the IOAPIC will not find the RTE entry when the EOI
> is received later. This will cause the remote IRR to get stuck causing the
> interrupt hang (no more interrupt from this RTE).
>
> Current x86_64 kernel assumes that remote IRR bit is cleared by the time
> IOAPIC RTE is reprogrammed. Fix this assumption by checking for remote IRR
> bit and if it still set, delay the irq migration to the next interrupt
> arrival event(hopefully, next time remote IRR bit will get cleared
> before the IOAPIC RTE is reprogrammed).
>
> Initial analysis and patch from Nanhai.

So why does any of this matter? 

My memory says that the ioapic state for sending irqs gets reset when we
unmask the irq.

If not I expect we can use the mask and edge, unmask and level
work-around from arch/i386/ioapic.c.  That looks like it would
be less code, less error prone and easier to implement then the
current work around.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] x86_64, irq: check remote IRR bit before migrating level triggered irq

2007-05-17 Thread Eric W. Biederman
"Siddha, Suresh B" <[EMAIL PROTECTED]> writes:

> on x86_64 kernel, level triggered irq migration gets initiated in the context
> of that interrupt(after executing the irq handler) and following steps are
> followed to do the irq migration.
>
> 1. mask IOAPIC RTE entry; // write to IOAPIC RTE
> 2. EOI;   // processor EOI write
> 3. reprogram IOAPIC RTE entry // write to IOAPIC RTE with new destination and
>   // and interrupt vector due to per cpu vector
>   // allocation.
> 4. unmask IOAPIC RTE entry;   // write to IOAPIC RTE
>
> Because of the per cpu vector allocation in x86_64 kernels, when the irq
> migrates to a different cpu, new vector(corresponding to the new cpu) will
> get allocated.
>
> An EOI write to local APIC has a side effect of generating an EOI write 
> for level trigger interrupts (normally this is a broadcast to all IOAPICs). 
> The EOI broadcast generated as a side effect of EOI write to processor may 
> be delayed while the other IOAPIC writes (step 3 and 4) can go through.
>
> Normally, the EOI generated by local APIC for level trigger interrupt
> contains vector number. The IOAPIC will take this vector number and
> search the IOAPIC RTE entries for an entry with matching vector number and
> clear the remote IRR bit (indicate EOI). However, if the vector number is
> changed (as in step 3) the IOAPIC will not find the RTE entry when the EOI
> is received later. This will cause the remote IRR to get stuck causing the
> interrupt hang (no more interrupt from this RTE).
>
> Current x86_64 kernel assumes that remote IRR bit is cleared by the time
> IOAPIC RTE is reprogrammed. Fix this assumption by checking for remote IRR
> bit and if it still set, delay the irq migration to the next interrupt
> arrival event(hopefully, next time remote IRR bit will get cleared
> before the IOAPIC RTE is reprogrammed).
>
> Initial analysis and patch from Nanhai.

In essence this makes sense, and it may be the best work around for
buggy hardware available.   However I am not convinced that the remote
IRR on ioapics works reliably enough to be used for anything.   I
tested this earlier and I could not successfully poll the remote irr
bit to see if an ioapic had received an irq acknowledgement.  Instead
I locked up irq controllers.

If remote IRR worked reliably I could have pulled the irq migration
out of irq context.  So this fix looks dubious to me.

Why is the EOI delayed?  Can we work around that?

It would be nice if ioapics and local apics actually obeyed the pci
ordering rules where a read would flush all traffic.  And we do have a
read in there.

I'm assuming the symptom you are seeing on current kernels is that occasionally
the irq gets stuck and never fires again?


I'm not certain I like the patch either, but I need to look more closely.
You are mixing changes to generic and arch specific code together.

I think pending_eoi should not try to reuse __DO_ACTION as that
helper bit of code does not seem appropriate.

It would probably be best if the pending_eoi check was in
ack_apic_level with the rest of the weird logic we are working around.

Could we please have more detail on this hardware behavior.  Why is
the EIO write write delayed?   Why can't we just issue reads in
appropriate places to flush the write?

I need to think about this some more to see if there are any other
possible ways we could address this issue that would be more robust.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/