Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 08:33:27AM +0200, Marcin Ślusarz wrote:
 2007/8/9, Jarek Poplawski [EMAIL PROTECTED]:
...
  diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
  --- 2.6.23-rc1-/kernel/irq/chip.c   2007-07-09 01:32:17.0 +0200
  +++ 2.6.23-rc1/kernel/irq/chip.c    2007-08-08 20:49:07.0 +0200
  @@ -389,12 +389,19 @@ handle_fasteoi_irq(unsigned int irq, str
  unsigned int cpu = smp_processor_id();
  struct irqaction *action;
  irqreturn_t action_ret;
  +   int edge = 0;
 
...
 NETDEV WATCHDOG: eth0: transmit timed out
 eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=351.
 eth0: Resetting the 8390 t=4295229000...<6>NETDEV WATCHDOG: eth0:
 transmit timed out
 eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=718.
 eth0: Resetting the 8390 t=429523...<6>NETDEV WATCHDOG: eth0:
 transmit timed out
 eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=874.
 etc...

So, we still have to wait for the exact explanation...

Thanks very much Marcin!

I think there is still this one left for you to test:
Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
Date: Wed, 8 Aug 2007 13:00:37 +0200

If it's not too much trouble, it would be interesting to try this with
a different CONFIG_HZ too, e.g. you could start with 100 (I guess you
already tested a very similar thing in 2.6.23-rc2 with 1000(?)).

Jean-Baptiste: you can skip or stop testing this 'experimental'
patch, too.

Regards,
Jarek P.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 12:43:43PM +0200, Marcin Ślusarz wrote:
 2007/8/10, Jarek Poplawski [EMAIL PROTECTED]:
  (..)
  I think, there is this one possible for your testing yet?:
  Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
  Date: Wed, 8 Aug 2007 13:00:37 +0200
 I think I already tested this patch, but this thread is sooo big and I
 can't find my response...

I think that was Ingo's very similar patch, which after your testing
is now in 2.6.23-rc2. I've moved the return for level type irqs a little
later, and it works the new way only on x86_64.

But I think this patch is less important now: if you find some time,
try Ingo's or Thomas' newest patches first (latest possible versions).

 
  If it's not a great problem it would be interesting to try this with
  different CONFIG_HZ too e.g. you could start with 100 (I guess, you
  tested very similar thing in 2.6.23-rc2 with 1000(?) already).
 All my tests were done on 2.6.22.1

Fine! Unless stated otherwise, this version should be appropriate for
most of these patches. But please send your current configs at the next
opportunity (dmesg, /proc/interrupts and .config).

Bye (till monday),
Jarek P.


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Marcin Ślusarz
2007/8/10, Jarek Poplawski [EMAIL PROTECTED]:
 (..)
 I think, there is this one possible for your testing yet?:
 Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
 Date: Wed, 8 Aug 2007 13:00:37 +0200
I think I already tested this patch, but this thread is sooo big and I
can't find my response...

 If it's not a great problem it would be interesting to try this with
 different CONFIG_HZ too e.g. you could start with 100 (I guess, you
 tested very similar thing in 2.6.23-rc2 with 1000(?) already).
All my tests were done on 2.6.22.1

Marcin


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 11:08:33AM +0200, Ingo Molnar wrote:
 
 * Jarek Poplawski [EMAIL PROTECTED] wrote:
 
  On 10-08-2007 10:05, Thomas Gleixner wrote:
  ...
   But suppressing the resend is not fixing the driver problem. The 
   problem can show up with spurious interrupts and with interrupts on 
   a shared PCI interrupt line at any time. It just might take weeks 
   instead of minutes.
  
  Maybe I miss something but it's not the same!
 
 _now_ i finally understand what you probably meant: because sw-resend 
 worked and hw-resend didnt, it's hw-resend that is causing the breakage, 
 not any driver or irqflow bug - correct?

All correct! We also checked the possibility that it is not the hw
itself but a wrong way of handling after the hw (acking too late). That
was a false idea (or a bad implementation), so it looks like a hw vs
lapic problem.

Jarek P.
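
As a rough illustration of the sw-resend vs hw-resend distinction discussed above, the following user-space sketch models the choice made when a pending irq is replayed: try the chip's retrigger hook first, and fall back to a software resend if the hook is missing or reports failure. It is a toy model under stated assumptions, not kernel code: every `toy_`/`TOY_` name and the flag values are hypothetical, and only the shape of the test follows the check_irq_resend() code quoted in this thread.

```c
#include <stddef.h>

/* Hypothetical flag values for this sketch only (not the kernel's). */
#define TOY_IRQ_PENDING 0x1u
#define TOY_IRQ_REPLAY  0x2u

enum toy_resend { TOY_RESEND_NONE, TOY_RESEND_HW, TOY_RESEND_SW };

struct toy_desc {
	unsigned int status;
	/* Returns nonzero when the hardware accepted the retrigger. */
	int (*retrigger)(unsigned int irq);
};

/* Stand-in for a chip whose hardware retrigger always succeeds. */
static int toy_hw_retrigger_ok(unsigned int irq)
{
	(void)irq;
	return 1;
}

/* Decide how a pending irq gets replayed when it is re-enabled. */
static enum toy_resend toy_check_resend(struct toy_desc *desc, unsigned int irq)
{
	unsigned int status = desc->status;

	/* Only replay an irq that is pending and not already being replayed. */
	if ((status & (TOY_IRQ_PENDING | TOY_IRQ_REPLAY)) != TOY_IRQ_PENDING)
		return TOY_RESEND_NONE;

	desc->status = (status & ~TOY_IRQ_PENDING) | TOY_IRQ_REPLAY;

	/* Hardware path first. */
	if (desc->retrigger && desc->retrigger(irq))
		return TOY_RESEND_HW;

	/*
	 * Otherwise fall back to the software path. In the kernel this
	 * branch exists only when CONFIG_HARDIRQS_SW_RESEND is set.
	 */
	return TOY_RESEND_SW;
}
```

A descriptor with `toy_hw_retrigger_ok` installed takes the hardware path, while one with a NULL hook takes the software path; a second call on either returns `TOY_RESEND_NONE`, because the replay flag is already set.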


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Ingo Molnar

* Jarek Poplawski [EMAIL PROTECTED] wrote:

 On 10-08-2007 10:05, Thomas Gleixner wrote:
 ...
  But suppressing the resend is not fixing the driver problem. The 
  problem can show up with spurious interrupts and with interrupts on 
  a shared PCI interrupt line at any time. It just might take weeks 
  instead of minutes.
 
 Maybe I miss something but it's not the same!

_now_ i finally understand what you probably meant: because sw-resend 
worked and hw-resend didn't, it's hw-resend that is causing the breakage, 
not any driver or irqflow bug - correct?

Ingo


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jean-Baptiste Vignaud
 So, we still have to wait for the exact explanation...
 
 Thanks very much Marcin!
 
 I think, there is this one possible for your testing yet?:
 Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
 Date: Wed, 8 Aug 2007 13:00:37 +0200
 
 If it's not a great problem it would be interesting to try this with
 different CONFIG_HZ too e.g. you could start with 100 (I guess, you
 tested very similar thing in 2.6.23-rc2 with 1000(?) already).
 
 Jean-Baptiste: you can skip/break testing of this 'experimental'

ok

I was still testing on -rc2:
Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
Date: Wed, 8 Aug 2007 13:00:37 +0200

For me, after 1 day 20 hours the network is still up, with more than 1 TB
of network traffic. HZ was 1000; I will restart with HZ=100.

Jb




Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote:
...
 I was still testing on -rc2:
 Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
 Date: Wed, 8 Aug 2007 13:00:37 +0200
 
 For me after 1day 20hours, the network is still up, with more than 1To
 of network traffic. HZ was 1000, i restart with HZ=100.

For me it's enough too, but Thomas seems to have doubts.

You wrote earlier that you have 2.6.23-rc1 with HARDIRQS_SW_RESEND
prepared too. So, if it's not too much trouble, maybe you could try
that first. Thomas may send something tomorrow, so I hope the HZ=100
test can wait a bit?

Many thanks,
Jarek P.


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jean-Baptiste Vignaud
 For me it's enough too but Thomas seems to doubt.
 
 You've written earlier that you've 2.6.23-rc1 with HARDIRQS_SW_RESEND
 prepared too. So, if this is not a great problem maybe you could try
 this first. Tomorrow Thomas may send something, so this 100HZ could
 wait yet, I hope?

Ok, i'll test 2.6.23-rc1 with HARDIRQS_SW_RESEND first.

Jb



Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Ingo Molnar

* Jarek Poplawski [EMAIL PROTECTED] wrote:

 On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote:
 ...
  I was still testing on -rc2:
  Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
  Date: Wed, 8 Aug 2007 13:00:37 +0200
  
  For me after 1day 20hours, the network is still up, with more than 
  1To of network traffic. HZ was 1000, i restart with HZ=100.
 
 For me it's enough too but Thomas seems to doubt.

seem to doubt what? That rc2 fixes the symptom? That is a sure thing, 
and we never doubted that. I think you might have misunderstood what 
Thomas said and meant, so please just state your opinion unambiguously 
so that we can fix any mis-communication :)

Ingo


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 10:48:41AM +0200, Ingo Molnar wrote:
 
 * Jarek Poplawski [EMAIL PROTECTED] wrote:
 
  On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote:
  ...
   I was still testing on -rc2:
   Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
   Date: Wed, 8 Aug 2007 13:00:37 +0200
   
   For me after 1day 20hours, the network is still up, with more than 
   1To of network traffic. HZ was 1000, i restart with HZ=100.
  
  For me it's enough too but Thomas seems to doubt.
 
 seem to doubt what? That rc2 fixes the symptom? That is a sure thing, 
 and we never doubted that. I think you might have misunderstood what 
 Thomas said and meant, so please just state your opinion unambiguously 
 so that we can fix any mis-communication :)
 
   Ingo
 


On 25-07-2007 02:19, Thomas Gleixner wrote:
...
 Actually we only need the resend for edge type interrupts. Level type
 interrupts come back once enable_irq() re-enables the interrupt line.


On 10-08-2007 10:05, Thomas Gleixner wrote:
...
 But suppressing the resend is not fixing the driver problem. The problem
 can show up with spurious interrupts and with interrupts on a shared PCI
 interrupt line at any time. It just might take weeks instead of minutes.

Maybe I'm missing something, but these two statements are not the same!

So, should Jean-Baptiste or Marcin test this for weeks, or is this enough?

Jarek P.
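
The difference between the two quoted statements can be made concrete with a small toy model of one interrupt line. This is only a sketch under simplifying assumptions (a single device, no sharing, no spurious interrupts), and every name in it is hypothetical; it shows why a level type interrupt comes back by itself once the line is re-enabled, while an edge that arrived during the disabled period is lost unless something resends it.

```c
#include <stdbool.h>

/* Toy model of one interrupt line; every name here is hypothetical. */
struct toy_line {
	bool level_triggered; /* true: level type, false: edge type */
	bool asserted;        /* level only: device still holds the line active */
	bool masked;          /* line disabled at the controller */
	bool pending;         /* software note: an edge arrived while masked */
	int  delivered;       /* interrupts that reached the handler */
};

/* The device raises its interrupt. */
static void toy_raise(struct toy_line *l)
{
	if (l->level_triggered)
		l->asserted = true;      /* stays active until serviced */

	if (!l->masked) {
		l->delivered++;
		l->asserted = false;     /* handler services the device */
	} else if (!l->level_triggered) {
		l->pending = true;       /* the edge itself is gone by now */
	}
}

/* Re-enable the line, optionally resending a recorded edge. */
static void toy_enable(struct toy_line *l, bool resend_edges)
{
	l->masked = false;

	if (l->level_triggered) {
		if (l->asserted) {       /* still active: it fires again */
			l->delivered++;
			l->asserted = false;
		}
	} else if (l->pending) {
		l->pending = false;
		if (resend_edges)
			l->delivered++;  /* replayed by a hw or sw resend */
	}
}
```

Raising an interrupt on a masked level type line and then calling `toy_enable()` delivers it even with `resend_edges` false; the same sequence on an edge type line delivers nothing unless `resend_edges` is true.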


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Ingo Molnar

* Jarek Poplawski [EMAIL PROTECTED] wrote:

 All correct! There was also checked a possibility it can be not hw 
 itself, but wrong way of handling after hw (acking too late). This was 
 false idea (or bad implementation), so it looks like hw vs lapic 
 problem.

i think the problem is that local APIC 'self vectors' might be 
edge-triggered by default. I'm not exactly sure whether passing in 
APIC_INT_LEVELTRIG to send_IPI_self() will truly be interpreted by the 
local APIC into any external IO-APIC ACK sequence (the local APIC might 
just treat self-vectors as always-edge) - and it might also be that the 
pure act of mixing self-triggered vectors with level-triggered external 
irqs sometimes confuses the IO-APIC - local-APIC messaging. One more 
test of the patch below will tell us a bit more about this part of the 
story.

Ingo

Index: linux/arch/i386/kernel/io_apic.c
===
--- linux.orig/arch/i386/kernel/io_apic.c
+++ linux/arch/i386/kernel/io_apic.c
@@ -735,7 +735,8 @@ void fastcall send_IPI_self(int vector)
 	 * Wait for idle.
 	 */
 	apic_wait_icr_idle();
-	cfg = APIC_DM_FIXED | APIC_DEST_SELF | vector | APIC_DEST_LOGICAL;
+	cfg = APIC_DM_FIXED | APIC_DEST_SELF | vector | APIC_DEST_LOGICAL |
+						APIC_INT_LEVELTRIG;
 	/*
 	 * Send the IPI. The write to APIC_ICR fires this off.
 	 */
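
To make the one-line change above easier to read, here is a stand-alone sketch of how the ICR value for a self-IPI is composed with and without the level-trigger bit. The helper `toy_self_ipi_cfg()` is hypothetical and nothing is written to real hardware; the bit values are the ones the local APIC interrupt command register uses for these fields, but please check them against the apicdef.h of your own tree before relying on them.

```c
#include <stdint.h>

/* ICR field values; verify against your tree's apicdef.h. */
#define APIC_DEST_SELF      0x40000u  /* destination shorthand: self */
#define APIC_INT_LEVELTRIG  0x08000u  /* trigger mode: level */
#define APIC_DEST_LOGICAL   0x00800u  /* destination mode: logical */
#define APIC_DM_FIXED       0x00000u  /* delivery mode: fixed */

/* Hypothetical helper: build the ICR low word for a self-IPI. */
static uint32_t toy_self_ipi_cfg(uint8_t vector, int level_trigger)
{
	uint32_t cfg = APIC_DM_FIXED | APIC_DEST_SELF | vector | APIC_DEST_LOGICAL;

	/* The test patch sets this bit unconditionally. */
	if (level_trigger)
		cfg |= APIC_INT_LEVELTRIG;

	return cfg;
}
```

For vector 0x31 the two variants differ only in bit 15. Whether the local APIC actually honours that bit for a self-vector is exactly the open question the test is meant to answer.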


[patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-09 Thread Jarek Poplawski
On Wed, Aug 08, 2007 at 01:42:43PM +0200, Jarek Poplawski wrote:
 Read below please:
 
 On Wed, Aug 08, 2007 at 01:09:36PM +0200, Marcin Ślusarz wrote:
  2007/8/7, Jarek Poplawski [EMAIL PROTECTED]:
   So, the let's try this idea yet: modified Ingo's x86: activate
   HARDIRQS_SW_RESEND patch.
   (Don't forget about make oldconfig before make.)
   For testing only.
...
   diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
   --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.0 +0200
   +++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.0 +0200
   @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
   depends on GENERIC_HARDIRQS && SMP
   default y
  
   +config HARDIRQS_SW_RESEND
...
  Works fine with:
 
 Very nice! It would be about time this kernel should start behave...
 
  WARNING: at kernel/irq/resend.c:79 check_irq_resend()
  
  Call Trace:
...
 So, it looks like x86_64 io_apic's IPI code was unused too long...
 I hope it's a piece of cake for Ingo now...

So we now know it's almost definitely something about lapic and IPIs,
but maybe this code is not the one to blame...

Here is one more patch to check the possibility that it's about the way
resent edge type irqs are handled by level type handlers:
so, let's check whether acking happens too late...

Marcin and Jean-Baptiste: I would be very glad if you could test it, as
usual! And no need to hurry; I think we know enough to fix this for you,
but maybe this test could show whether there are errors in the lapics or
only bad handling.

Many thanks,
Jarek P. 

PS: this patch is very experimental, and only intended for testing.
It should be applied to a clean 2.6.23-rc1 or something a bit older
(e.g. 2.6.22), so neither 2.6.23-rc2 nor any other patch from this
thread should be applied.

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c   2007-07-09 01:32:17.0 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c    2007-08-08 20:49:07.0 +0200
@@ -389,12 +389,19 @@ handle_fasteoi_irq(unsigned int irq, str
 	unsigned int cpu = smp_processor_id();
 	struct irqaction *action;
 	irqreturn_t action_ret;
+	int edge = 0;
 
 	spin_lock(&desc->lock);
 
 	if (unlikely(desc->status & IRQ_INPROGRESS))
 		goto out;
 
+	if ((desc->status & (IRQ_PENDING | IRQ_REPLAY)) ==
+	    IRQ_REPLAY) {
+		desc->chip->ack(irq);
+		edge = 1;
+	}
+
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;
 
@@ -421,7 +428,8 @@ handle_fasteoi_irq(unsigned int irq, str
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
 out:
-	desc->chip->eoi(irq);
+	if (!edge)
+		desc->chip->eoi(irq);
 
 	spin_unlock(&desc->lock);
 }
diff -Nurp 2.6.23-rc1-/kernel/irq/resend.c 2.6.23-rc1/kernel/irq/resend.c
--- 2.6.23-rc1-/kernel/irq/resend.c 2007-07-09 01:32:17.0 +0200
+++ 2.6.23-rc1/kernel/irq/resend.c  2007-08-08 20:44:14.0 +0200
@@ -57,14 +57,10 @@ void check_irq_resend(struct irq_desc *d
 {
 	unsigned int status = desc->status;
 
-	/*
-	 * Make sure the interrupt is enabled, before resending it:
-	 */
-	desc->chip->enable(irq);
-
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
 
+		WARN_ON_ONCE(1);
 		if (!desc->chip || !desc->chip->retrigger ||
 					!desc->chip->retrigger(irq)) {
 #ifdef CONFIG_HARDIRQS_SW_RESEND
@@ -74,4 +70,5 @@ void check_irq_resend(struct irq_desc *d
 #endif
 		}
 	}
+	desc->chip->enable(irq);
 }
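
The flag test added to handle_fasteoi_irq() above can be tried in isolation with the following user-space sketch. It is not kernel code: the `toy_` names and flag values are hypothetical, and the in-progress check, locking and the real device handler are left out. It only shows the intended effect: an invocation that carries the replay flag without the pending flag is treated as a resent, edge-like irq and gets an early ack instead of the final eoi.

```c
#include <stdbool.h>

/* Hypothetical flag values for this sketch only (not the kernel's). */
#define TOY_IRQ_PENDING 0x1u
#define TOY_IRQ_REPLAY  0x2u

/* Counts which controller operations the handler performed. */
struct toy_trace {
	int acks;
	int eois;
};

/* Mirrors the shape of the experimental handle_fasteoi_irq() change. */
static void toy_fasteoi(unsigned int *status, struct toy_trace *t)
{
	bool edge = false;

	/* REPLAY set, PENDING clear: this invocation is a resent irq. */
	if ((*status & (TOY_IRQ_PENDING | TOY_IRQ_REPLAY)) == TOY_IRQ_REPLAY) {
		t->acks++;      /* ack early, as an edge handler would */
		edge = true;
	}
	*status &= ~TOY_IRQ_REPLAY;

	/* ... the device handler would run here ... */

	if (!edge)
		t->eois++;      /* the normal level flow ends with eoi */
}
```

Called with a status of `TOY_IRQ_REPLAY`, the handler records one ack and no eoi; called with a status of 0 it records one eoi and no ack.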