Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time
On Fri, Aug 10, 2007 at 08:33:27AM +0200, Marcin Ślusarz wrote:
> 2007/8/9, Jarek Poplawski [EMAIL PROTECTED]:
> > ...
> > diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
> > --- 2.6.23-rc1-/kernel/irq/chip.c 2007-07-09 01:32:17.0 +0200
> > +++ 2.6.23-rc1/kernel/irq/chip.c 2007-08-08 20:49:07.0 +0200
> > @@ -389,12 +389,19 @@ handle_fasteoi_irq(unsigned int irq, str
> >  	unsigned int cpu = smp_processor_id();
> >  	struct irqaction *action;
> >  	irqreturn_t action_ret;
> > +	int edge = 0;
> ...
> NETDEV WATCHDOG: eth0: transmit timed out
> eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=351.
> eth0: Resetting the 8390 t=4295229000...<6>NETDEV WATCHDOG: eth0: transmit timed out
> eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=718.
> eth0: Resetting the 8390 t=429523...<6>NETDEV WATCHDOG: eth0: transmit timed out
> eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=874.
> etc...

So, we still have to wait for the exact explanation...
Thanks very much Marcin!

I think, there is this one possible for your testing yet?:

Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
Date: Wed, 8 Aug 2007 13:00:37 +0200

If it's not a great problem it would be interesting to try this with
different CONFIG_HZ too e.g. you could start with 100 (I guess, you
tested very similar thing in 2.6.23-rc2 with 1000(?) already).

Jean-Baptiste: you can skip/break testing of this 'experimental'
patch, too.

Regards,
Jarek P.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time
On Fri, Aug 10, 2007 at 12:43:43PM +0200, Marcin Ślusarz wrote:
> 2007/8/10, Jarek Poplawski [EMAIL PROTECTED]:
> > (..)
> > I think, there is this one possible for your testing yet?:
> >
> > Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> > Date: Wed, 8 Aug 2007 13:00:37 +0200
>
> I think I already tested this patch, but this thread is sooo big and I
> can't find my response...

I think it was Ingo's very similar patch, which after your testing is
in 2.6.23-rc2 now. I've moved the return for level type irqs a little
later, and it works the new way only for x86_64. But, I think, this
patch is less important now: if you find some time, try mostly Ingo's
or Thomas' new patches (latest possible versions).

> > If it's not a great problem it would be interesting to try this with
> > different CONFIG_HZ too e.g. you could start with 100 (I guess, you
> > tested very similar thing in 2.6.23-rc2 with 1000(?) already).
>
> My all tests were done on 2.6.22.1

Fine! If it's not said otherwise, this version should be appropriate
for most of these patches. But, please, send your current configs at
the next opportunity (dmesg, /proc/interrupts and .config).

Bye (till Monday),
Jarek P.
Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time
2007/8/10, Jarek Poplawski [EMAIL PROTECTED]:
> (..)
> I think, there is this one possible for your testing yet?:
>
> Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> Date: Wed, 8 Aug 2007 13:00:37 +0200

I think I already tested this patch, but this thread is sooo big and I
can't find my response...

> If it's not a great problem it would be interesting to try this with
> different CONFIG_HZ too e.g. you could start with 100 (I guess, you
> tested very similar thing in 2.6.23-rc2 with 1000(?) already).

My all tests were done on 2.6.22.1

Marcin
Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time
On Fri, Aug 10, 2007 at 11:08:33AM +0200, Ingo Molnar wrote:
> * Jarek Poplawski [EMAIL PROTECTED] wrote:
> > On 10-08-2007 10:05, Thomas Gleixner wrote:
> > ...
> > > But suppressing the resend is not fixing the driver problem. The
> > > problem can show up with spurious interrupts and with interrupts on
> > > a shared PCI interrupt line at any time. It just might take weeks
> > > instead of minutes.
> >
> > Maybe I miss something but it's not the same!
>
> _now_ i finally understand what you probably meant: because sw-resend
> worked and hw-resend didn't, it's hw-resend that is causing the
> breakage, not any driver or irqflow bug - correct?

All correct! We also checked the possibility that it's not the hw
itself, but a wrong way of handling after the hw resend (acking too
late). This was a false idea (or a bad implementation), so it looks
like a hw vs lapic problem.

Jarek P.
Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time
* Jarek Poplawski [EMAIL PROTECTED] wrote:
> On 10-08-2007 10:05, Thomas Gleixner wrote:
> ...
> > But suppressing the resend is not fixing the driver problem. The
> > problem can show up with spurious interrupts and with interrupts on
> > a shared PCI interrupt line at any time. It just might take weeks
> > instead of minutes.
>
> Maybe I miss something but it's not the same!

_now_ i finally understand what you probably meant: because sw-resend
worked and hw-resend didn't, it's hw-resend that is causing the
breakage, not any driver or irqflow bug - correct?

	Ingo
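The sw-resend versus hw-resend distinction discussed above can be sketched in user-space C. This is a toy model, not kernel code: the flag values, `pick_resend()`, and the `hw_ok`/`hw_fail` hooks are hypothetical names invented for illustration. Only the flag test is taken from the resend.c hunk quoted elsewhere in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; only the names follow the thread. */
#define IRQ_REPLAY  0x1u
#define IRQ_PENDING 0x2u

enum resend_path { RESEND_NONE, RESEND_HW, RESEND_SW };

/* Stand-in for a chip's retrigger hook: nonzero means hardware resent it. */
typedef int (*retrigger_fn)(unsigned int irq);

/*
 * Decide how a pending interrupt is replayed when its line is re-enabled.
 * A resend happens only if PENDING is set and REPLAY is not.
 */
static enum resend_path pick_resend(unsigned int *status, unsigned int irq,
                                    retrigger_fn retrigger, int sw_resend_on)
{
    assert(status != NULL);

    if ((*status & (IRQ_PENDING | IRQ_REPLAY)) != IRQ_PENDING)
        return RESEND_NONE;

    /* Mark the interrupt as being replayed. */
    *status = (*status & ~IRQ_PENDING) | IRQ_REPLAY;

    /* Prefer the hardware path (on x86, a self-IPI via the local APIC). */
    if (retrigger != NULL && retrigger(irq))
        return RESEND_HW;

    /* Fall back to software resend only if it was configured in. */
    return sw_resend_on ? RESEND_SW : RESEND_NONE;
}

/* Two fake hooks: one where hardware resend works, one where it does not. */
static int hw_ok(unsigned int irq)   { (void)irq; return 1; }
static int hw_fail(unsigned int irq) { (void)irq; return 0; }
```

Calling `pick_resend()` with `hw_ok` takes the hardware path, which is the one the thread suspects. With `hw_fail` (or no hook) and software resend enabled it takes the path that testers reported as working.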
Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time
> So, we still have to wait for the exact explanation...
> Thanks very much Marcin!
>
> I think, there is this one possible for your testing yet?:
>
> Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> Date: Wed, 8 Aug 2007 13:00:37 +0200
>
> If it's not a great problem it would be interesting to try this with
> different CONFIG_HZ too e.g. you could start with 100 (I guess, you
> tested very similar thing in 2.6.23-rc2 with 1000(?) already).
>
> Jean-Baptiste: you can skip/break testing of this 'experimental'

OK.

I was still testing on -rc2:

Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
Date: Wed, 8 Aug 2007 13:00:37 +0200

For me, after 1 day 20 hours the network is still up, with more than
1 TB of network traffic.

HZ was 1000; I am restarting with HZ=100.

Jb
Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time
On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote:
...
> I was still testing on -rc2:
>
> Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> Date: Wed, 8 Aug 2007 13:00:37 +0200
>
> For me after 1day 20hours, the network is still up, with more than 1To
> of network traffic.
>
> HZ was 1000, i restart with HZ=100.

For me it's enough too, but Thomas seems to doubt. You wrote earlier
that you have 2.6.23-rc1 with HARDIRQS_SW_RESEND prepared too. So, if
this is not a great problem, maybe you could try this first. Tomorrow
Thomas may send something, so this 100HZ test could wait for now, I
hope?

Many thanks,
Jarek P.
Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time
> For me it's enough too but Thomas seems to doubt. You've written
> earlier that you've 2.6.23-rc1 with HARDIRQS_SW_RESEND prepared too.
> So, if this is not a great problem maybe you could try this first.
> Tomorrow Thomas may send something, so this 100HZ could wait yet, I
> hope?

Ok, i'll test 2.6.23-rc1 with HARDIRQS_SW_RESEND first.

Jb
Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time
* Jarek Poplawski [EMAIL PROTECTED] wrote:
> On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote:
> ...
> > I was still testing on -rc2:
> >
> > Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> > Date: Wed, 8 Aug 2007 13:00:37 +0200
> >
> > For me after 1day 20hours, the network is still up, with more than
> > 1To of network traffic.
> >
> > HZ was 1000, i restart with HZ=100.
>
> For me it's enough too but Thomas seems to doubt.

seem to doubt what? That rc2 fixes the symptom? That is a sure thing,
and we never doubted that. I think you might have misunderstood what
Thomas said and meant, so please just state your opinion unambiguously
so that we can fix any mis-communication :)

	Ingo
Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time
On Fri, Aug 10, 2007 at 10:48:41AM +0200, Ingo Molnar wrote:
> * Jarek Poplawski [EMAIL PROTECTED] wrote:
> > On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote:
> > ...
> > > I was still testing on -rc2:
> > >
> > > Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
> > > Date: Wed, 8 Aug 2007 13:00:37 +0200
> > >
> > > For me after 1day 20hours, the network is still up, with more than
> > > 1To of network traffic.
> > >
> > > HZ was 1000, i restart with HZ=100.
> >
> > For me it's enough too but Thomas seems to doubt.
>
> seem to doubt what? That rc2 fixes the symptom? That is a sure thing,
> and we never doubted that. I think you might have misunderstood what
> Thomas said and meant, so please just state your opinion unambiguously
> so that we can fix any mis-communication :)
>
> 	Ingo

On 25-07-2007 02:19, Thomas Gleixner wrote:
...
> Actually we only need the resend for edge type interrupts. Level type
> interrupts come back once enable_irq() re-enables the interrupt line.

On 10-08-2007 10:05, Thomas Gleixner wrote:
...
> But suppressing the resend is not fixing the driver problem. The
> problem can show up with spurious interrupts and with interrupts on a
> shared PCI interrupt line at any time. It just might take weeks
> instead of minutes.

Maybe I'm missing something, but these two statements are not the
same!

So, should Jean-Baptiste or Marcin test this for weeks, or is it
enough?

Jarek P.
Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time
* Jarek Poplawski [EMAIL PROTECTED] wrote:
> All correct! There was also checked a possibility it can be not hw
> itself, but wrong way of handling after hw (acking too late). This was
> false idea (or bad implementation), so it looks like hw vs lapic
> problem.

i think the problem is that local APIC 'self vectors' might be
edge-triggered by default. I'm not exactly sure whether passing in
APIC_INT_LEVELTRIG to send_IPI_self() will truly be interpreted by the
local APIC into any external IO-APIC ACK sequence (the local APIC might
just treat self-vectors as always-edge) - and it might also be that the
pure act of mixing self-triggered vectors with level-triggered external
irqs sometimes confuses the IO-APIC - local-APIC messaging.

One more test of the patch below will tell us a bit more about this
part of the story.

	Ingo

Index: linux/arch/i386/kernel/io_apic.c
===
--- linux.orig/arch/i386/kernel/io_apic.c
+++ linux/arch/i386/kernel/io_apic.c
@@ -735,7 +735,8 @@ void fastcall send_IPI_self(int vector)
 	 * Wait for idle.
 	 */
 	apic_wait_icr_idle();
-	cfg = APIC_DM_FIXED | APIC_DEST_SELF | vector | APIC_DEST_LOGICAL;
+	cfg = APIC_DM_FIXED | APIC_DEST_SELF | vector | APIC_DEST_LOGICAL |
+					APIC_INT_LEVELTRIG;
 	/*
 	 * Send the IPI. The write to APIC_ICR fires this off.
 	 */
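For readers following the patch, here is a small sketch of how the ICR command word is composed and which single bit the patch adds. The constant values are written from memory of the i386 apicdef.h header of that era and should be treated as assumptions to verify against the tree being tested. `self_ipi_cfg()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values; check against include/asm-i386/apicdef.h. */
#define APIC_DM_FIXED       0x00000u  /* delivery mode: fixed */
#define APIC_DEST_LOGICAL   0x00800u  /* destination mode: logical */
#define APIC_INT_LEVELTRIG  0x08000u  /* trigger mode: level */
#define APIC_DEST_SELF      0x40000u  /* destination shorthand: self */

/* Build the low ICR word for a self-IPI, with or without the level bit. */
static uint32_t self_ipi_cfg(uint8_t vector, int level_triggered)
{
    uint32_t cfg;

    /* Vectors below 16 are not valid for fixed-mode APIC delivery. */
    assert(vector >= 0x10);

    cfg = APIC_DM_FIXED | APIC_DEST_SELF | vector | APIC_DEST_LOGICAL;
    if (level_triggered)
        cfg |= APIC_INT_LEVELTRIG;  /* the only change the patch makes */
    return cfg;
}
```

Under these assumed values, the old and new command words for the same vector differ only in bit 15. Whether the local APIC honours that bit for self-IPIs is exactly the open question in the message above.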
[patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time
On Wed, Aug 08, 2007 at 01:42:43PM +0200, Jarek Poplawski wrote:
> Read below please:
>
> On Wed, Aug 08, 2007 at 01:09:36PM +0200, Marcin Ślusarz wrote:
> > 2007/8/7, Jarek Poplawski [EMAIL PROTECTED]:
> > > So, the let's try this idea yet: modified Ingo's x86: activate
> > > HARDIRQS_SW_RESEND patch.
> > > (Don't forget about make oldconfig before make.)
> > > For testing only.
> ...
> > > diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
> > > --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.0 +0200
> > > +++ 2.6.22.1/arch/i386/Kconfig 2007-08-07 13:13:03.0 +0200
> > > @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
> > >  	depends on GENERIC_HARDIRQS && SMP
> > >  	default y
> > >
> > > +config HARDIRQS_SW_RESEND
> ...
> > Works fine with:
>
> Very nice! It would be about time this kernel should start behave...
>
> > WARNING: at kernel/irq/resend.c:79 check_irq_resend()
> >
> > Call Trace:
> ...
> So, it looks like x86_64 io_apic's IPI code was unused too long...
> I hope it's a piece of cake for Ingo now...

So, we now know it's almost definitely something about lapic and IPIs,
but maybe this code is not the one to blame...

Here is one more patch, to check the possibility that the problem is in
the way resent edge type irqs are handled by level type handlers: so,
let's check whether the acking comes too late...

Marcin and Jean-Baptiste: I would be very glad, as usual! And no need
to hurry; I think we know enough to fix this for you, but maybe this
test could explain whether there are errors in lapics or only bad
handling.

Many thanks,
Jarek P.

PS: this patch is very experimental, and only intended for testing.
It should be applied to clean 2.6.23-rc1 or a bit older (eg. 2.6.22)
(so 2.6.23-rc2 or any patches from this thread shouldn't be around)

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.0 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-08 20:49:07.0 +0200
@@ -389,12 +389,19 @@ handle_fasteoi_irq(unsigned int irq, str
 	unsigned int cpu = smp_processor_id();
 	struct irqaction *action;
 	irqreturn_t action_ret;
+	int edge = 0;
 
 	spin_lock(&desc->lock);
 
 	if (unlikely(desc->status & IRQ_INPROGRESS))
 		goto out;
 
+	if ((desc->status & (IRQ_PENDING | IRQ_REPLAY)) ==
+	    IRQ_REPLAY) {
+		desc->chip->ack(irq);
+		edge = 1;
+	}
+
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;
 
@@ -421,7 +428,8 @@ handle_fasteoi_irq(unsigned int irq, str
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
 out:
-	desc->chip->eoi(irq);
+	if (!edge)
+		desc->chip->eoi(irq);
 
 	spin_unlock(&desc->lock);
 }
diff -Nurp 2.6.23-rc1-/kernel/irq/resend.c 2.6.23-rc1/kernel/irq/resend.c
--- 2.6.23-rc1-/kernel/irq/resend.c	2007-07-09 01:32:17.0 +0200
+++ 2.6.23-rc1/kernel/irq/resend.c	2007-08-08 20:44:14.0 +0200
@@ -57,14 +57,10 @@ void check_irq_resend(struct irq_desc *d
 {
 	unsigned int status = desc->status;
 
-	/*
-	 * Make sure the interrupt is enabled, before resending it:
-	 */
-	desc->chip->enable(irq);
-
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
 
+		WARN_ON_ONCE(1);
 		if (!desc->chip || !desc->chip->retrigger ||
 					!desc->chip->retrigger(irq)) {
 #ifdef CONFIG_HARDIRQS_SW_RESEND
@@ -74,4 +70,5 @@ void check_irq_resend(struct irq_desc *d
 #endif
 		}
 	}
+	desc->chip->enable(irq);
 }
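The control flow that the chip.c part of this test patch introduces can be traced with a trimmed user-space model. The flag values, `struct toy_counts` and `toy_fasteoi()` are hypothetical and exist only for this sketch. Locking, the action lookup, and the disabled-irq checks of the real handler are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; only the names follow the patch. */
#define IRQ_INPROGRESS 0x1u
#define IRQ_REPLAY     0x2u
#define IRQ_PENDING    0x4u
#define IRQ_WAITING    0x8u

/* Counters standing in for the chip's ack/eoi hooks and the handler. */
struct toy_counts { int acks; int eois; int handled; };

static void toy_fasteoi(unsigned int *status, struct toy_counts *c)
{
    int edge = 0;

    assert(status != NULL && c != NULL);

    if (*status & IRQ_INPROGRESS)
        goto out;

    /* A replayed interrupt is acked up front, as an edge irq would be. */
    if ((*status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_REPLAY) {
        c->acks++;
        edge = 1;
    }

    *status &= ~(IRQ_REPLAY | IRQ_WAITING);
    c->handled++;
out:
    /* The eoi at the end is skipped for the replayed case. */
    if (!edge)
        c->eois++;
}
```

In this model a call with only `IRQ_REPLAY` set produces one ack and no eoi, while a normal call produces one eoi and no ack. That is the early-versus-late acknowledgement difference the test is meant to probe.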