Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 08:33:27AM +0200, Marcin Ślusarz wrote:
 2007/8/9, Jarek Poplawski [EMAIL PROTECTED]:
...
  diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
  --- 2.6.23-rc1-/kernel/irq/chip.c   2007-07-09 01:32:17.0 +0200
  +++ 2.6.23-rc1/kernel/irq/chip.c    2007-08-08 20:49:07.0 +0200
  @@ -389,12 +389,19 @@ handle_fasteoi_irq(unsigned int irq, str
  unsigned int cpu = smp_processor_id();
  struct irqaction *action;
  irqreturn_t action_ret;
  +   int edge = 0;
 
...
 NETDEV WATCHDOG: eth0: transmit timed out
 eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=351.
 eth0: Resetting the 8390 t=4295229000...6NETDEV WATCHDOG: eth0:
 transmit timed out
 eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=718.
 eth0: Resetting the 8390 t=429523...6NETDEV WATCHDOG: eth0:
 transmit timed out
 eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=874.
 etc...

So, we still have to wait for the exact explanation...

Thanks very much Marcin!

I think, there is this one possible for your testing yet?:
Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
Date: Wed, 8 Aug 2007 13:00:37 +0200

If it's not a great problem it would be interesting to try this with
different CONFIG_HZ too e.g. you could start with 100 (I guess, you
tested very similar thing in 2.6.23-rc2 with 1000(?) already).

Jean-Baptiste: you can skip/break testing of this 'experimental'
patch, too.

Regards,
Jarek P.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 12:43:43PM +0200, Marcin Ślusarz wrote:
 2007/8/10, Jarek Poplawski [EMAIL PROTECTED]:
  (..)
  I think, there is this one possible for your testing yet?:
  Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
  Date: Wed, 8 Aug 2007 13:00:37 +0200
 I think I already tested this patch, but this thread is sooo big and I
 can't find my response...

I think that was Ingo's very similar patch, which, after your testing,
is in 2.6.23-rc2 now. In mine I've moved the return for level type irqs
a little later, and it works the new way only for x86_64.

But I think this patch is less important now: if you find some time,
try mainly Ingo's or Thomas' newest patches (the latest possible versions).

 
  If it's not a great problem it would be interesting to try this with
  different CONFIG_HZ too e.g. you could start with 100 (I guess, you
  tested very similar thing in 2.6.23-rc2 with 1000(?) already).
 My all tests were done on 2.6.22.1

Fine! Unless stated otherwise, this version should be appropriate for
most of these patches. But please send your current configs at the next
opportunity (dmesg, /proc/interrupts and .config).

Bye (till monday),
Jarek P.


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Marcin Ślusarz
2007/8/10, Jarek Poplawski [EMAIL PROTECTED]:
 (..)
 I think, there is this one possible for your testing yet?:
 Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
 Date: Wed, 8 Aug 2007 13:00:37 +0200
I think I already tested this patch, but this thread is sooo big and I
can't find my response...

 If it's not a great problem it would be interesting to try this with
 different CONFIG_HZ too e.g. you could start with 100 (I guess, you
 tested very similar thing in 2.6.23-rc2 with 1000(?) already).
My all tests were done on 2.6.22.1

Marcin


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 11:08:33AM +0200, Ingo Molnar wrote:
 
 * Jarek Poplawski [EMAIL PROTECTED] wrote:
 
  On 10-08-2007 10:05, Thomas Gleixner wrote:
  ...
   But suppressing the resend is not fixing the driver problem. The 
   problem can show up with spurious interrupts and with interrupts on 
   a shared PCI interrupt line at any time. It just might take weeks 
   instead of minutes.
  
  Maybe I miss something but it's not the same!
 
 _now_ i finally understand what you probably meant: because sw-resend 
 worked and hw-resend didnt, it's hw-resend that is causing the breakage, 
 not any driver or irqflow bug - correct?

All correct! We also checked the possibility that it's not the hw
itself, but a wrong way of handling after the hw (acking too late). That
was a false idea (or a bad implementation), so it looks like a hw vs
lapic problem.

Jarek P.


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Ingo Molnar

* Jarek Poplawski [EMAIL PROTECTED] wrote:

 On 10-08-2007 10:05, Thomas Gleixner wrote:
 ...
  But suppressing the resend is not fixing the driver problem. The 
  problem can show up with spurious interrupts and with interrupts on 
  a shared PCI interrupt line at any time. It just might take weeks 
  instead of minutes.
 
 Maybe I miss something but it's not the same!

_now_ i finally understand what you probably meant: because sw-resend 
worked and hw-resend didnt, it's hw-resend that is causing the breakage, 
not any driver or irqflow bug - correct?

Ingo


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jean-Baptiste Vignaud
 So, we still have to wait for the exact explanation...
 
 Thanks very much Marcin!
 
 I think, there is this one possible for your testing yet?:
 Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
 Date: Wed, 8 Aug 2007 13:00:37 +0200
 
 If it's not a great problem it would be interesting to try this with
 different CONFIG_HZ too e.g. you could start with 100 (I guess, you
 tested very similar thing in 2.6.23-rc2 with 1000(?) already).
 
 Jean-Baptiste: you can skip/break testing of this 'experimental'

ok

I was still testing on -rc2:
Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
Date: Wed, 8 Aug 2007 13:00:37 +0200

For me after 1day 20hours, the network is still up, with more than 1To 
of network traffic. HZ was 1000, i restart with HZ=100.

Jb




Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote:
...
 I was still testing on -rc2:
 Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
 Date: Wed, 8 Aug 2007 13:00:37 +0200
 
 For me after 1day 20hours, the network is still up, with more than 1To
 of network traffic. HZ was 1000, i restart with HZ=100.

For me it's enough too but Thomas seems to doubt.

You've written earlier that you've 2.6.23-rc1 with HARDIRQS_SW_RESEND
prepared too. So, if this is not a great problem maybe you could try
this first. Tomorrow Thomas may send something, so this 100HZ could
wait yet, I hope?

Many thanks,
Jarek P.


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jean-Baptiste Vignaud
 For me it's enough too but Thomas seems to doubt.
 
 You've written earlier that you've 2.6.23-rc1 with HARDIRQS_SW_RESEND
 prepared too. So, if this is not a great problem maybe you could try
 this first. Tomorrow Thomas may send something, so this 100HZ could
 wait yet, I hope?

Ok, i'll test 2.6.23-rc1 with HARDIRQS_SW_RESEND first.

Jb



Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Ingo Molnar

* Jarek Poplawski [EMAIL PROTECTED] wrote:

 On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote:
 ...
  I was still testing on -rc2:
  Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
  Date: Wed, 8 Aug 2007 13:00:37 +0200
  
  For me after 1day 20hours, the network is still up, with more than 
  1To of network traffic. HZ was 1000, i restart with HZ=100.
 
 For me it's enough too but Thomas seems to doubt.

seem to doubt what? That rc2 fixes the symptom? That is a sure thing, 
and we never doubted that. I think you might have misunderstood what 
Thomas said and meant, so please just state your opinion unambiguously 
so that we can fix any mis-communication :)

Ingo


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Jarek Poplawski
On Fri, Aug 10, 2007 at 10:48:41AM +0200, Ingo Molnar wrote:
 
 * Jarek Poplawski [EMAIL PROTECTED] wrote:
 
  On Fri, Aug 10, 2007 at 10:15:53AM +0200, Jean-Baptiste Vignaud wrote:
  ...
   I was still testing on -rc2:
   Subject: [patch] genirq: temporary fix for level-triggered IRQ resend
   Date: Wed, 8 Aug 2007 13:00:37 +0200
   
   For me after 1day 20hours, the network is still up, with more than 
   1To of network traffic. HZ was 1000, i restart with HZ=100.
  
  For me it's enough too but Thomas seems to doubt.
 
 seem to doubt what? That rc2 fixes the symptom? That is a sure thing, 
 and we never doubted that. I think you might have misunderstood what 
 Thomas said and meant, so please just state your opinion unambiguously 
 so that we can fix any mis-communication :)
 
   Ingo
 


On 25-07-2007 02:19, Thomas Gleixner wrote:
...
 Actually we only need the resend for edge type interrupts. Level type
 interrupts come back once enable_irq() re-enables the interrupt line.


On 10-08-2007 10:05, Thomas Gleixner wrote:
...
 But suppressing the resend is not fixing the driver problem. The problem
 can show up with spurious interrupts and with interrupts on a shared PCI
 interrupt line at any time. It just might take weeks instead of minutes.

Maybe I miss something but it's not the same!

So, should Jean-Baptiste or Marcin test this for weeks or it's enough?

Jarek P.


Re: [patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-10 Thread Ingo Molnar

* Jarek Poplawski [EMAIL PROTECTED] wrote:

 All correct! We also checked the possibility that it's not the hw
 itself, but a wrong way of handling after the hw (acking too late). That
 was a false idea (or a bad implementation), so it looks like a hw vs
 lapic problem.

i think the problem is that local APIC 'self vectors' might be 
edge-triggered by default. I'm not exactly sure whether passing in 
APIC_INT_LEVELTRIG to send_IPI_self() will truly be interpreted by the 
local APIC into any external IO-APIC ACK sequence (the local APIC might 
just treat self-vectors as always-edge) - and it might also be that the 
pure act of mixing self-triggered vectors with level-triggered external 
irqs sometimes confuses the IO-APIC - local-APIC messaging. One more 
test of the patch below will tell us a bit more about this part of the 
story.

Ingo

Index: linux/arch/i386/kernel/io_apic.c
===================================================================
--- linux.orig/arch/i386/kernel/io_apic.c
+++ linux/arch/i386/kernel/io_apic.c
@@ -735,7 +735,8 @@ void fastcall send_IPI_self(int vector)
         * Wait for idle.
         */
        apic_wait_icr_idle();
-       cfg = APIC_DM_FIXED | APIC_DEST_SELF | vector | APIC_DEST_LOGICAL;
+       cfg = APIC_DM_FIXED | APIC_DEST_SELF | vector | APIC_DEST_LOGICAL |
+                       APIC_INT_LEVELTRIG;
        /*
         * Send the IPI. The write to APIC_ICR fires this off.
         */


[patch (testing)] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-09 Thread Jarek Poplawski
On Wed, Aug 08, 2007 at 01:42:43PM +0200, Jarek Poplawski wrote:
 Read below please:
 
 On Wed, Aug 08, 2007 at 01:09:36PM +0200, Marcin Ślusarz wrote:
  2007/8/7, Jarek Poplawski [EMAIL PROTECTED]:
   So, the let's try this idea yet: modified Ingo's x86: activate
   HARDIRQS_SW_RESEND patch.
   (Don't forget about make oldconfig before make.)
   For testing only.
...
   diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
   --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.0 +0200
   +++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.0 +0200
   @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
   depends on GENERIC_HARDIRQS && SMP
   default y
  
   +config HARDIRQS_SW_RESEND
...
  Works fine with:
 
 Very nice! It would be about time this kernel should start behave...
 
  WARNING: at kernel/irq/resend.c:79 check_irq_resend()
  
  Call Trace:
...
 So, it looks like x86_64 io_apic's IPI code was unused too long...
 I hope it's a piece of cake for Ingo now...

So, we know now it's almost definitely something about lapic and IPIs
but, maybe it's not this code to blame...

Here is one more patch, to check the possibility that it's about the way
the resent edge type irqs are handled by level type handlers:
so, let's check whether the acking comes too late...

Marcin and Jean-Baptiste: I would be very glad, as usual! And no need
to hurry; I think we know enough to fix this for you, but maybe this
test could explain if there are errors in lapics or only bad handling.

Many thanks,
Jarek P. 

PS: this patch is very experimental, and only intended for testing.
It should be applied to clean 2.6.23-rc1 or a bit older (eg. 2.6.22)
(so 2.6.23-rc2 or any patches from this thread shouldn't be around) 

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c   2007-07-09 01:32:17.0 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c    2007-08-08 20:49:07.0 +0200
@@ -389,12 +389,19 @@ handle_fasteoi_irq(unsigned int irq, str
        unsigned int cpu = smp_processor_id();
        struct irqaction *action;
        irqreturn_t action_ret;
+       int edge = 0;
 
        spin_lock(&desc->lock);
 
        if (unlikely(desc->status & IRQ_INPROGRESS))
                goto out;
 
+       if ((desc->status & (IRQ_PENDING | IRQ_REPLAY)) ==
+                               IRQ_REPLAY) {
+               desc->chip->ack(irq);
+               edge = 1;
+       }
+
        desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
        kstat_cpu(cpu).irqs[irq]++;
 
@@ -421,7 +428,8 @@ handle_fasteoi_irq(unsigned int irq, str
        spin_lock(&desc->lock);
        desc->status &= ~IRQ_INPROGRESS;
 out:
-       desc->chip->eoi(irq);
+       if (!edge)
+               desc->chip->eoi(irq);
 
        spin_unlock(&desc->lock);
 }
diff -Nurp 2.6.23-rc1-/kernel/irq/resend.c 2.6.23-rc1/kernel/irq/resend.c
--- 2.6.23-rc1-/kernel/irq/resend.c 2007-07-09 01:32:17.0 +0200
+++ 2.6.23-rc1/kernel/irq/resend.c  2007-08-08 20:44:14.0 +0200
@@ -57,14 +57,10 @@ void check_irq_resend(struct irq_desc *d
 {
        unsigned int status = desc->status;
 
-       /*
-        * Make sure the interrupt is enabled, before resending it:
-        */
-       desc->chip->enable(irq);
-
        if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
                desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
 
+               WARN_ON_ONCE(1);
                if (!desc->chip || !desc->chip->retrigger ||
                                        !desc->chip->retrigger(irq)) {
 #ifdef CONFIG_HARDIRQS_SW_RESEND
@@ -74,4 +70,5 @@ void check_irq_resend(struct irq_desc *d
 #endif
                }
        }
+       desc->chip->enable(irq);
 }
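
The two status tests in this patch are easy to mix up, so here is a minimal user-space sketch of the flag transitions only: "pending but not yet replayed" triggers a resend and flips the state to "replay", and a handler can then recognise a resent irq by "replay without pending". The function names and the bit values are made up for illustration (the real values live in the kernel headers); nothing here touches hardware or locking.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit values only, not the kernel's. */
#define IRQ_PENDING_SKETCH 0x1u
#define IRQ_REPLAY_SKETCH  0x2u
#define RESEND_MASK_SKETCH (IRQ_PENDING_SKETCH | IRQ_REPLAY_SKETCH)

/*
 * Models the test in check_irq_resend(): returns 1 and moves the
 * status from PENDING to REPLAY if a resend should be issued.
 */
int resend_needed(unsigned int *status)
{
    assert(status != NULL);

    if ((*status & RESEND_MASK_SKETCH) == IRQ_PENDING_SKETCH) {
        *status = (*status & ~IRQ_PENDING_SKETCH) | IRQ_REPLAY_SKETCH;
        return 1;
    }
    return 0;
}

/*
 * Models the test added to handle_fasteoi_irq(): true only for an
 * irq that arrived as a resend (REPLAY set, PENDING clear).
 */
int is_replayed(unsigned int status)
{
    return (status & RESEND_MASK_SKETCH) == IRQ_REPLAY_SKETCH;
}
```

Calling resend_needed() a second time on the same status returns 0, which mirrors how the REPLAY flag prevents a second resend until the handler clears it.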


[RFC] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-09 Thread Jarek Poplawski

It seems, we can start to think about some preferred solutions,
already. Here are some of my preliminary conclusions and suggestions.

The problem of timeouts with some 'older' network cards seems to hit
mainly x86_64 arch, and after diagnosing and testing (still being
done) it's caused by resending level type irqs.

Possible solutions:

1. Reverting the whole "do not mask interrupts by default" patch:

Seems to be the most natural, but there are some minuses: this patch
has been here for a long time and some archs/drivers could have been
changed in the meantime and depend on this; this could also remove
possible good sides of this patch for which it was aimed (mainly for
edge type irqs), so some old problems could be back.


2. Excluding all level type irqs from resending. There are 2 ways:

a) Just like it is done now in -rc2 with Thomas' patch, which AFAIK
seems to work very well; finding the type of an interrupt by
the name of its handler looks a bit 'temporary', but it's simple and
working; it could be moved a bit and put under #ifdef CONFIG_, so
that it affects only the chosen ones.

b) Alternatively this could be done by e.g. assigning separate
irq_chip structures for edge and level handlers, so that for levels
chip->retrigger == NULL; this could then be done by arches
without any need for 'global' changes. The only minus here is a few
more bytes of memory; on the other hand it looks more readable and
flexible.

3. Using HARDIRQS_SW_RESEND for level type irqs:
seems to work, but needs more testing; but if #2 is theoretically
and practically OK, why bother. This also doesn't need any global
changes.

I prefer 2b + 3: since this is a very delicate measure, I think there
should be a visible CONFIG_ option: at least for disabling the
chip->retrigger handler if used by an arch, and maybe for using
_SW_RESEND instead, as well.
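
Option 2b can be pictured with a minimal user-space sketch: two chip descriptors that differ only in whether they carry a retrigger callback, and a resend helper that simply does nothing in hardware when the callback is NULL. All names here (chip_sketch, try_hw_resend and so on) are hypothetical stand-ins, not the kernel's irq_chip API, and the retrigger stub only pretends to send an IPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for an irq chip descriptor. */
struct chip_sketch {
    const char *name;
    int (*retrigger)(unsigned int irq);  /* NULL: no hw resend */
};

/* Stub: a real one would send an IPI; this only reports success. */
int retrigger_stub(unsigned int irq)
{
    (void)irq;
    return 1;
}

/* Edge chip keeps the hw resend, level chip leaves it out. */
const struct chip_sketch edge_chip_sketch  = { "edge",  retrigger_stub };
const struct chip_sketch level_chip_sketch = { "level", NULL };

/*
 * Returns 1 if a hw resend was issued, 0 if the chip has none and
 * the caller has to skip it or fall back to a software resend.
 */
int try_hw_resend(const struct chip_sketch *chip, unsigned int irq)
{
    assert(chip != NULL);

    if (!chip->retrigger)
        return 0;
    return chip->retrigger(irq);
}
```

The cost is the second descriptor per chip type, which is the "few more bytes of memory" mentioned above; in exchange the generic resend code needs no knowledge of edge versus level.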

It looks like these changes are needed for this x86_64 only, but
in my opinion some (good?!) apics could sometimes work OK with this
level resending only by chance. Theoretically it's questionable:
since the edge type irqs used for this should be resent if not acked, and
the level handlers' ack time depends on how long drivers do their job,
strange and hard to repeat effects are possible, for no reason
(if these levels can really always be skipped without any problem,
as seen in testing).

Then of course it would be easier to do this 'globally' with 2a:
skipping chip->retrigger by default (but under #ifdef), namely for
handle_fasteoi_irq, handle_level_irq and handle_simple_irq;
handle_edge_irq plus the others are always tried.
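
As a minimal sketch of the 2a idea, the decision can be made by comparing the flow handler pointer against the known level-style handlers. The handlers below are empty user-space stand-ins that only borrow the kernel names, and hw_resend_allowed() is a hypothetical helper, not existing kernel code.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*flow_handler_t)(unsigned int irq);

/* Empty stand-ins; only their addresses matter in this sketch. */
void handle_edge_irq(unsigned int irq)    { (void)irq; }
void handle_level_irq(unsigned int irq)   { (void)irq; }
void handle_fasteoi_irq(unsigned int irq) { (void)irq; }
void handle_simple_irq(unsigned int irq)  { (void)irq; }

/*
 * Hypothetical helper: returns 0 for the handlers listed above as
 * "skip chip->retrigger", 1 for handle_edge_irq and anything else.
 */
int hw_resend_allowed(flow_handler_t handler)
{
    assert(handler != NULL);

    if (handler == handle_fasteoi_irq ||
        handler == handle_level_irq ||
        handler == handle_simple_irq)
        return 0;
    return 1;
}
```

This keeps the change in one generic place, at the price of tying the resend policy to handler identity, which is why it looks a bit 'temporary'.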

I hope Ingo or Thomas will use some of these, or let us know
about something else which should be tested for final inclusion.

Regards,
Jarek P.


Re: [RFC] Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-09 Thread Jarek Poplawski
On Thu, Aug 09, 2007 at 06:04:34PM +0200, Andi Kleen wrote:
 Jarek Poplawski [EMAIL PROTECTED] writes:
 
  It seems, we can start to think about some preferred solutions,
  already. Here are some of my preliminary conclusions and suggestions.
  
  The problem of timeouts with some 'older' network cards seems to hit
  mainly x86_64 arch, and after diagnosing and testing (still being
  done) it's caused by resending level type irqs.
 
 i386 interrupt code should be similar, except for the lack of 
 per CPU irqs, but that shouldn't affect resending.
 
  
  Possible solutions:
 
 We should probably at least add some statistic counters to the
 standard kernel to try to detect these cases.
 
  It looks like these changes are needed for this x86_64 only,
 
 Why?

Maybe I missed something, but considering the popularity of i386
there was not so much of consistent reporting?! I was very surprised,
when I read a few days ago that Linus seems to think that this one
here is only an individual problem...

Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-08 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 07:16:33PM +0200, Jean-Baptiste Vignaud wrote:
...
 So this afternoon i compiled 2.6.23-rc2 with same options as 2.6.23-rc1
 and edited grub.conf to add nosmp but after reboot the box did not
 responded. Back home, i saw that the kernel failed because it was unable
 to find the partitions (mdadm failed, then LVM). After a few tests,
 removing nosmp let the kernel boot correctly. It seems that even the
 fedora provided kernels have the same behavior
 (well at least 2.6.22.1-41.fc7).

Sorry: it seems there is some implementation error or some modules
don't check CONFIG_SMP enough...

Of course testing this with smp should be precious too.
Only, after finding some problems, you should consider smp is quite
a new and complicated technology, at least regarding such old designs
as 3c905.

BTW: I didn't notice this yesterday, but your forcedeth uses new type
of irq handling (MSI), so it should explain why it's not affected.

Jean-Baptiste: I'm not sure how much of this testing you can afford?
If you can spare some time for this and your box isn't for
'production' it could be very precious to diagnose such reproducible
bug.

Then, I'd have a few suggestions (you could choose any of them) like:
- trying these last test patches prepared for Marcin, too (but only
with kernels 2.6.21 - 2.6.23-rc1),
- trying to find the last kernel version, which works for you:
Marcin has done this with successfully using the most professional
way: git bisect (which btw. I did learn yet), but, IMHO, it could be
very useful to try a poor man's bisect too, with older kernels, like this:
2.6.18 (so try again the version from the previous Fedora, but
preferably as a vanilla kernel; there could be some problems if
something in your configs or hardware has changed); then, if OK,
2.6.20; if OK, 2.6.21-rc1 or -rc2 (there are usually heavy changes
at the beginning of a cycle); then try to jump forward or backward
to around the middle of the range, e.g. -rc4. Each time you should use
the same, current config and remember to 'make oldconfig' before make.
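
The 'poor man's bisect' is just a binary search over an ordered list of kernel versions, and can be sketched as below. This is a sketch under assumptions: the versions are indexed oldest to newest, index 0 is known to work, the last index is known to fail, and the bug never disappears again in between. first_bad_version(), fake_tester and fake_works() are hypothetical names; in real life the callback is you, building and booting a kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Callback: nonzero if the kernel at position idx works. */
typedef int (*works_fn)(size_t idx, void *ctx);

/*
 * Returns the index of the first failing version, assuming
 * version 0 works and version n-1 fails.
 */
size_t first_bad_version(size_t n, works_fn works, void *ctx)
{
    size_t good, bad;

    assert(n >= 2 && works != NULL);

    good = 0;      /* newest version known to work */
    bad = n - 1;   /* oldest version known to fail */
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        if (works(mid, ctx))
            good = mid;
        else
            bad = mid;
    }
    return bad;
}

/* Illustration only: a fake tester that also counts the builds. */
struct fake_tester {
    size_t first_bad;
    unsigned int builds;
};

int fake_works(size_t idx, void *ctx)
{
    struct fake_tester *t = ctx;

    t->builds++;
    return idx < t->first_bad;
}
```

For a list of seven versions this needs at most three builds between the two known end points, which is why jumping to the middle of the range each time pays off.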

In my opinion it would be very precious even after some long time,
so there is no need to hurry and do this now. The most important:
if nothing has changed with your hardware in the meantime, you
should find 'the culprit' for sure.

But, if there are any problems about such testing, don't bother!
It could be really a lot of hard and maybe boring work.

If you would like to read something more about testing (then of
course my suggestions could occur invalid - I'm a very bad tester
myself...) you can try this:
http://www.stardust.webpages.pl/files/handbook/

If you need some additional advice you can mail me privately
too (but my response could take a few days). Of course, if you find
something interesting I'd be glad to know about it, but let's be
honest - I'm not any authority on these drivers, so cc-ing a
maintainer should always be more useful.

Thanks,
Jarek P.

PS: it would be nice if you could fix your mail program on line
breaking (or try to do this manually).



Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-08 Thread Jarek Poplawski
On Wed, Aug 08, 2007 at 09:21:14AM +0200, Jarek Poplawski wrote:
 On Tue, Aug 07, 2007 at 07:16:33PM +0200, Jean-Baptiste Vignaud wrote:
...
 Marcin has done this with successfully using the most professional
 way: git bisect (which btw. I did learn yet), but, IMHO, it could be
...
Let me say this slow and distinctly: I didn't learn yet! (Shame on me!)
Sorry for these misspelings here and there...

Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-08 Thread Jean-Baptiste Vignaud
 Jean-Baptiste: I'm not sure how much of this testing you can afford?
 If you can spare some time for this and your box isn't for
 'production' it could be very precious to diagnose such reproducible
 bug.

Well i can continue testing patches for sure.

 Then, I'd have a few suggestions (you could choose any of them) like:
 - trying these last test patches prepared for Marcin, too (but only
 with kernels 2.6.21 - 2.6.23-rc1),

I'v patched 2.6.23-rc2 with those patches yesterday evening, and
launched samba copy. 
Is rc2 ok ?

This morning the network is still up :
RX bytes:279853499958 (260.6 GiB)  TX bytes:7416695531 (6.9 GiB)

Still testing.

 If you would like to read something more about testing (then of
 course my suggestions could occur invalid - I'm a very bad tester
 myself...) you can try this:
 http://www.stardust.webpages.pl/files/handbook/

I'll have a look at the document. 

Jb




Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-08 Thread Jarek Poplawski
On Wed, Aug 08, 2007 at 10:59:22AM +0200, Jean-Baptiste Vignaud wrote:
  Jean-Baptiste: I'm not sure how much of this testing you can afford?
  If you can spare some time for this and your box isn't for
  'production' it could be very precious to diagnose such reproducible
  bug.
 
 Well i can continue testing patches for sure.

Great!

 
  Then, I'd have a few suggestions (you could choose any of them) like:
  - trying these last test patches prepared for Marcin, too (but only
  with kernels 2.6.21 - 2.6.23-rc1),
 
 I'v patched 2.6.23-rc2 with those patches yesterday evening, and
 launched samba copy.
 Is rc2 ok ?

Yes! Mostly... 2.6.23-rc2 has a temporary patch applied, which should
work by itself (at least it works for Marcin). So, it's very good news
that it works for you too. But, as a matter of fact, the other patches
(I hope you mean yesterday's two) are probably not used very
much (the last one could do some work, but with other irqs).

So, it would be interesting to try them with e.g. 2.6.23-rc1, but not
together. (A reminder: after applying such a patch, make oldconfig,
make and so on, plus testing, you can revert it with the same command
you used to patch plus the -R option, e.g.: patch -p1 -R < ../patch1.diff,
to save some time on restoring a 'vanilla' kernel version.)
The aim of these newer patches is to find out why exactly the patch in
-rc2 works...

Cheers,
Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-08 Thread Marcin Ślusarz
2007/8/7, Jarek Poplawski [EMAIL PROTECTED]:
 So, the let's try this idea yet: modified Ingo's x86: activate
 HARDIRQS_SW_RESEND patch.
 (Don't forget about make oldconfig before make.)
 For testing only.

 Cheers,
 Jarek P.

 PS: alas there was not even time for compile checking...

 ---

 diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
 --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.0 +0200
 +++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.0 +0200
 @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
         depends on GENERIC_HARDIRQS && SMP
         default y

 +config HARDIRQS_SW_RESEND
 +       bool
 +       default y
 +
  config X86_SMP
         bool
         depends on SMP && !X86_VOYAGER
 diff -Nurp 2.6.22.1-/arch/x86_64/Kconfig 2.6.22.1/arch/x86_64/Kconfig
 --- 2.6.22.1-/arch/x86_64/Kconfig   2007-07-09 01:32:17.0 +0200
 +++ 2.6.22.1/arch/x86_64/Kconfig    2007-08-07 13:13:03.0 +0200
 @@ -690,6 +690,10 @@ config GENERIC_PENDING_IRQ
         depends on GENERIC_HARDIRQS && SMP
         default y

 +config HARDIRQS_SW_RESEND
 +       bool
 +       default y
 +
  menu "Power management options"

  source kernel/power/Kconfig
 diff -Nurp 2.6.22.1-/kernel/irq/manage.c 2.6.22.1/kernel/irq/manage.c
 --- 2.6.22.1-/kernel/irq/manage.c   2007-07-09 01:32:17.0 +0200
 +++ 2.6.22.1/kernel/irq/manage.c    2007-08-07 13:13:03.0 +0200
 @@ -169,6 +169,14 @@ void enable_irq(unsigned int irq)
                 desc->depth--;
         }
         spin_unlock_irqrestore(&desc->lock, flags);
 +#ifdef CONFIG_HARDIRQS_SW_RESEND
 +       /*
 +        * Do a bh disable/enable pair to trigger any pending
 +        * irq resend logic:
 +        */
 +       local_bh_disable();
 +       local_bh_enable();
 +#endif
  }
  EXPORT_SYMBOL(enable_irq);

 diff -Nurp 2.6.22.1-/kernel/irq/resend.c 2.6.22.1/kernel/irq/resend.c
 --- 2.6.22.1-/kernel/irq/resend.c   2007-07-09 01:32:17.0 +0200
 +++ 2.6.22.1/kernel/irq/resend.c    2007-08-07 13:57:54.0 +0200
 @@ -62,16 +62,24 @@ void check_irq_resend(struct irq_desc *d
          */
         desc->chip->enable(irq);

 +       /*
 +        * Temporary hack to figure out more about the problem, which
 +        * is causing the ancient network cards to die.
 +        */
 +
         if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
                 desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;

 -               if (!desc->chip || !desc->chip->retrigger ||
 -                                       !desc->chip->retrigger(irq)) {
 +               if (desc->handle_irq == handle_edge_irq) {
 +                       if (desc->chip->retrigger)
 +                               desc->chip->retrigger(irq);
 +                       return;
 +               }
  #ifdef CONFIG_HARDIRQS_SW_RESEND
 -                       /* Set it pending and activate the softirq: */
 -                       set_bit(irq, irqs_resend);
 -                       tasklet_schedule(&resend_tasklet);
 +               WARN_ON_ONCE(1);
 +               /* Set it pending and activate the softirq: */
 +               set_bit(irq, irqs_resend);
 +               tasklet_schedule(&resend_tasklet);
  #endif
 -               }
         }
  }

Works fine with:
WARNING: at kernel/irq/resend.c:79 check_irq_resend()

Call Trace:
 [8025e660] check_irq_resend+0xc0/0xd0
 [8025e1cd] enable_irq+0xed/0xf0
 [8807f21d] :8390:ei_start_xmit+0x14d/0x30c
 [8024d055] lock_release_non_nested+0xe5/0x190
 [80539b78] __qdisc_run+0x98/0x1f0
 [80539b8e] __qdisc_run+0xae/0x1f0
 [8052b65e] dev_hard_start_xmit+0x26e/0x2d0
 [80539ba0] __qdisc_run+0xc0/0x1f0
 [8052dc2f] dev_queue_xmit+0x24f/0x310
 [805337a7] neigh_resolve_output+0xe7/0x290
 [8054f5c0] dst_output+0x0/0x10
 [80552aff] ip_output+0x19f/0x340
 [80551f77] ip_queue_xmit+0x217/0x430
 [80563b2a] tcp_transmit_skb+0x40a/0x7c0
 [805657bb] __tcp_push_pending_frames+0x11b/0x940
 [8055972a] tcp_sendmsg+0x87a/0xc80
 [80577735] inet_sendmsg+0x45/0x80
 [8051e2d4] sock_aio_write+0x104/0x120
 [80285fc1] do_sync_write+0xf1/0x130
 [80243290] autoremove_wake_function+0x0/0x40
 [802868e9] vfs_write+0x159/0x170
 [80286ef0] sys_write+0x50/0x90
 [802097fe] system_call+0x7e/0x83
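
For readers following the software resend path that this warning comes from (set_bit() on irqs_resend plus a tasklet), here is a minimal user-space model of the idea: the irq is only marked in a bitmap, and a deferred pass later calls the handler once per marked irq. The names ending in _sketch, mark_for_resend(), run_resend() and record_irq() are hypothetical, the bitmap is limited to 32 irqs for simplicity, and there is no locking or real deferral here.

```c
#include <assert.h>
#include <stddef.h>

#define NR_IRQS_SKETCH 32u  /* assumption: fits in one unsigned long */

unsigned long irqs_resend_sketch;  /* one pending bit per irq */

/* Models set_bit(irq, irqs_resend): marking twice is harmless. */
void mark_for_resend(unsigned int irq)
{
    assert(irq < NR_IRQS_SKETCH);
    irqs_resend_sketch |= 1UL << irq;
}

/*
 * Models the deferred (tasklet) pass: clears each pending bit and
 * calls the handler once for it. Returns how many were replayed.
 */
unsigned int run_resend(void (*handler)(unsigned int irq))
{
    unsigned int irq, replayed = 0;

    assert(handler != NULL);

    for (irq = 0; irq < NR_IRQS_SKETCH; irq++) {
        unsigned long bit = 1UL << irq;

        if (irqs_resend_sketch & bit) {
            irqs_resend_sketch &= ~bit;  /* clear before replaying */
            handler(irq);
            replayed++;
        }
    }
    return replayed;
}

/* Illustration only: remembers the last irq that was replayed. */
unsigned int last_replayed_irq;

void record_irq(unsigned int irq)
{
    last_replayed_irq = irq;
}
```

The point of the model is that the replay goes through the normal handler call and never through the local APIC, which is what distinguishes it from the hardware resend under suspicion.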


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-08 Thread Marcin Ślusarz
2007/8/7, Jarek Poplawski [EMAIL PROTECTED]:
 And here is one more patch to test the same idea (chip->retrigger()).
 Let's try i386 way! (I hope I will not be arrested for this...)
 (Should be tested without any previous patches.)

 Jarek P.

 PS: as above

 ---

 diff -Nurp 2.6.22.1-/arch/x86_64/kernel/io_apic.c 2.6.22.1/arch/x86_64/kernel/io_apic.c
 --- 2.6.22.1-/arch/x86_64/kernel/io_apic.c  2007-07-09 01:32:17.0 +0200
 +++ 2.6.22.1/arch/x86_64/kernel/io_apic.c   2007-08-07 14:37:45.0 +0200
 @@ -1311,15 +1311,8 @@ static unsigned int startup_ioapic_irq(u
  static int ioapic_retrigger_irq(unsigned int irq)
  {
  	struct irq_cfg *cfg = &irq_cfg[irq];
 -	cpumask_t mask;
 -	unsigned long flags;
 -
 -	spin_lock_irqsave(&vector_lock, flags);
 -	cpus_clear(mask);
 -	cpu_set(first_cpu(cfg->domain), mask);

 -	send_IPI_mask(mask, cfg->vector);
 -	spin_unlock_irqrestore(&vector_lock, flags);
 +	send_IPI_self(cfg->vector);

  	return 1;
  }

Network card timed out with this patch.

Marcin


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-08 Thread Jarek Poplawski
Read below please:

On Wed, Aug 08, 2007 at 01:09:36PM +0200, Marcin Ślusarz wrote:
 2007/8/7, Jarek Poplawski [EMAIL PROTECTED]:
  So, let's try this idea as well: a modified version of Ingo's
  "x86: activate HARDIRQS_SW_RESEND" patch.
  (Don't forget to run make oldconfig before make.)
  For testing only.
 
  Cheers,
  Jarek P.
 
  PS: alas there was not even time for compile checking...
 
  ---
 
  diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
  --- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.0 +0200
  +++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.0 +0200
  @@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
   	depends on GENERIC_HARDIRQS && SMP
   	default y
  
  +config HARDIRQS_SW_RESEND
  +	bool
  +	default y
  +
   config X86_SMP
   	bool
   	depends on SMP && !X86_VOYAGER
  diff -Nurp 2.6.22.1-/arch/x86_64/Kconfig 2.6.22.1/arch/x86_64/Kconfig
  --- 2.6.22.1-/arch/x86_64/Kconfig   2007-07-09 01:32:17.0 +0200
  +++ 2.6.22.1/arch/x86_64/Kconfig2007-08-07 13:13:03.0 +0200
  @@ -690,6 +690,10 @@ config GENERIC_PENDING_IRQ
   	depends on GENERIC_HARDIRQS && SMP
   	default y
  
  +config HARDIRQS_SW_RESEND
  +	bool
  +	default y
  +
   menu "Power management options"
  
   source kernel/power/Kconfig
  diff -Nurp 2.6.22.1-/kernel/irq/manage.c 2.6.22.1/kernel/irq/manage.c
  --- 2.6.22.1-/kernel/irq/manage.c   2007-07-09 01:32:17.0 +0200
  +++ 2.6.22.1/kernel/irq/manage.c2007-08-07 13:13:03.0 +0200
  @@ -169,6 +169,14 @@ void enable_irq(unsigned int irq)
   		desc->depth--;
   	}
   	spin_unlock_irqrestore(&desc->lock, flags);
  +#ifdef CONFIG_HARDIRQS_SW_RESEND
  +	/*
  +	 * Do a bh disable/enable pair to trigger any pending
  +	 * irq resend logic:
  +	 */
  +	local_bh_disable();
  +	local_bh_enable();
  +#endif
   }
   EXPORT_SYMBOL(enable_irq);
 
  diff -Nurp 2.6.22.1-/kernel/irq/resend.c 2.6.22.1/kernel/irq/resend.c
  --- 2.6.22.1-/kernel/irq/resend.c   2007-07-09 01:32:17.0 +0200
  +++ 2.6.22.1/kernel/irq/resend.c2007-08-07 13:57:54.0 +0200
  @@ -62,16 +62,24 @@ void check_irq_resend(struct irq_desc *d
   	 */
   	desc->chip->enable(irq);
  
  +	/*
  +	 * Temporary hack to figure out more about the problem, which
  +	 * is causing the ancient network cards to die.
  +	 */
  +
   	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
   		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
  
  -		if (!desc->chip || !desc->chip->retrigger ||
  -					!desc->chip->retrigger(irq)) {
  +		if (desc->handle_irq == handle_edge_irq) {
  +			if (desc->chip->retrigger)
  +				desc->chip->retrigger(irq);
  +			return;
  +		}
   #ifdef CONFIG_HARDIRQS_SW_RESEND
  -			/* Set it pending and activate the softirq: */
  -			set_bit(irq, irqs_resend);
  -			tasklet_schedule(&resend_tasklet);
  +		WARN_ON_ONCE(1);
  +		/* Set it pending and activate the softirq: */
  +		set_bit(irq, irqs_resend);
  +		tasklet_schedule(&resend_tasklet);
   #endif
  -		}
   	}
   }
 
 Works fine with:

Very nice! It's about time this kernel started to behave...

 WARNING: at kernel/irq/resend.c:79 check_irq_resend()
 
 Call Trace:
  [8025e660] check_irq_resend+0xc0/0xd0
  [8025e1cd] enable_irq+0xed/0xf0
  [8807f21d] :8390:ei_start_xmit+0x14d/0x30c
  [8024d055] lock_release_non_nested+0xe5/0x190
  [80539b78] __qdisc_run+0x98/0x1f0
  [80539b8e] __qdisc_run+0xae/0x1f0
  [8052b65e] dev_hard_start_xmit+0x26e/0x2d0
  [80539ba0] __qdisc_run+0xc0/0x1f0
  [8052dc2f] dev_queue_xmit+0x24f/0x310
  [805337a7] neigh_resolve_output+0xe7/0x290
  [8054f5c0] dst_output+0x0/0x10
  [80552aff] ip_output+0x19f/0x340
  [80551f77] ip_queue_xmit+0x217/0x430
  [80563b2a] tcp_transmit_skb+0x40a/0x7c0
  [805657bb] __tcp_push_pending_frames+0x11b/0x940
  [8055972a] tcp_sendmsg+0x87a/0xc80
  [80577735] inet_sendmsg+0x45/0x80
  [8051e2d4] sock_aio_write+0x104/0x120
  [80285fc1] do_sync_write+0xf1/0x130
  [80243290] autoremove_wake_function+0x0/0x40
  [802868e9] vfs_write+0x159/0x170
  [80286ef0] sys_write+0x50/0x90
  [802097fe] system_call+0x7e/0x83
 

So, it looks like x86_64 io_apic's IPI code was unused too long...
I hope it's a piece of cake for Ingo now...

Thanks very much Marcin!

If it's possible for you and Jean-Baptiste, please try today's patch
with -rc2, and maybe this patch once more (with -rc1 or older).

Regards,
Jarek P. 


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-08 Thread Jarek Poplawski
On Wed, Aug 08, 2007 at 01:42:43PM +0200, Jarek Poplawski wrote:
...
 So, it looks like x86_64 io_apic's IPI code was unused too long...

To be fair it's x86_64 lapic's IPI code.

Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-08 Thread Jarek Poplawski
On Wed, Aug 08, 2007 at 10:59:22AM +0200, Jean-Baptiste Vignaud wrote:
...
  If you would like to read something more about testing (then, of
  course, my suggestions could turn out to be invalid - I'm a very bad
  tester myself...) you can try this:
  http://www.stardust.webpages.pl/files/handbook/
 
 I'll have a look at the document.

BTW: this document describes some methods for a kind of 'professional'
testing (so you could save time if you do it very often). But you
shouldn't think all this knowledge or all these tools are necessary. You
can skip many such things and still do very valuable testing with simpler
methods. And there is a lot of simple & good advice as well.

BTW #2: all this testing of older versions, which I've described, makes
sense, of course, only if after the present patches you still think
the older kernel worked better for you.

Jarek P. 


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Mon, Aug 06, 2007 at 05:19:03PM -0400, Chuck Ebbert wrote:
 On 08/06/2007 04:42 PM, Jean-Baptiste Vignaud wrote:
  Mmm, bad news, after 4 hours of intensive network stressing, one of the 2
  3com cards failed with the latest fedora kernel.
  
  Aug  6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out
  Aug  6 22:31:09 loki kernel: eth2: transmit timed out, tx_status 00 status e601.
  Aug  6 22:31:09 loki kernel:   diagnostics: net 0ccc media 8880 dma 003a fifo 8000
  Aug  6 22:31:09 loki kernel: eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
  Aug  6 22:31:09 loki kernel:   Flags; bus-master 1, dirty 26085000(8) current 26085000(8)
  Aug  6 22:31:09 loki kernel:   Transmit list  vs. 81007c807700.
  
  Stressing eth2 by copying large files to a samba share and eth0 by
  downloading big files from the internet.
 
 So even the full revert doesn't fix the 3Com driver, it just makes it less
 likely to do that.
 
 The other patch probably won't be any better -- I'd guess there's some
 kind of IRQ handling bug in that driver.
 

I don't know how fast these 3com chips are compared to the 8390
described by Alan, nor how irqs are shared on Jean-Baptiste's box,
but I'm surprised they could have worked while sharing interrupts
without such timeouts before this change in 2.6.21. It seems some
of those older chips, because of their slowness, could have transmit
problems even without irq sharing. So, IMHO, if possible, irq sharing
should never be enabled between two (or more) drivers that both use
disable_irq.
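
To illustrate the point about two drivers that both use disable_irq on
one shared line, here is a minimal user-space sketch (not kernel code).
It only assumes that disable/enable calls nest, so the line stays off
until every disable has been matched by an enable; struct irq_line and
the model_* names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one shared IRQ line; not a kernel structure. */
struct irq_line {
	unsigned int depth;	/* number of unmatched disable calls */
	bool masked;		/* true while no handler on this line can run */
};

/* Model of a nested disable: the first caller turns the line off. */
static void model_disable_irq(struct irq_line *line)
{
	if (line->depth++ == 0)
		line->masked = true;
}

/* Model of a nested enable: only the last caller turns the line on. */
static void model_enable_irq(struct irq_line *line)
{
	assert(line->depth > 0);	/* unbalanced enable is a bug */
	if (--line->depth == 0)
		line->masked = false;
}
```

With two drivers on the same line, the first model_enable_irq() call
leaves masked set, so the slower driver keeps the faster one waiting for
its interrupts as well. That is the kind of delay suspected above.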

These timeout problems were reported a long time ago, but I think
it would be nice if this thread could at least remove the new
problems reported only after 2.6.21, which seems to be possible
now, after Marcin's diagnosis: by reverting the whole 2.6.21 patch
or by the current temporary patch in 2.6.23-rc2's resend.c.
It would be nice if you could try this patch too.

BTW: Jean-Babtiste, could you send or point to your current configs?
I mean at least /proc/interrupts, but with dmesg and .config it would
be even better. (I assume this last report was about the revert patch
mentioned by Chuck, not the one below your message?)

Regards,
Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-07 Thread Marcin Ślusarz
2007/8/6, Ingo Molnar [EMAIL PROTECTED]:
 (..)
 please try Jarek's second patch too - there was a missing unmask.

 Ingo

 --
 Subject: genirq: fix simple and fasteoi irq handlers
 From: Jarek Poplawski [EMAIL PROTECTED]

 After the "genirq: do not mask interrupts by default" patch, interrupts
 should be disabled not immediately upon request, but after they happen.
 But handle_simple_irq() and handle_fasteoi_irq() can skip this once or
 more if an irq is just being serviced (IRQ_INPROGRESS), possibly
 disrupting a driver's work.

 Finding the main reason for the problems here, pointing to the broken
 patch and making the first patch which can fix this were all done by
 Marcin Slusarz. Additional test patches by Thomas Gleixner and Ingo
 Molnar, tested by Marcin Slusarz, helped to narrow down the possible
 reasons even more. Thanks.

 PS: this patch fixes only one evident error here, but there could be
 more places affected by the above-mentioned change in irq handling.

 PS 2:
 After rethinking, IMHO, there are two most probable scenarios here:

 1. After hw resend there could be a conflict between retriggered
 edge type irq and the next level type one: e.g. if this level type
 irq (io_apic is enabled then) is triggered while retriggered irq is
 serviced (IRQ_INPROGRESS) there is goto out with eoi, and probably
 the next such levels are triggered and looping, so probably kind of
 flood in io_apic until this retriggered edge service has ended.
 2. There is something wrong with ioapic_retrigger_irq (less probable
 because this should be probably seen with 'normal' edge retriggers,
 but on the other hand, they could be less common).

 So, if there is #1, this fixed patch should work.

 But, since level types don't need these retriggers too much, I think
 this "don't mask interrupts by default" idea should be rethought:
 is there enough gain to risk such hard-to-diagnose errors?

 So, IMHO, there should be at least a possibility to turn this off for
 level types in config (it should be a visible option, so people could
 find & try this before writing for help or changing a network card).


 Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]

 ---

 diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
 --- 2.6.23-rc1-/kernel/irq/chip.c   2007-07-09 01:32:17.0 +0200
 +++ 2.6.23-rc1/kernel/irq/chip.c2007-08-05 21:49:46.0 +0200
 @@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru

  	spin_lock(&desc->lock);

 -	if (unlikely(desc->status & IRQ_INPROGRESS))
 -		goto out_unlock;
  	kstat_cpu(cpu).irqs[irq]++;

  	action = desc->action;
 -	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
 +	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
 +						 IRQ_DISABLED)))) {
  		if (desc->chip->mask)
  			desc->chip->mask(irq);
  		desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 @@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru

  	spin_lock(&desc->lock);
  	desc->status &= ~IRQ_INPROGRESS;
 +	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
 +		desc->chip->unmask(irq);
  out_unlock:
  	spin_unlock(&desc->lock);
  }
 @@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str

  	spin_lock(&desc->lock);

 -	if (unlikely(desc->status & IRQ_INPROGRESS))
 -		goto out;
 -
  	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
  	kstat_cpu(cpu).irqs[irq]++;

  	/*
 -	 * If its disabled or no action available
 +	 * If it's running, disabled or no action available
  	 * then mask it and get out of here:
  	 */
  	action = desc->action;
 -	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
 +	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
 +						 IRQ_DISABLED)))) {
  		desc->status |= IRQ_PENDING;
  		if (desc->chip->mask)
  			desc->chip->mask(irq);
 @@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str

  	spin_lock(&desc->lock);
  	desc->status &= ~IRQ_INPROGRESS;
 +	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
 +		desc->chip->unmask(irq);
  out:
  	desc->chip->eoi(irq);


Network card still locks up (tested on 2.6.22.1). I had to upload more
data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
might be a coincidence...
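
For readers following the patch description quoted above, the "mask
only after the interrupt happens" idea can be modelled in user space
roughly like this. It is only a sketch of the patched fasteoi path as
described there: the MODEL_* flag values, struct model_irq and the
model_* functions are invented for the example and are not the kernel's
definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, chosen only for this model. */
enum {
	MODEL_IRQ_DISABLED   = 1u << 0,
	MODEL_IRQ_PENDING    = 1u << 1,
	MODEL_IRQ_INPROGRESS = 1u << 2,
};

struct model_irq {
	unsigned int status;	/* MODEL_IRQ_* flags */
	bool hw_masked;		/* state of the (imaginary) chip mask bit */
	unsigned int handled;	/* how many times the action ran */
};

/* Lazy disable: only note the request, leave the hardware unmasked. */
static void model_disable(struct model_irq *d)
{
	assert(d != NULL);
	d->status |= MODEL_IRQ_DISABLED;
}

/*
 * Flow handler: if the irq is disabled or already being serviced,
 * mask it now and remember it as pending; otherwise run the action.
 * Returns true if the action ran.
 */
static bool model_handle(struct model_irq *d)
{
	assert(d != NULL);
	if (d->status & (MODEL_IRQ_DISABLED | MODEL_IRQ_INPROGRESS)) {
		d->status |= MODEL_IRQ_PENDING;
		d->hw_masked = true;
		return false;
	}
	d->handled++;
	return true;
}

/* Enable: unmask and report whether a missed irq has to be resent. */
static bool model_enable(struct model_irq *d)
{
	bool resend;

	assert(d != NULL);
	resend = (d->status & MODEL_IRQ_PENDING) != 0;
	d->status &= ~(MODEL_IRQ_DISABLED | MODEL_IRQ_PENDING);
	d->hw_masked = false;
	return resend;
}
```

In this model an interrupt that arrives while the line is disabled is
not lost: it is masked late and model_enable() reports that a resend is
needed, which is the step this thread suspects of going wrong.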

Marcin


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote:
 2007/8/6, Ingo Molnar [EMAIL PROTECTED]:
  (..)
  please try Jarek's second patch too - there was a missing unmask.
 
  Ingo
 
  --
  Subject: genirq: fix simple and fasteoi irq handlers
  From: Jarek Poplawski [EMAIL PROTECTED]
...
 Network card still locks up (tested on 2.6.22.1). I had to upload more
 data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
 might be a coincidence...

Thanks! It's good news after all: otherwise it would be really strange
that this place doesn't hit more people (it seems there is some safety
elsewhere for this).

BTW: I hope the previous patch by Thomas, with Ingo's warning added to
resend.c, had no problems under a similar load?

So, once more, I would suspect the hw retrigger code. Ingo, IMHO, this
patch for testing HARDIRQS_SW_RESEND could be reworked, so that
desc->chip->retrigger() is done only for edges and the tasklet only
for levels. BTW, I think the current warning in the temporary patch
is too early: we don't know whether ->retrigger() will take place
after it.
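
The split proposed here (hardware retrigger only for edges, the tasklet
only for levels) can be sketched as a small decision function. This is
a user-space model, not the real check_irq_resend(): the MODEL_* flag
values, the enums and model_check_resend() are hypothetical names used
only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, chosen only for this model. */
enum {
	MODEL_IRQ_PENDING = 1u << 0,
	MODEL_IRQ_REPLAY  = 1u << 1,
};

enum model_trigger { MODEL_EDGE, MODEL_LEVEL };

enum model_resend {
	MODEL_RESEND_NONE,	/* nothing to replay, or no way to do it */
	MODEL_RESEND_HW,	/* ask the chip to retrigger */
	MODEL_RESEND_SW,	/* mark the irq and schedule the tasklet */
};

/*
 * Decide how a pending irq would be replayed when it is re-enabled.
 * Edge irqs use the chip's retrigger hook if there is one; level irqs
 * always take the software resend path.
 */
static enum model_resend model_check_resend(unsigned int *status,
					     enum model_trigger type,
					     bool has_retrigger)
{
	assert(status != NULL);

	/* Replay only if pending and not already being replayed. */
	if ((*status & (MODEL_IRQ_PENDING | MODEL_IRQ_REPLAY)) !=
	    MODEL_IRQ_PENDING)
		return MODEL_RESEND_NONE;

	*status = (*status & ~MODEL_IRQ_PENDING) | MODEL_IRQ_REPLAY;

	if (type == MODEL_EDGE)
		return has_retrigger ? MODEL_RESEND_HW : MODEL_RESEND_NONE;

	return MODEL_RESEND_SW;
}
```

Keeping the two paths apart like this would show which of them loses
the interrupt when a card times out.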

Regards,
Jarek P.

PS: Marcin, if you need a break in this testing let us know!
I think the main idea of this bug is known enough.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-07 Thread Jean-Baptiste Vignaud

 BTW: Jean-Babtiste, could you send or point to you current configs?
 I mean at least proc/interrupts, but with dmesg and .config it would
 be even better. (I assume this last report was about the revert patch
 mentioned by Chuck, not the one below your message?)

Sure. 

Last reports are with the 2.6.22.1-41.fc7 kernel, which has in changelog :

* Sat Jul 28 2007 Chuck Ebbert [EMAIL PROTECTED]
- revert upstream genirq: do not mask interrupts by default


* interrupts (i use irqbalance, but problem was the same without)

[EMAIL PROTECTED] ~]# cat /proc/interrupts 
   CPU0   CPU1   
  0:   44874910668   IO-APIC-edge  timer
  1:241 58   IO-APIC-edge  i8042
  8:  0  0   IO-APIC-edge  rtc0
  9:  0  0   IO-APIC-fasteoi   acpi
 12:  2139   IO-APIC-edge  i8042
 14:  0  0   IO-APIC-edge  libata
 15:  0  0   IO-APIC-edge  libata
 16:  72625 96   IO-APIC-fasteoi   eth1
 17:   4667128   IO-APIC-fasteoi   eth2
 20:   4156  39870   IO-APIC-fasteoi   sata_nv
 21:  34794   9177   IO-APIC-fasteoi   sata_nv
 22:  0  0   IO-APIC-fasteoi   ehci_hcd:usb2
 23:   6005   1565   IO-APIC-fasteoi   ohci_hcd:usb1, sata_nv
2297:  3 492180   PCI-MSI-edge  eth0
NMI:  0  0 
LOC:49153454915282 
ERR:  0

problems are with eth1 and eth2 here. never had any problems with the onboard (eth0).

* pci

00:00.0 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a1)
00:01.0 ISA bridge: nVidia Corporation MCP55 LPC Bridge (rev a2)
00:01.1 SMBus: nVidia Corporation MCP55 SMBus (rev a2)
00:01.2 RAM memory: nVidia Corporation MCP55 Memory Controller (rev a2)
00:02.0 USB Controller: nVidia Corporation MCP55 USB Controller (rev a1)
00:02.1 USB Controller: nVidia Corporation MCP55 USB Controller (rev a2)
00:04.0 IDE interface: nVidia Corporation MCP55 IDE (rev a1)
00:05.0 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:05.1 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:05.2 IDE interface: nVidia Corporation MCP55 SATA Controller (rev a2)
00:06.0 PCI bridge: nVidia Corporation MCP55 PCI bridge (rev a2)
00:08.0 Bridge: nVidia Corporation MCP55 Ethernet (rev a2)
00:0a.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:0b.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:0c.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:0d.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:0e.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:0f.0 PCI bridge: nVidia Corporation MCP55 PCI Express bridge (rev a2)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:06.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
01:07.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
07:00.0 VGA compatible controller: nVidia Corporation NV44 [GeForce 6200 LE] (rev a1)

* dmesg (from a reboot this morning)

Linux version 2.6.22.1-41.fc7 ([EMAIL PROTECTED]) (gcc version 4.1.2 20070502 (Red Hat 4.1.2-12)) #1 SMP Fri Jul 27 18:21:43 EDT 2007
Command line: ro root=/dev/all/root
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009f000 (usable)
 BIOS-e820: 0009f000 - 000a (reserved)
 BIOS-e820: 000f - 0010 (reserved)
 BIOS-e820: 0010 - 7fee (usable)
 BIOS-e820: 7fee - 7fee3000 (ACPI NVS)
 BIOS-e820: 7fee3000 - 7fef (ACPI data)
 BIOS-e820: 7fef - 7ff0 (reserved)
 BIOS-e820: f000 - f400 (reserved)
 BIOS-e820: fec0 - 0001 (reserved)
Entering add_active_range(0, 0, 159) 0 entries of 3200 used
Entering add_active_range(0, 256, 524000) 1 entries of 3200 used
end_pfn_map = 1048576
DMI 2.4 present.
ACPI: RSDP 000F7620, 0024 (r2 Nvidia)
ACPI: XSDT 7FEE30C0, 0044 (r1 Nvidia ASUSACPI 42302E31 AWRD0)
ACPI: FACP 7FEEC400, 00F4 (r3 Nvidia ASUSACPI 42302E31 AWRD0)
ACPI: DSDT 7FEE3240, 9164 (r1 NVIDIA AWRDACPI 1000 MSFT  300)
ACPI: FACS 7FEE, 0040
ACPI: HPET 7FEEC600, 0038 (r1 Nvidia ASUSACPI 42302E31 AWRD   98)
ACPI: MCFG 7FEEC680, 003C (r1 Nvidia ASUSACPI 42302E31 AWRD0)
ACPI: APIC 7FEEC540, 007C (r1 Nvidia ASUSACPI 42302E31 AWRD0)
Scanning NUMA topology in Northbridge 24
No NUMA configuration found
Faking a node at -7fee
Entering add_active_range(0, 0, 159) 

Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-07 Thread Jean-Baptiste Vignaud

  * interrupts (i use irqbalance, but problem was the same without)
 
 I wonder if you tried without SMP too?

No, I did not. Do you think this could be a problem?
To test without SMP, do I need to recompile the kernel, or is there a
kernel parameter?



 BTW, Jean-Baptiste and Chuck - it seems, unless you have too much
 time, there is no use for testing my genirq: fix simple and fasteoi
 irq handlers patch.

Well, I just tested 2.6.23-rc1 with your patch and copied big files
(using smbclient):

Aug  7 11:11:53 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out
Aug  7 11:11:53 loki kernel: eth2: transmit timed out, tx_status 00 status e601.
Aug  7 11:11:53 loki kernel:   diagnostics: net 0ccc media 8880 dma 003a fifo
Aug  7 11:11:53 loki kernel: eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
Aug  7 11:11:53 loki kernel:   Flags; bus-master 1, dirty 93481(9) current 93481(9)
Aug  7 11:11:53 loki kernel:   Transmit list  vs. 81007be977a0.
Aug  7 11:11:53 loki kernel:   0: @81007be97200  length 805f status 0001005f
Aug  7 11:11:53 loki kernel:   1: @81007be972a0  length 805f status 0001005f
Aug  7 11:11:53 loki kernel:   2: @81007be97340  length 805f status 0001005f
Aug  7 11:11:53 loki kernel:   3: @81007be973e0  length 805f status 0001005f
Aug  7 11:11:53 loki kernel:   4: @81007be97480  length 803c status 0001003c
Aug  7 11:11:53 loki kernel:   5: @81007be97520  length 803c status 0001003c
Aug  7 11:11:53 loki kernel:   6: @81007be975c0  length 803c status 0001003c
Aug  7 11:11:53 loki kernel:   7: @81007be97660  length 803c status 8001003c
Aug  7 11:11:53 loki kernel:   8: @81007be97700  length 803c status 8001003c
Aug  7 11:11:53 loki kernel:   9: @81007be977a0  length 802a status 0001002a
Aug  7 11:11:53 loki kernel:   10: @81007be97840  length 803a status 0001003a
Aug  7 11:11:53 loki kernel:   11: @81007be978e0  length 805f status 0001005f
Aug  7 11:11:53 loki kernel:   12: @81007be97980  length 80be status 0c0100be
Aug  7 11:11:53 loki kernel:   13: @81007be97a20  length 80be status 0c0100be
Aug  7 11:11:53 loki kernel:   14: @81007be97ac0  length 805f status 0001005f
Aug  7 11:11:53 loki kernel:   15: @81007be97b60  length 805f status 0001005f

Thanks;

Jb



Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 10:10:34AM +0200, Jean-Baptiste Vignaud wrote:
 
  BTW: Jean-Babtiste, could you send or point to you current configs?

Oops! I'm very sorry for misspelling!

  I mean at least proc/interrupts, but with dmesg and .config it would
  be even better. (I assume this last report was about the revert patch
  mentioned by Chuck, not the one below your message?)
 
 Sure.
 
 Last reports are with the 2.6.22.1-41.fc7 kernel, which has in changelog :
 
 * Sat Jul 28 2007 Chuck Ebbert [EMAIL PROTECTED]
 - revert upstream genirq: do not mask interrupts by default
 
 
 * interrupts (i use irqbalance, but problem was the same without)

I wonder if you tried without SMP too?

 
 [EMAIL PROTECTED] ~]# cat /proc/interrupts
CPU0   CPU1
...
  16:  72625 96   IO-APIC-fasteoi   eth1
  17:   4667128   IO-APIC-fasteoi   eth2
  20:   4156  39870   IO-APIC-fasteoi   sata_nv
  21:  34794   9177   IO-APIC-fasteoi   sata_nv
  22:  0  0   IO-APIC-fasteoi   ehci_hcd:usb2
  23:   6005   1565   IO-APIC-fasteoi   ohci_hcd:usb1, sata_nv
 2297:  3 492180   PCI-MSI-edge  eth0
 NMI:  0  0
 LOC:49153454915282
 ERR:  0

So, here it's not about irq sharing...

 
 problems are with eth1 and eth2 here. never had any problems with the onboard 
 (eth0).
...
 
 * .config
 
 i dont have it, it was the standard fedora one.
 
 i'm not sure that the problem is related to 3com, because i replaced those 
 cards by old card i had in spare :
 
 01:06.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
 01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
 
 and i had the exact same problem.
 
 Those 3com cards were working 24/24 before i went to fedora 7 (and kernel 
 2.6.21 then).

It seems that since 2.6.21 the problems are mainly about 'older'
network chips on x86_64. The reverted patch should matter only for
drivers using disable_irq, but I see forcedeth could use it too, so
it's not clear yet. And, BTW, there were other changes around irqs and
pci, so everybody could have something a bit different behind similar
timeout logs...

BTW, Jean-Baptiste and Chuck - it seems, unless you have too much
time, there is no use for testing my genirq: fix simple and fasteoi
irq handlers patch.

Thanks,
Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 11:21:07AM +0200, Jean-Baptiste Vignaud wrote:
 
   * interrupts (i use irqbalance, but problem was the same without)
 
  I wonder if you tried without SMP too?
 
 No i did not. Do you think that this can be a problem ?
 To test with no SMP, do i need to recompile kernel or is there a kernel 
 parameter ?

It's always better to exclude any complications if possible.
Yes, there is a kernel parameter for this: nosmp. So, if you
have some time to spare, I think 2.6.23-rc2 booted with nosmp
could be an interesting option.

 
 
  BTW, Jean-Baptiste and Chuck - it seems, unless you have too much
  time, there is no use for testing my genirq: fix simple and fasteoi
  irq handlers patch.
 
 Well i just  tested 2.6.23-rc1 with your patch and copied (using smbclient) 
 big files :
 
 Aug  7 11:11:53 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out
 Aug  7 11:11:53 loki kernel: eth2: transmit timed out, tx_status 00 status e601.
 Aug  7 11:11:53 loki kernel:   diagnostics: net 0ccc media 8880 dma 003a fifo
 Aug  7 11:11:53 loki kernel: eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
 Aug  7 11:11:53 loki kernel:   Flags; bus-master 1, dirty 93481(9) current 93481(9)
 Aug  7 11:11:53 loki kernel:   Transmit list  vs. 81007be977a0.
 Aug  7 11:11:53 loki kernel:   0: @81007be97200  length 805f status 0001005f
...

Thanks,
Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote:
 2007/8/7, Jarek Poplawski [EMAIL PROTECTED]:
  On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote:
   Network card still locks up (tested on 2.6.22.1). I had to upload more
   data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
   might be a coincidence...
 
  Thanks! It's a good news after all - it would be really strange why
  this place doesn't hit more people (it seems there is some safety
  elsewhere for this).
 
  BTW: I hope, this previous Thomas' patch with Ingo's warning to resend.c
  (with a warning), had no problems with a similar load?
 I always tested on 500-600 MB dataset
 
  PS: Marcin, if you need a break in this testing let us know!
 No, i don't need a break. I'll have more time in next weeks.

Great! So, I'll try to send a patch with _SW_RESEND in a few hours,
if Ingo doesn't prepare something for you.

Thanks,
Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Mon, Aug 06, 2007 at 01:43:48PM -0400, Chuck Ebbert wrote:
 On 08/06/2007 03:03 AM, Ingo Molnar wrote:
  
  But, since level types don't need these retriggers too much, I think
  this "don't mask interrupts by default" idea should be rethought:
  is there enough gain to risk such hard-to-diagnose errors?

  
 
 I reverted those masking changes in Fedora and the baffling problem
 with 3Com 3C905 network adapters went away.
 
 Before, they would print:
 
 eth0: transmit timed out, tx_status 00 status e601.
   diagnostics: net 0ccc media 8880 dma 003a fifo 
 eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
   Flags; bus-master 1, dirty 295757(13) current 295757(13)
   Transmit list  vs. f7150a20.
   0: @f7150200  length 8070 status 0c010070
   1: @f71502a0  length 8070 status 0c010070
   2: @f7150340  length 805c status 0c01005c
 
 Now they just work, apparently...
 
 So why not just revert the change?
 

Ingo has written about such a possibility. But it would be good
to know precisely which place is to blame, as well. Since this
diagnosing takes time, I think Chuck is right, and maybe at least
the temporary patch for resend.c, without the warning, should
be recommended for the stable kernels (2.6.21 and 2.6.22)?

Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 11:52:46AM +0200, Jarek Poplawski wrote:
 On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote:
  2007/8/7, Jarek Poplawski [EMAIL PROTECTED]:
   On Tue, Aug 07, 2007 at 09:46:36AM +0200, Marcin Ślusarz wrote:
Network card still locks up (tested on 2.6.22.1). I had to upload more
data than usual (~350 MB vs ~1-100 MB) to trigger that bug but it
might be a coincidence...
  
   Thanks! It's a good news after all - it would be really strange why
   this place doesn't hit more people (it seems there is some safety
   elsewhere for this).
  
   BTW: I hope, this previous Thomas' patch with Ingo's warning to resend.c
   (with a warning), had no problems with a similar load?
  I always tested on 500-600 MB dataset
  
   PS: Marcin, if you need a break in this testing let us know!
  No, i don't need a break. I'll have more time in next weeks.
 
 Great! So, I'll try to send a patch with _SW_RESEND in a few hours,
 if Ingo doesn't prepare something for you.

So, let's try this idea as well: a modified version of Ingo's
"x86: activate HARDIRQS_SW_RESEND" patch.
(Don't forget to run make oldconfig before make.)
For testing only.

Cheers,
Jarek P.

PS: alas there was not even time for compile checking...

---

diff -Nurp 2.6.22.1-/arch/i386/Kconfig 2.6.22.1/arch/i386/Kconfig
--- 2.6.22.1-/arch/i386/Kconfig 2007-07-09 01:32:17.0 +0200
+++ 2.6.22.1/arch/i386/Kconfig  2007-08-07 13:13:03.0 +0200
@@ -1252,6 +1252,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 config X86_SMP
 	bool
 	depends on SMP && !X86_VOYAGER
diff -Nurp 2.6.22.1-/arch/x86_64/Kconfig 2.6.22.1/arch/x86_64/Kconfig
--- 2.6.22.1-/arch/x86_64/Kconfig   2007-07-09 01:32:17.0 +0200
+++ 2.6.22.1/arch/x86_64/Kconfig2007-08-07 13:13:03.0 +0200
@@ -690,6 +690,10 @@ config GENERIC_PENDING_IRQ
 	depends on GENERIC_HARDIRQS && SMP
 	default y
 
+config HARDIRQS_SW_RESEND
+	bool
+	default y
+
 menu "Power management options"
 
 source kernel/power/Kconfig
diff -Nurp 2.6.22.1-/kernel/irq/manage.c 2.6.22.1/kernel/irq/manage.c
--- 2.6.22.1-/kernel/irq/manage.c   2007-07-09 01:32:17.0 +0200
+++ 2.6.22.1/kernel/irq/manage.c2007-08-07 13:13:03.0 +0200
@@ -169,6 +169,14 @@ void enable_irq(unsigned int irq)
 		desc->depth--;
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+	/*
+	 * Do a bh disable/enable pair to trigger any pending
+	 * irq resend logic:
+	 */
+	local_bh_disable();
+	local_bh_enable();
+#endif
 }
 EXPORT_SYMBOL(enable_irq);
 
diff -Nurp 2.6.22.1-/kernel/irq/resend.c 2.6.22.1/kernel/irq/resend.c
--- 2.6.22.1-/kernel/irq/resend.c   2007-07-09 01:32:17.0 +0200
+++ 2.6.22.1/kernel/irq/resend.c2007-08-07 13:57:54.0 +0200
@@ -62,16 +62,24 @@ void check_irq_resend(struct irq_desc *d
 */
desc-chip-enable(irq);
 
+   /*
+* Temporary hack to figure out more about the problem, which
+* is causing the ancient network cards to die.
+*/
+
if ((status  (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
desc-status = (status  ~IRQ_PENDING) | IRQ_REPLAY;
 
-   if (!desc-chip || !desc-chip-retrigger ||
-   !desc-chip-retrigger(irq)) {
+   if (desc-handle_irq == handle_edge_irq) {
+   if (desc-chip-retrigger)
+   desc-chip-retrigger(irq);
+   return;
+   }
 #ifdef CONFIG_HARDIRQS_SW_RESEND
-   /* Set it pending and activate the softirq: */
-   set_bit(irq, irqs_resend);
-   tasklet_schedule(resend_tasklet);
+   WARN_ON_ONCE(1);
+   /* Set it pending and activate the softirq: */
+   set_bit(irq, irqs_resend);
+   tasklet_schedule(resend_tasklet);
 #endif
-   }
}
 }
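
For readers following the CONFIG_HARDIRQS_SW_RESEND branch in the patch
above, here is a minimal user-space sketch of the mechanism, assuming the
usual shape of a software resend: the irq number is recorded in a bitmap,
and a deferred routine later drains the bitmap and replays each pending
irq. It is a model only, not kernel code. model_irqs_resend,
model_mark_resend() and model_run_resend() are hypothetical names standing
in for irqs_resend, set_bit() + tasklet_schedule(), and the resend tasklet
body.

```c
#define MODEL_NR_IRQS 32u

/* Hypothetical stand-ins for the kernel's resend bitmap and for the
 * per-irq flow handlers; this is a model, not kernel code. */
static unsigned long model_irqs_resend;			/* bit n set: irq n needs a replay */
static unsigned int model_replayed[MODEL_NR_IRQS];	/* replays delivered per irq */

/* Analogue of set_bit(irq, irqs_resend) + tasklet_schedule(): this only
 * records the request; nothing is replayed at this point. */
void model_mark_resend(unsigned int irq)
{
	if (irq < MODEL_NR_IRQS)
		model_irqs_resend |= 1UL << irq;
}

/* Analogue of the deferred resend routine: drain the bitmap, lowest irq
 * first, and count one replay per pending irq. Returns replays done. */
unsigned int model_run_resend(void)
{
	unsigned int irq, done = 0;

	for (irq = 0; irq < MODEL_NR_IRQS; irq++) {
		if (!(model_irqs_resend & (1UL << irq)))
			continue;
		model_irqs_resend &= ~(1UL << irq);
		model_replayed[irq]++;
		done++;
	}
	return done;
}
```

Marking the same irq twice before the drain runs still yields a single
replay, because the bitmap holds only one bit per irq.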


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-07 Thread Jarek Poplawski
On Tue, Aug 07, 2007 at 02:13:39PM +0200, Jarek Poplawski wrote:
> On Tue, Aug 07, 2007 at 11:52:46AM +0200, Jarek Poplawski wrote:
> > On Tue, Aug 07, 2007 at 11:37:01AM +0200, Marcin Ślusarz wrote:
...
> > > No, i don't need a break. I'll have more time in next weeks.
> >
> > Great! So, I'll try to send a patch with _SW_RESEND in a few hours,
> > if Ingo doesn't prepare something for you.
>
> So, the let's try this idea yet: modified Ingo's x86: activate
> HARDIRQS_SW_RESEND patch.
> (Don't forget about make oldconfig before make.)
> For testing only.
>
> Cheers,
> Jarek P.
>
> PS: alas there was not even time for compile checking...

And here is one more patch to test the same idea (chip->retrigger()).
Let's try the i386 way! (I hope I will not be arrested for this...)
(It should be tested without any of the previous patches applied.)

Jarek P.

PS: as above

---

diff -Nurp 2.6.22.1-/arch/x86_64/kernel/io_apic.c 2.6.22.1/arch/x86_64/kernel/io_apic.c
--- 2.6.22.1-/arch/x86_64/kernel/io_apic.c	2007-07-09 01:32:17.0 +0200
+++ 2.6.22.1/arch/x86_64/kernel/io_apic.c	2007-08-07 14:37:45.0 +0200
@@ -1311,15 +1311,8 @@ static unsigned int startup_ioapic_irq(u
 static int ioapic_retrigger_irq(unsigned int irq)
 {
 	struct irq_cfg *cfg = &irq_cfg[irq];
-	cpumask_t mask;
-	unsigned long flags;
-
-	spin_lock_irqsave(&vector_lock, flags);
-	cpus_clear(mask);
-	cpu_set(first_cpu(cfg->domain), mask);
 
-	send_IPI_mask(mask, cfg->vector);
-	spin_unlock_irqrestore(&vector_lock, flags);
+	send_IPI_self(cfg->vector);
 
 	return 1;
 }


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-06 Thread Marcin Ślusarz
2007/8/1, Ingo Molnar [EMAIL PROTECTED]:
> ok, it wasnt supposed to be _that_ easy i guess :-) Can you please
> (re-)confirm that the workaround below indeed fixes the hung card
> problem? (after producing a single WARN_ON message into the syslog)
yes, with this patch everything works fine

end of dmesg:

EXT3 FS on sda7, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 1020112k swap on /dev/sda2.  Priority:-1 extents:1 across:1020112k
skge eth1: enabling interface
NET: Registered protocol family 17
WARNING: at kernel/irq/resend.c:70 check_irq_resend()

Call Trace:
 [8025e5c8] check_irq_resend+0xa8/0xc0
 [8025e1ca] enable_irq+0xea/0xf0
 [8800f21d] :8390:ei_start_xmit+0x14d/0x30c
 [8052b5ce] dev_hard_start_xmit+0x26e/0x2d0
 [80539b10] __qdisc_run+0xc0/0x1f0
 [8052db9f] dev_queue_xmit+0x24f/0x310
 [880d7ac9] :af_packet:packet_sendmsg+0x259/0x2c0
 [8051f0bf] sock_sendmsg+0xdf/0x110
 [8024b8c9] trace_hardirqs_on+0xd9/0x180
 [8024c1dd] __lock_acquire+0x31d/0xff0
 [80243290] autoremove_wake_function+0x0/0x40
 [803e3103] __up_read+0x23/0xb0
 [803e3125] __up_read+0x45/0xb0
 [805bd8f5] _spin_unlock_irqrestore+0x65/0x80
 [8024b8c9] trace_hardirqs_on+0xd9/0x180
 [803e3125] __up_read+0x45/0xb0
 [802464b6] up_read+0x26/0x30
 [8051f4f1] sys_sendto+0x111/0x150
 [8024b8c9] trace_hardirqs_on+0xd9/0x180
 [805bd93b] _spin_unlock_irq+0x2b/0x60
 [8023861a] do_sigaction+0x11a/0x1d0
 [802097fe] system_call+0x7e/0x83

Marking TSC unstable due to cpufreq changes
Time: acpi_pm clocksource has been installed.

> also, does removing the ne2k-pci module and reinserting it again solve
> the problem too, or is your network card stuck forever once it got into
> that state?
it doesn't change anything - i tried reloading both modules (ne2k_pci and skge)

Marcin


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-06 Thread Marcin Ślusarz
2007/7/31, Jarek Poplawski [EMAIL PROTECTED]:
> Marcin,
>
> I see you're quite busy, but if after testing this next Ingo's patch
> you are alive yet, maybe you could try one more idea? No patch this
> time, but if you could try this after adding boot option noirqdebug
> (I'd like to be sure it's not about timinig after all).
It didn't change anything. Network card still timed out.

Marcin


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-06 Thread Ingo Molnar

* Marcin Ślusarz [EMAIL PROTECTED] wrote:

> 2007/7/31, Jarek Poplawski [EMAIL PROTECTED]:
> > Marcin,
> >
> > I see you're quite busy, but if after testing this next Ingo's patch
> > you are alive yet, maybe you could try one more idea? No patch this
> > time, but if you could try this after adding boot option noirqdebug
> > (I'd like to be sure it's not about timinig after all).
> It didn't change anything. Network card still timed out.

please try Jarek's second patch too - there was a missing unmask.

Ingo

--
Subject: genirq: fix simple and fasteoi irq handlers
From: Jarek Poplawski [EMAIL PROTECTED]

After the "genirq: do not mask interrupts by default" patch, interrupts
should be disabled not immediately upon request, but after they happen.
But handle_simple_irq() and handle_fasteoi_irq() can skip this once or
more if an irq is just being serviced (IRQ_INPROGRESS), possibly
disrupting a driver's work.

Finding the main cause of the problems here, pointing out the broken
patch and making the first patch which can fix this were all done by
Marcin Slusarz. Additional test patches from Thomas Gleixner and Ingo
Molnar, tested by Marcin Slusarz, helped to narrow down the possible
causes even more. Thanks.

PS: this patch fixes only one evident error here, but there could be
more places affected by the above-mentioned change in irq handling.

PS 2:
After rethinking, IMHO, there are two most probable scenarios here:

1. After a hw resend there could be a conflict between the retriggered
edge type irq and the next level type one: e.g. if this level type
irq (io_apic is enabled then) is triggered while the retriggered irq is
being serviced (IRQ_INPROGRESS), there is a goto out with eoi, and
probably the next such levels are triggered and looping, so probably a
kind of flood in io_apic until this retriggered edge service has ended.
2. There is something wrong with ioapic_retrigger_irq (less probable
because this should probably be seen with 'normal' edge retriggers too,
but on the other hand, they could be less common).

So, if it is #1, this fixed patch should work.

But, since level types don't need these retriggers too much, I think
this "don't mask interrupts by default" idea should be rethought:
is there enough gain to risk such hard to diagnose errors?

So, IMHO, there should be at least a possibility to turn this off for
level types in config (it should be a visible option, so people could
find & try this before writing for help or changing a network card).


Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.0 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-05 21:49:46.0 +0200
@@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out_unlock;
 	kstat_cpu(cpu).irqs[irq]++;
 
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
 		desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru
 
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out_unlock:
 	spin_unlock(&desc->lock);
 }
@@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out;
-
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;
 
 	/*
-	 * If its disabled or no action available
+	 * If it's running, disabled or no action available
 	 * then mask it and get out of here:
 	 */
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		desc->status |= IRQ_PENDING;
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
@@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out:
 	desc->chip->eoi(irq);
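
The behavioural change in handle_fasteoi_irq() can be sketched in user
space as below. This is a simplified model under stated assumptions, not
the kernel code: the FE_* flags and the model_* functions are hypothetical
names, FE_MASKED stands in for the effect of chip->mask()/chip->unmask(),
and locking, eoi and the actual handler call are left out. It only shows
what happens to an irq that arrives while the previous one is still being
serviced: before the patch it is dropped without a trace, with the patch
it is masked and recorded as pending.

```c
/* Hypothetical flags mirroring IRQ_INPROGRESS, IRQ_DISABLED and
 * IRQ_PENDING; FE_MASKED models the state of the hardware mask. */
enum {
	FE_INPROGRESS = 1 << 0,
	FE_DISABLED   = 1 << 1,
	FE_PENDING    = 1 << 2,
	FE_MASKED     = 1 << 3,
};

struct model_desc {
	unsigned int status;
	int has_action;		/* non-zero if a driver handler is installed */
};

/* Entry check before the patch. Returns 1 if the driver handler would run. */
int model_fasteoi_enter_old(struct model_desc *d)
{
	if (d->status & FE_INPROGRESS)
		return 0;	/* "goto out": eoi only, the irq is not recorded */
	if (!d->has_action || (d->status & FE_DISABLED)) {
		d->status |= FE_PENDING | FE_MASKED;
		return 0;
	}
	d->status |= FE_INPROGRESS;
	d->status &= ~FE_PENDING;
	return 1;
}

/* Entry check with the patch: "in progress" is treated like "disabled". */
int model_fasteoi_enter_new(struct model_desc *d)
{
	if (!d->has_action || (d->status & (FE_INPROGRESS | FE_DISABLED))) {
		d->status |= FE_PENDING | FE_MASKED;
		return 0;
	}
	d->status |= FE_INPROGRESS;
	d->status &= ~FE_PENDING;
	return 1;
}

/* Exit path with the patch: clear INPROGRESS, unmask unless disabled. */
void model_fasteoi_exit_new(struct model_desc *d)
{
	d->status &= ~FE_INPROGRESS;
	if (!(d->status & FE_DISABLED))
		d->status &= ~FE_MASKED;
}
```

In the patched model the second irq leaves FE_PENDING set after the exit
path has run, which is the information a later resend would need.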
 


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-06 Thread Chuck Ebbert
On 08/06/2007 03:03 AM, Ingo Molnar wrote:
 
> But, since level types don't need this retriggers too much I think
> this don't mask interrupts by default idea should be rethinked:
> is there enough gain to risk such hard to diagnose errors?
   
 

I reverted those masking changes in Fedora and the baffling problem
with 3Com 3C905 network adapters went away.

Before, they would print:

eth0: transmit timed out, tx_status 00 status e601.
  diagnostics: net 0ccc media 8880 dma 003a fifo 
eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
  Flags; bus-master 1, dirty 295757(13) current 295757(13)
  Transmit list  vs. f7150a20.
  0: @f7150200  length 8070 status 0c010070
  1: @f71502a0  length 8070 status 0c010070
  2: @f7150340  length 805c status 0c01005c

Now they just work, apparently...

So why not just revert the change?



Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-06 Thread Ingo Molnar

* Chuck Ebbert [EMAIL PROTECTED] wrote:

> Before, they would print:
>
> eth0: transmit timed out, tx_status 00 status e601.
>   diagnostics: net 0ccc media 8880 dma 003a fifo 
> eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
>   Flags; bus-master 1, dirty 295757(13) current 295757(13)
>   Transmit list  vs. f7150a20.
>   0: @f7150200  length 8070 status 0c010070
>   1: @f71502a0  length 8070 status 0c010070
>   2: @f7150340  length 805c status 0c01005c
>
> Now they just work, apparently...

could you please try the patch below? If this doesnt do the trick then i 
guess we need to revert that change.

Ingo


(take 2)

Subject: genirq: fix simple and fasteoi irq handlers

After the "genirq: do not mask interrupts by default" patch, interrupts
should be disabled not immediately upon request, but after they happen.
But handle_simple_irq() and handle_fasteoi_irq() can skip this once or
more if an irq is just being serviced (IRQ_INPROGRESS), possibly
disrupting a driver's work.

Finding the main cause of the problems here, pointing out the broken
patch and making the first patch which can fix this were all done by
Marcin Slusarz. Additional test patches from Thomas Gleixner and Ingo
Molnar, tested by Marcin Slusarz, helped to narrow down the possible
causes even more. Thanks.

PS: this patch fixes only one evident error here, but there could be
more places affected by the above-mentioned change in irq handling.

PS 2:
After rethinking, IMHO, there are two most probable scenarios here:

1. After a hw resend there could be a conflict between the retriggered
edge type irq and the next level type one: e.g. if this level type
irq (io_apic is enabled then) is triggered while the retriggered irq is
being serviced (IRQ_INPROGRESS), there is a goto out with eoi, and
probably the next such levels are triggered and looping, so probably a
kind of flood in io_apic until this retriggered edge service has ended.
2. There is something wrong with ioapic_retrigger_irq (less probable
because this should probably be seen with 'normal' edge retriggers too,
but on the other hand, they could be less common).

So, if it is #1, this fixed patch should work.

But, since level types don't need these retriggers too much, I think
this "don't mask interrupts by default" idea should be rethought:
is there enough gain to risk such hard to diagnose errors?

So, IMHO, there should be at least a possibility to turn this off for
level types in config (it should be a visible option, so people could
find & try this before writing for help or changing a network card).


Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]

---

diff -Nurp 2.6.23-rc1-/kernel/irq/chip.c 2.6.23-rc1/kernel/irq/chip.c
--- 2.6.23-rc1-/kernel/irq/chip.c	2007-07-09 01:32:17.0 +0200
+++ 2.6.23-rc1/kernel/irq/chip.c	2007-08-05 21:49:46.0 +0200
@@ -295,12 +295,11 @@ handle_simple_irq(unsigned int irq, stru
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out_unlock;
 	kstat_cpu(cpu).irqs[irq]++;
 
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
 		desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
@@ -318,6 +317,8 @@ handle_simple_irq(unsigned int irq, stru
 
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out_unlock:
 	spin_unlock(&desc->lock);
 }
@@ -392,18 +393,16 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	spin_lock(&desc->lock);
 
-	if (unlikely(desc->status & IRQ_INPROGRESS))
-		goto out;
-
 	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);
 	kstat_cpu(cpu).irqs[irq]++;
 
 	/*
-	 * If its disabled or no action available
+	 * If it's running, disabled or no action available
 	 * then mask it and get out of here:
 	 */
 	action = desc->action;
-	if (unlikely(!action || (desc->status & IRQ_DISABLED))) {
+	if (unlikely(!action || (desc->status & (IRQ_INPROGRESS |
+						 IRQ_DISABLED)))) {
 		desc->status |= IRQ_PENDING;
 		if (desc->chip->mask)
 			desc->chip->mask(irq);
@@ -420,6 +419,8 @@ handle_fasteoi_irq(unsigned int irq, str
 
 	spin_lock(&desc->lock);
 	desc->status &= ~IRQ_INPROGRESS;
+	if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
+		desc->chip->unmask(irq);
 out:
 	desc->chip->eoi(irq);
 


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-06 Thread Jean-Baptiste Vignaud
> * Chuck Ebbert [EMAIL PROTECTED] wrote:
>
> > Before, they would print:
> >
> > eth0: transmit timed out, tx_status 00 status e601.
> >   diagnostics: net 0ccc media 8880 dma 003a fifo 
> > eth0: Interrupt posted but not delivered -- IRQ blocked by another device?
> >   Flags; bus-master 1, dirty 295757(13) current 295757(13)
> >   Transmit list  vs. f7150a20.
> >   0: @f7150200  length 8070 status 0c010070
> >   1: @f71502a0  length 8070 status 0c010070
> >   2: @f7150340  length 805c status 0c01005c
> >
> > Now they just work, apparently...
>
> could you please try the patch below? If this doesnt do the trick then i
> guess we need to revert that change.

I confirm that the latest fedora kernel 2.6.22.1-41.fc7 (with the removal of
"[PATCH] genirq: do not mask interrupts by default") has been working on my
machine for 3 days.

Atm I'm still stressing the network (2 * 3com cards + 1 onboard nvidia card)
to be sure.

Jb





Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-06 Thread Jean-Baptiste Vignaud
Mmm, bad news: after 4 hours of intensive network stressing, one of the two
3com cards failed with the latest fedora kernel.

Aug  6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out
Aug  6 22:31:09 loki kernel: eth2: transmit timed out, tx_status 00 status e601.
Aug  6 22:31:09 loki kernel:   diagnostics: net 0ccc media 8880 dma 003a fifo 8000
Aug  6 22:31:09 loki kernel: eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
Aug  6 22:31:09 loki kernel:   Flags; bus-master 1, dirty 26085000(8) current 26085000(8)
Aug  6 22:31:09 loki kernel:   Transmit list  vs. 81007c807700.

I was stressing eth2 by copying large files to a samba share and eth0 by
downloading big files from the internet.

Jb

 

Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-06 Thread Chuck Ebbert
On 08/06/2007 04:42 PM, Jean-Baptiste Vignaud wrote:
> Mmm, bad news, after 4 hours of intensive network stressing, one of the 2
> 3com card failed with the latest fedora kernel.
>
> Aug  6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out
> Aug  6 22:31:09 loki kernel: eth2: transmit timed out, tx_status 00 status e601.
> Aug  6 22:31:09 loki kernel:   diagnostics: net 0ccc media 8880 dma 003a fifo 8000
> Aug  6 22:31:09 loki kernel: eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
> Aug  6 22:31:09 loki kernel:   Flags; bus-master 1, dirty 26085000(8) current 26085000(8)
> Aug  6 22:31:09 loki kernel:   Transmit list  vs. 81007c807700.
>
> Stressing eth2 by copying large files on a samba on share and eth0 by
> downloading big files on the internet.

So even the full revert doesn't fix the 3Com driver, it just makes it less
likely to do that.

The other patch probably won't be any better -- I'd guess there's some
kind of IRQ handling bug in that driver.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-06 Thread Al Boldi
Jean-Baptiste Vignaud wrote:
> Mmm, bad news, after 4 hours of intensive network stressing, one of the 2
> 3com card failed with the latest fedora kernel.
>
> Aug  6 22:31:09 loki kernel: NETDEV WATCHDOG: eth2: transmit timed out
> Aug  6 22:31:09 loki kernel: eth2: transmit timed out, tx_status 00 status e601.
> Aug  6 22:31:09 loki kernel:   diagnostics: net 0ccc media 8880 dma 003a fifo 8000
> Aug  6 22:31:09 loki kernel: eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
> Aug  6 22:31:09 loki kernel:   Flags; bus-master 1, dirty 26085000(8) current 26085000(8)
> Aug  6 22:31:09 loki kernel:   Transmit list  vs. 81007c807700.
>
> Stressing eth2 by copying large files on a samba on share and eth0 by
> downloading big files on the internet.

Next time you want to stress your network you may want to try this:

  # ping 10.1 -s8 -f -l9

or

  # ping 10.1 -s8 -A > /dev/null

BTW, I mentioned this before: there may be a BIOS irq config mismatch before
booting the kernel.


Thanks!

--
Al



Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-01 Thread Marcin Ślusarz
2007/7/30, Ingo Molnar [EMAIL PROTECTED]:
> (..)
> does the patch below fix those timeouts? It tests the theory whether any
> POST latency could expose this problem.
>
>         Ingo
>
> Index: linux/drivers/net/lib8390.c
> ===================================================================
> --- linux.orig/drivers/net/lib8390.c
> +++ linux/drivers/net/lib8390.c
> @@ -375,6 +375,8 @@ static int ei_start_xmit(struct sk_buff
>  	/* Turn 8390 interrupts back on. */
>  	ei_local->irqlock = 0;
>  	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
> +	/* force POST: */
> +	ei_inb_p(e8390_base + EN0_IMR);
>
>  	spin_unlock(&ei_local->page_lock);
>  	enable_irq_lockdep_irqrestore(dev->irq, &flags);


Bad news. It doesn't fix the problem.

Marcin


Re: 2.6.20-2.6.21 - networking dies after random time

2007-08-01 Thread Ingo Molnar

* Marcin Ślusarz [EMAIL PROTECTED] wrote:

> >  	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
> > +	/* force POST: */
> > +	ei_inb_p(e8390_base + EN0_IMR);
> >
> >  	spin_unlock(&ei_local->page_lock);
> >  	enable_irq_lockdep_irqrestore(dev->irq, &flags);
> >
> >
> Bad news. It doesn't fix the problem.

ok, it wasnt supposed to be _that_ easy i guess :-) Can you please 
(re-)confirm that the workaround below indeed fixes the hung card 
problem? (after producing a single WARN_ON message into the syslog)

also, does removing the ne2k-pci module and reinserting it again solve 
the problem too, or is your network card stuck forever once it got into 
that state?

Ingo

---
From: Thomas Gleixner [EMAIL PROTECTED]
Subject: genirq: temporary fix for level-triggered IRQ resend

delayed disable relies on the ability to re-trigger the interrupt in the
case that a real interrupt happens after the software disable was set.
In this case we actually disable the interrupt on the hardware level
_after_ it occurred.

On enable_irq, we need to re-trigger the interrupt. On i386 this relies
on a hardware resend mechanism (send_IPI_self()). 

Actually we only need the resend for edge type interrupts. Level type
interrupts come back once enable_irq() re-enables the interrupt line.

I assume that the interrupt in question is level triggered because it is
shared and above the legacy irqs 0-15:

17: 12   IO-APIC-fasteoi   eth1, eth0

Looking into the IO_APIC code, the resend via send_IPI_self() happens
unconditionally. So the resend is done for level and edge interrupts.
This makes the problem more mysterious.

The code in question lib8390.c does

disable_irq();
fiddle_with_the_network_card_hardware()
enable_irq();

The fiddle_with_the_network_card_hardware() might cause interrupts,
which are cleared in the same code path again.

Marcin found that when he disables the irq line on the hardware level
(removing the delayed disable) the card is kept alive.

So the difference is that we can get a resend on enable_irq, when an
interrupt happens during the time, where we are in the disabled region.

Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
---
 kernel/irq/resend.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux/kernel/irq/resend.c
===================================================================
--- linux.orig/kernel/irq/resend.c
+++ linux/kernel/irq/resend.c
@@ -62,6 +62,15 @@ void check_irq_resend(struct irq_desc *d
 	 */
 	desc->chip->enable(irq);
 
+	/*
+	 * Temporary hack to figure out more about the problem, which
+	 * is causing the ancient network cards to die.
+	 */
+	if (desc->handle_irq != handle_edge_irq) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
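
The delayed-disable behaviour described in the changelog above can be
sketched as a small user-space model. This is an illustration under
stated assumptions, not the genirq implementation: struct dd_irq, the
DD_* flags and the dd_* functions are hypothetical names, and the
resend_level parameter is an invented switch that selects between
"resend for every irq type" and "resend for edge type only" (the latter
being what the temporary hack effectively does for non-edge handlers).

```c
#include <stdbool.h>

/* Hypothetical status bits for the model. */
enum {
	DD_DISABLED = 1 << 0,	/* software disable requested */
	DD_PENDING  = 1 << 1,	/* an irq arrived while disabled */
	DD_MASKED   = 1 << 2,	/* line masked at the (simulated) hardware */
};

struct dd_irq {
	unsigned int status;
	unsigned int depth;	/* nesting count of disable calls */
	unsigned int handled;	/* times the driver handler ran */
	unsigned int resent;	/* replays triggered by enable */
	bool level_triggered;
};

/* Delayed disable: only note the request, the hardware stays unmasked. */
void dd_disable_irq(struct dd_irq *d)
{
	if (d->depth++ == 0)
		d->status |= DD_DISABLED;
}

/* An irq arrives. If software-disabled, mask now and remember it. */
void dd_hw_interrupt(struct dd_irq *d)
{
	if (d->status & DD_MASKED)
		return;			/* blocked at the hardware level */
	if (d->status & DD_DISABLED) {
		d->status |= DD_PENDING | DD_MASKED;
		return;
	}
	d->handled++;
}

/* Re-enable: unmask, and replay a remembered irq if the policy says so. */
void dd_enable_irq(struct dd_irq *d, bool resend_level)
{
	if (d->depth == 0 || --d->depth > 0)
		return;
	d->status &= ~(DD_DISABLED | DD_MASKED);
	if (!(d->status & DD_PENDING))
		return;
	d->status &= ~DD_PENDING;
	if (!d->level_triggered || resend_level) {
		d->resent++;
		d->handled++;
	}
}
```

A level-triggered device that is still asserting its line would raise
the irq again by itself after the unmask; the model leaves that out,
which is why skipping the resend for level types loses nothing there.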
 



Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-31 Thread Jarek Poplawski
On Mon, Jul 30, 2007 at 09:29:38AM +0200, Marcin Ślusarz wrote:
...
> ps: I retested all patches posted in this thread on top of 2.6.22.1
> and behavior from 2.6.21.3 didn't changed. My next tests will be on
> 2.6.22.x only.

Marcin,

I see you're quite busy, but if you are still alive after testing this
next patch of Ingo's, maybe you could try one more idea? No patch this
time, but could you try this after adding the boot option "noirqdebug"
(I'd like to be sure it's not about timing after all).

Cheers & thanks,
Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-30 Thread Marcin Ślusarz
2007/7/26, Ingo Molnar [EMAIL PROTECTED]:
> (..)
> yeah - i meant to cover both arches but forgot about x86_64 - updated
> patch attached below.
>
>         Ingo
>
> -
> Subject: x86: activate HARDIRQS_SW_RESEND
> From: Ingo Molnar [EMAIL PROTECTED]
>
> activate the software-triggered IRQ-resend logic.
>
> it appears some chipsets/cpus do not handle local-APIC driven IRQ
> resends all that well, so always use the soft mechanism to trigger
> the execution of pending interrupts.
>
> Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
> ---
>  arch/i386/Kconfig   |    4 ++++
>  arch/x86_64/Kconfig |    4 ++++
>  kernel/irq/manage.c |    8 ++++++++
>  3 files changed, 16 insertions(+)
>
> Index: linux-rt-rebase.q/arch/i386/Kconfig
> ===================================================================
> --- linux-rt-rebase.q.orig/arch/i386/Kconfig
> +++ linux-rt-rebase.q/arch/i386/Kconfig
> @@ -1284,6 +1284,10 @@ config GENERIC_PENDING_IRQ
>  	depends on GENERIC_HARDIRQS && SMP
>  	default y
>
> +config HARDIRQS_SW_RESEND
> +	bool
> +	default y
> +
>  config X86_SMP
>  	bool
>  	depends on SMP && !X86_VOYAGER
> Index: linux-rt-rebase.q/arch/x86_64/Kconfig
> ===================================================================
> --- linux-rt-rebase.q.orig/arch/x86_64/Kconfig
> +++ linux-rt-rebase.q/arch/x86_64/Kconfig
> @@ -721,6 +721,10 @@ config GENERIC_PENDING_IRQ
>  	depends on GENERIC_HARDIRQS && SMP
>  	default y
>
> +config HARDIRQS_SW_RESEND
> +	bool
> +	default y
> +
>  menu "Power management options"
>
>  source kernel/power/Kconfig
> Index: linux-rt-rebase.q/kernel/irq/manage.c
> ===================================================================
> --- linux-rt-rebase.q.orig/kernel/irq/manage.c
> +++ linux-rt-rebase.q/kernel/irq/manage.c
> @@ -175,6 +175,14 @@ void enable_irq(unsigned int irq)
>  		desc->depth--;
>  	}
>  	spin_unlock_irqrestore(&desc->lock, flags);
> +#ifdef CONFIG_HARDIRQS_SW_RESEND
> +	/*
> +	 * Do a bh disable/enable pair to trigger any pending
> +	 * irq resend logic:
> +	 */
> +	local_bh_disable();
> +	local_bh_enable();
> +#endif
>  }
>  EXPORT_SYMBOL(enable_irq);

This patch didn't help (tested on 2.6.22.1) - ne2k_pci timed out.

ps: I retested all patches posted in this thread on top of 2.6.22.1
and the behavior is unchanged from 2.6.21.3. My next tests will be on
2.6.22.x only.

Regards,
Marcin Slusarz


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-30 Thread Ingo Molnar

* Alan Cox [EMAIL PROTECTED] wrote:

> Ok the logic behind the 8390 is very simple:

thanks for the explanation Alan! A few comments and a question:

> Things to know
>   - IRQ delivery is asynchronous to the PCI bus
>   - Blocking the local CPU IRQ via spin locks was too slow
>   - The chip has register windows needing locking work
>
> So the path was once (I say once as people appear to have changed it
> in the mean time and it now looks rather bogus if the changes to use
> disable_irq_nosync_irqsave are disabling the local IRQ)
>
>
>   Take the page lock
>   Mask the IRQ on chip
>   Disable the IRQ (but not mask locally- someone seems to have
>   broken this with the lock validator stuff)
>   [This must be _nosync as the page lock may otherwise
>   deadlock us]

( side-note: you can ignore the lock validator stuff here, the validator
  changes are supposed to be a NOP in the !lockdep case. Local irqs will
  only be disabled if the validator is running. This could cause dropped
  serial irqs on very old boxes but i doubt anyone will want to run the
  validator on those. )

   Drop the page lock and turn IRQs back on
   
   At this point an existing IRQ may still be running but we can't
   get a new one
 
   Take the lock (so we know the IRQ has terminated) but don't mask
 the IRQs on the processor
   Set irqlock [for debug]
 
   Transmit (slow as )
 
   re-enable the IRQ
 
 
 We have to use disable_irq because otherwise you will get delayed 
 interrupts on the APIC bus deadlocking the transmit path.
 
 Quite hairy but the chip simply wasn't designed for SMP and you can't 
 even ACK an interrupt without risking corrupting other parallel 
 activities on the chip.

So the whole locking is to be able to keep irqs enabled for a long time, 
without risking entry of the same IRQ handler on this same CPU, correct?

Marcin's test results suggest that if an IRQ is resent right at the 
enable_irq() point [be that via the hw irq-resend mechanism or the sw 
irq-resend mechanism], the hang happens.

In the previous 2.6.20 logic we'd not normally generate an IRQ at that 
point (because we masked the irq and the card itself deasserts the line 
so any level-triggered irq is now moot).

Once Thomas hacked off this resend mechanism for level-triggered irqs, 
Marcin saw the hangs go away.

So it seems to me that maybe the driver could be surprised by these 
spurious interrupts that happen right after the enable_irq(). Does the 
patch below make any sense in your opinion?

Ingo

Index: linux/drivers/net/lib8390.c
===
--- linux.orig/drivers/net/lib8390.c
+++ linux/drivers/net/lib8390.c
@@ -375,6 +375,8 @@ static int ei_start_xmit(struct sk_buff 
/* Turn 8390 interrupts back on. */
ei_local->irqlock = 0;
ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
+   /* force POST: */
+   ei_inb_p(e8390_base + EN0_IMR);
 
spin_unlock(&ei_local->page_lock);
enable_irq_lockdep_irqrestore(dev->irq, &flags);


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-30 Thread Ingo Molnar

* Marcin Ślusarz [EMAIL PROTECTED] wrote:

  Subject: x86: activate HARDIRQS_SW_RESEND
  From: Ingo Molnar [EMAIL PROTECTED]
 
  activate the software-triggered IRQ-resend logic.

 This patch didn't help (tested on 2.6.22.1) - ne2k_pci timed out.

ok. This makes it more likely that the driver itself (or the card) gets 
confused by the resend.

does the patch below fix those timeouts? It tests the theory whether any 
POST latency could expose this problem.

Ingo

Index: linux/drivers/net/lib8390.c
===
--- linux.orig/drivers/net/lib8390.c
+++ linux/drivers/net/lib8390.c
@@ -375,6 +375,8 @@ static int ei_start_xmit(struct sk_buff 
/* Turn 8390 interrupts back on. */
ei_local->irqlock = 0;
ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
+   /* force POST: */
+   ei_inb_p(e8390_base + EN0_IMR);
 
spin_unlock(&ei_local->page_lock);
enable_irq_lockdep_irqrestore(dev->irq, &flags);
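
A minimal user-space sketch of the posting effect this patch probes. It is
not the lib8390 code: struct fake_dev, fake_write(), fake_read() and
unmask_and_flush() are hypothetical names, and the model assumes a bridge
that holds one write back until the next read from the device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one register behind a bridge that may post writes. */
struct fake_dev {
    uint8_t imr;        /* value the device actually sees */
    uint8_t posted;     /* write still sitting in the bridge */
    bool    has_posted; /* true while a write is in flight */
};

/* A posted write returns at once; the device has not seen it yet. */
static void fake_write(struct fake_dev *d, uint8_t val)
{
    d->posted = val;
    d->has_posted = true;
}

/* A read cannot complete until earlier writes have reached the device. */
static uint8_t fake_read(struct fake_dev *d)
{
    if (d->has_posted) {
        d->imr = d->posted;
        d->has_posted = false;
    }
    return d->imr;
}

/* Unmask, then read back so the unmask has landed before we return. */
static void unmask_and_flush(struct fake_dev *d, uint8_t mask)
{
    fake_write(d, mask);
    (void)fake_read(d);
    assert(!d->has_posted); /* nothing may still be in flight here */
}
```

Without the read-back, the caller could go on to re-enable the IRQ line
while the chip's mask register still holds the old value; whether that
matters for this bug is exactly what the test patch is meant to find out.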


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-30 Thread Alan Cox
 So the whole locking is to be able to keep irqs enabled for a long time, 
 without risking entry of the same IRQ handler on this same CPU, correct?

As implemented - on any CPU.

We also need to know that the IRQ handler is not doing useful work on
another processor which is why we take the lock after disabling the
interrupt line everywhere. Without that we might be completing an IRQ on
another CPU and that would race the transmit and make a nasty mess.

 So it seems to me that maybe the driver could be surprised via these 
 spurious interrupts that happen right after the enable_irq(). Does the 
 patch below make any sense in your opinion?

For MMIO it does look like that may be needed. Looks sensible.



Re: [PATCH][netdrvr] lib8390: comment on locking by Alan Cox Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-30 Thread Jeff Garzik

Jarek Poplawski wrote:

Hi,

Very below is my patch proposal with a comment, which in my opinion
is precious enough to save it for future help in reading and
understanding the code.

I hope Alan will not blame me I've not asked for his permission before
sending, and he would ack this patch as it is or at least most of this.

Thanks & regards,
Jarek P.

On Wed, Jul 25, 2007 at 03:46:56PM +0100, Alan Cox wrote:

The code in question lib8390.c does

disable_irq();
fiddle_with_the_network_card_hardware()
enable_irq();

...

No idea how this affects the network card, as the code there must be
able to handle interrupts, which are not originated from the card due to
interrupt sharing.

I think Ingo could still be right with yesterday's patch!
The comment at the beginning points out that this is done like that
because of the chip's slowness. And problems with timing are mysterious.

On the other hand, the author of this code didn't use spin_lock_irqsave
for some reason, probably after testing this option too. So, I hope
this is the right path, but alas, I'm not sure this patch will
prove this 100%.

The author (me) didn't use spin_lock_irqsave because the slowness of the
card means that approach caused horrible problems like losing serial data
at 38400 baud on some chips. Remember many 8390 nics on PCI were ISA
chips with FPGA front ends.


Anyway, in my opinion this situation where interrupts could/have_to
be used for such strange things should confirm the need of more
options for handling irqs individually.

Ok the logic behind the 8390 is very simple:

Things to know
- IRQ delivery is asynchronous to the PCI bus
- Blocking the local CPU IRQ via spin locks was too slow
- The chip has register windows needing locking work

So the path was once (I say once as people appear to have changed it
in the mean time and it now looks rather bogus if the changes to use
disable_irq_nosync_irqsave are disabling the local IRQ)


Take the page lock
Mask the IRQ on chip
Disable the IRQ (but not mask locally- someone seems to have
broken this with the lock validator stuff)
[This must be _nosync as the page lock may otherwise
deadlock us]
Drop the page lock and turn IRQs back on

At this point an existing IRQ may still be running but we can't
get a new one

Take the lock (so we know the IRQ has terminated) but don't mask
the IRQs on the processor
Set irqlock [for debug]

Transmit (slow as )

re-enable the IRQ


We have to use disable_irq because otherwise you will get delayed
interrupts on the APIC bus deadlocking the transmit path.

Quite hairy but the chip simply wasn't designed for SMP and you can't
even ACK an interrupt without risking corrupting other parallel
activities on the chip.

Alan


--

From: Jarek Poplawski [EMAIL PROTECTED]

Subject: lib8390: comment on locking by Alan Cox

Additional explanation of problems with locking by Alan Cox.

Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]
Cc: Alan Cox [EMAIL PROTECTED]
Cc: Paul Gortmaker [EMAIL PROTECTED]
Cc: Jeff Garzik [EMAIL PROTECTED]


applied




Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-26 Thread Jarek Poplawski
On Thu, Jul 26, 2007 at 10:31:20AM +0200, Ingo Molnar wrote:
...
 yeah. The patch below enables sw-resend on x86, to test the theory 
 whether the APIC-driven hardware-vector-resend code has some problem.

I think Marcin is using x86_64 (Athlon 64), though.

Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-26 Thread Ingo Molnar

* Marcin Ślusarz [EMAIL PROTECTED] wrote:

 2007/7/25, Thomas Gleixner [EMAIL PROTECTED]:
 (...)
 
 I've tested Jarek's patch, 2 Ingo's patches (2nd and 3rd) and Thomas' 
 patch (one patch at time of course) - all of them fixed the problem, 
 but the last one flooded my logs with "Skip resend for irq 17". All 
 tests were done on 2.6.21.3.

that's great! I think we have two good theories about what might be 
going on:

 - the driver might be buggy in that it gets confused by the 'resent' 
   irq.

 - or the chipset/cpu has a bug where it might get confused about the
   resent APIC vector getting mixed up with the same vector coming
   externally too. (Now, it makes little sense to 'resend' a
   level-triggered interrupt on x86 platforms that have flat PIC 
   hierarchies (other architectures might need more than that to
   retrigger an interrupt) - but there's nothing wrong about it in 
   theory and it needs fixing for edge irqs anyway.)

in any case, the problem was triggered by our change generating many 
more resent irqs than before. Nevertheless we'd like to fix that resend 
bug (and if the driver is buggy, the driver bug too). It's really good 
progress so far - we are working on doing the real fix now.

Ingo


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-26 Thread Ingo Molnar

* Thomas Gleixner [EMAIL PROTECTED] wrote:

 The other question is:
 
 Is the driver confused by the resent irq or is the chip-set unhappy 
 about the resend ?
 
 We could figure the latter out by activating the software based resend 
 method.

yeah. The patch below enables sw-resend on x86, to test the theory 
whether the APIC-driven hardware-vector-resend code has some problem.

Marcin, could you please give this one a try too? Good behavior would be 
a fully working kernel (no hung device) with no extra kernel messages. 
Bad behavior would be any extra kernel message or any non-working 
device.

Ingo

-
Subject: x86: activate HARDIRQS_SW_RESEND
From: Ingo Molnar [EMAIL PROTECTED]

activate the software-triggered IRQ-resend logic.

it appears some chipsets/cpus do not handle local-APIC driven IRQ
resends all that well, so always use the soft mechanism to trigger
the execution of pending interrupts.

Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
---
 arch/i386/Kconfig   |    4 ++++
 kernel/irq/manage.c |    8 ++++++++
 2 files changed, 12 insertions(+)

Index: linux/arch/i386/Kconfig
===
--- linux.orig/arch/i386/Kconfig
+++ linux/arch/i386/Kconfig
@@ -1270,6 +1270,10 @@ config GENERIC_PENDING_IRQ
depends on GENERIC_HARDIRQS && SMP
default y
 
+config HARDIRQS_SW_RESEND
+   bool
+   default y
+
 config X86_SMP
bool
depends on SMP && !X86_VOYAGER
Index: linux/kernel/irq/manage.c
===
--- linux.orig/kernel/irq/manage.c
+++ linux/kernel/irq/manage.c
@@ -181,6 +181,14 @@ void enable_irq(unsigned int irq)
desc->depth--;
}
spin_unlock_irqrestore(&desc->lock, flags);
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+   /*
+* Do a bh disable/enable pair to trigger any pending
+* irq resend logic:
+*/
+   local_bh_disable();
+   local_bh_enable();
+#endif
 }
 EXPORT_SYMBOL(enable_irq);
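
The depth counting visible in this hunk can be pictured with a small
user-space model. This is only a sketch under stated assumptions: struct
fake_desc and the fake_*() helpers are hypothetical, and the pending
interrupt is replayed directly on the outermost enable rather than through
the bottom-half mechanism that the patch pokes with its
local_bh_disable()/local_bh_enable() pair.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical descriptor; only the fields needed for the sketch. */
struct fake_desc {
    unsigned int depth; /* nesting count of disable calls */
    bool pending;       /* an IRQ was seen while disabled */
    int  handled;       /* times the handler ran */
};

static void fake_disable_irq(struct fake_desc *d)
{
    d->depth++;
}

/* Interrupt arrives: run now if enabled, otherwise remember it. */
static void fake_raise(struct fake_desc *d)
{
    if (d->depth == 0)
        d->handled++;
    else
        d->pending = true;
}

/* Only the outermost enable re-opens the line and replays a pending IRQ. */
static void fake_enable_irq(struct fake_desc *d)
{
    assert(d->depth > 0); /* an unbalanced enable is a caller bug */
    if (--d->depth == 0 && d->pending) {
        d->pending = false;
        d->handled++;     /* software resend */
    }
}
```

With two nested disables, an interrupt raised in between is handled
exactly once, and only after the second enable.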
 


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-26 Thread Thomas Gleixner
On Thu, 2007-07-26 at 10:13 +0200, Jarek Poplawski wrote:
  I wanted to test them all on 2.6.22.1, but I didn't have enough time.
  I've verified only that 2.6.22.1 has the same problem. I can test it
  later, but I can report results back at beginning of next week.
 
 
 So, everything is clear - any changes are good!
 Except the signed-off ones... 
 
 Thanks Marcin,
 Jarek P.
 
 PS: Now, it seems to me Thomas could be the nearest. BTW, could somebody
 give me some tip, how these re-triggered interrupts are skipped on dev's
 reset before enable_irq?

I think the correct solution is really not to resend level type
interrupts. If the interrupt line is still active, then the interrupt
comes up by itself. I'm cooking a patch for that.

The other question is: 

Is the driver confused by the resent irq or is the chip-set unhappy
about the resend ?

We could figure the latter out by activating the software based resend
method.

tglx
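
The idea of not resending level-type interrupts can be sketched as below.
This is an illustration only, not the patch being prepared: struct fake_irq
and fake_enable() are hypothetical, and the model assumes that an asserted
level line re-raises the interrupt by itself as soon as the line is
enabled again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified IRQ state; not the kernel's struct irq_desc. */
struct fake_irq {
    bool level_triggered; /* line stays asserted until the device is serviced */
    bool pending;         /* an interrupt arrived while the line was disabled */
    bool line_asserted;   /* current state of the device's interrupt line */
    int  resends;         /* how many times software re-injected the IRQ */
    int  deliveries;      /* how many times the handler would run */
};

/* Called when the IRQ is re-enabled. */
static void fake_enable(struct fake_irq *irq)
{
    if (irq->level_triggered) {
        /* No resend: an asserted level line comes up by itself. */
        if (irq->line_asserted)
            irq->deliveries++;
    } else if (irq->pending) {
        /* An edge is gone once missed, so it has to be replayed. */
        irq->resends++;
        irq->deliveries++;
    }
    irq->pending = false;
    assert(irq->resends <= irq->deliveries); /* every resend is a delivery */
}
```

In this model a level interrupt whose line was deasserted in the meantime
(for example because the driver masked or reset the card) produces no
handler run at all after the enable.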




Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-26 Thread Jarek Poplawski
On Thu, Jul 26, 2007 at 09:16:10AM +0200, Marcin Ślusarz wrote:
 2007/7/25, Thomas Gleixner [EMAIL PROTECTED]:
 (...)
 
 I've tested Jarek's patch, 2 Ingo's patches (2nd and 3rd) and Thomas'
 patch (one patch at time of course) - all of them fixed the problem,
 but the last one flooded my logs with "Skip resend for irq 17". All
 tests were done on 2.6.21.3.
 
 I wanted to test them all on 2.6.22.1, but I didn't have enough time.
 I've verified only that 2.6.22.1 has the same problem. I can test it
 later, but I can report results back at beginning of next week.


So, everything is clear - any changes are good!
Except the signed-off ones... 

Thanks Marcin,
Jarek P.

PS: Now, it seems to me Thomas could be the nearest. BTW, could somebody
give me some tip, how these re-triggered interrupts are skipped on dev's
reset before enable_irq?


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-26 Thread Jarek Poplawski
On Thu, Jul 26, 2007 at 10:10:31AM +0200, Thomas Gleixner wrote:
 On Thu, 2007-07-26 at 10:13 +0200, Jarek Poplawski wrote:
...
  PS: Now, it seems to me Thomas could be the nearest. BTW, could somebody
  give me some tip, how these re-triggered interrupts are skipped on dev's
  reset before enable_irq?
 
 I think the correct solution is really not to resend level type
 interrupts. If the interrupt line is still active, then the interrupt
 comes up by itself. I'm cooking a patch for that.
 
 The other question is: 
 
 Is the driver confused by the resent irq or is the chip-set unhappy
 about the resend ?
 
 We could figure the latter out by activating the software based resend
 method.

Probably I miss something, but isn't there any problem with level type,
when APIC re-triggers an interrupt, which is not acked by driver nor
card (after some hw reset/clear)?

Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-26 Thread Ingo Molnar

* Jarek Poplawski [EMAIL PROTECTED] wrote:

 On Thu, Jul 26, 2007 at 10:31:20AM +0200, Ingo Molnar wrote:
 ...
  yeah. The patch below enables sw-resend on x86, to test the theory 
  whether the APIC-driven hardware-vector-resend code has some problem.
 
 I think Marcin is using x86_64 (Athlon 64), though.

yeah - i meant to cover both arches but forgot about x86_64 - updated 
patch attached below.

Ingo

-
Subject: x86: activate HARDIRQS_SW_RESEND
From: Ingo Molnar [EMAIL PROTECTED]

activate the software-triggered IRQ-resend logic.

it appears some chipsets/cpus do not handle local-APIC driven IRQ
resends all that well, so always use the soft mechanism to trigger
the execution of pending interrupts.

Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
---
 arch/i386/Kconfig   |    4 ++++
 arch/x86_64/Kconfig |    4 ++++
 kernel/irq/manage.c |    8 ++++++++
 3 files changed, 16 insertions(+)

Index: linux-rt-rebase.q/arch/i386/Kconfig
===
--- linux-rt-rebase.q.orig/arch/i386/Kconfig
+++ linux-rt-rebase.q/arch/i386/Kconfig
@@ -1284,6 +1284,10 @@ config GENERIC_PENDING_IRQ
depends on GENERIC_HARDIRQS && SMP
default y
 
+config HARDIRQS_SW_RESEND
+   bool
+   default y
+
 config X86_SMP
bool
depends on SMP && !X86_VOYAGER
Index: linux-rt-rebase.q/arch/x86_64/Kconfig
===
--- linux-rt-rebase.q.orig/arch/x86_64/Kconfig
+++ linux-rt-rebase.q/arch/x86_64/Kconfig
@@ -721,6 +721,10 @@ config GENERIC_PENDING_IRQ
depends on GENERIC_HARDIRQS && SMP
default y
 
+config HARDIRQS_SW_RESEND
+   bool
+   default y
+
 menu "Power management options"
 
 source kernel/power/Kconfig
Index: linux-rt-rebase.q/kernel/irq/manage.c
===
--- linux-rt-rebase.q.orig/kernel/irq/manage.c
+++ linux-rt-rebase.q/kernel/irq/manage.c
@@ -175,6 +175,14 @@ void enable_irq(unsigned int irq)
desc->depth--;
}
spin_unlock_irqrestore(&desc->lock, flags);
+#ifdef CONFIG_HARDIRQS_SW_RESEND
+   /*
+* Do a bh disable/enable pair to trigger any pending
+* irq resend logic:
+*/
+   local_bh_disable();
+   local_bh_enable();
+#endif
 }
 EXPORT_SYMBOL(enable_irq);
 


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-26 Thread Jarek Poplawski
On Thu, Jul 26, 2007 at 10:13:26AM +0200, Jarek Poplawski wrote:
...
 So, everything is clear - any changes are good!
 Except the signed-off ones... 

Oops! Marcin's patch was both signed-off and good.
So, there is probably something more...

Sorry Marcin,
Jarek P.


Re: [PATCH][netdrvr] lib8390: comment on locking by Alan Cox Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-26 Thread Alan Cox
On Thu, 26 Jul 2007 14:44:01 +0200
Jarek Poplawski [EMAIL PROTECTED] wrote:

 Hi,
 
 Very below is my patch proposal with a comment, which in my opinion
 is precious enough to save it for future help in reading and
 understanding the code.
 
 I hope Alan will not blame me I've not asked for his permission before
 sending, and he would ack this patch as it is or at least most of this.

Fine by me



[PATCH][netdrvr] lib8390: comment on locking by Alan Cox Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-26 Thread Jarek Poplawski
Hi,

Very below is my patch proposal with a comment, which in my opinion
is precious enough to save it for future help in reading and
understanding the code.

I hope Alan will not blame me I've not asked for his permission before
sending, and he would ack this patch as it is or at least most of this.

Thanks & regards,
Jarek P.

On Wed, Jul 25, 2007 at 03:46:56PM +0100, Alan Cox wrote:
   The code in question lib8390.c does
   
 disable_irq();
 fiddle_with_the_network_card_hardware()
 enable_irq();
  ...
   
   No idea how this affects the network card, as the code there must be
   able to handle interrupts, which are not originated from the card due to
   interrupt sharing.
  
  I think Ingo could still be right with yesterday's patch!
  The comment at the beginning points out that this is done like that
  because of the chip's slowness. And problems with timing are mysterious.
  
  On the other hand, the author of this code didn't use spin_lock_irqsave
  for some reason, probably after testing this option too. So, I hope
  this is the right path, but alas, I'm not sure this patch will
  prove this 100%.
 
 The author (me) didn't use spin_lock_irqsave because the slowness of the
 card means that approach caused horrible problems like losing serial data
 at 38400 baud on some chips. Remember many 8390 nics on PCI were ISA
 chips with FPGA front ends.
 
  Anyway, in my opinion this situation where interrupts could/have_to
  be used for such strange things should confirm the need of more
  options for handling irqs individually.
 
 Ok the logic behind the 8390 is very simple:
 
 Things to know
   - IRQ delivery is asynchronous to the PCI bus
   - Blocking the local CPU IRQ via spin locks was too slow
   - The chip has register windows needing locking work
 
 So the path was once (I say once as people appear to have changed it
 in the mean time and it now looks rather bogus if the changes to use
 disable_irq_nosync_irqsave are disabling the local IRQ)
 
 
   Take the page lock
   Mask the IRQ on chip
   Disable the IRQ (but not mask locally- someone seems to have
   broken this with the lock validator stuff)
   [This must be _nosync as the page lock may otherwise
   deadlock us]
   Drop the page lock and turn IRQs back on
   
   At this point an existing IRQ may still be running but we can't
   get a new one
 
   Take the lock (so we know the IRQ has terminated) but don't mask
 the IRQs on the processor
   Set irqlock [for debug]
 
   Transmit (slow as )
 
   re-enable the IRQ
 
 
 We have to use disable_irq because otherwise you will get delayed
 interrupts on the APIC bus deadlocking the transmit path.
 
 Quite hairy but the chip simply wasn't designed for SMP and you can't
 even ACK an interrupt without risking corrupting other parallel
 activities on the chip.
 
 Alan
 
--

From: Jarek Poplawski [EMAIL PROTECTED]

Subject: lib8390: comment on locking by Alan Cox

Additional explanation of problems with locking by Alan Cox.

Signed-off-by: Jarek Poplawski [EMAIL PROTECTED]
Cc: Alan Cox [EMAIL PROTECTED]
Cc: Paul Gortmaker [EMAIL PROTECTED]
Cc: Jeff Garzik [EMAIL PROTECTED]

---

diff -Nurp 2.6.23-rc1-/drivers/net/lib8390.c 2.6.23-rc1/drivers/net/lib8390.c
--- 2.6.23-rc1-/drivers/net/lib8390.c   2007-07-09 01:32:17.000000000 +0200
+++ 2.6.23-rc1/drivers/net/lib8390.c    2007-07-26 13:55:17.000000000 +0200
@@ -143,6 +143,52 @@ static void __NS8390_init(struct net_dev
  * annoying the transmit function is called bh atomic. That places
  * restrictions on the user context callers as disable_irq won't save
  * them.
+ *
+ * Additional explanation of problems with locking by Alan Cox:
+ *
+ * The author (me) didn't use spin_lock_irqsave because the slowness of the
+ * card means that approach caused horrible problems like losing serial data
+ * at 38400 baud on some chips. Remember many 8390 nics on PCI were ISA
+ * chips with FPGA front ends.
+ * 
+ * Ok the logic behind the 8390 is very simple:
+ * 
+ * Things to know
+ * - IRQ delivery is asynchronous to the PCI bus
+ * - Blocking the local CPU IRQ via spin locks was too slow
+ * - The chip has register windows needing locking work
+ * 
+ * So the path was once (I say once as people appear to have changed it
+ * in the mean time and it now looks rather bogus if the changes to use
+ * disable_irq_nosync_irqsave are disabling the local IRQ)
+ * 
+ * 
+ * Take the page lock
+ * Mask the IRQ on chip
+ * Disable the IRQ (but not mask locally- someone seems to have
+ * broken this with the lock validator stuff)
+ * [This must be _nosync as the page lock may otherwise
+ * deadlock us]
+ * Drop the page lock and turn IRQs back on
+ *  

Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-26 Thread Marcin Ślusarz

2007/7/25, Thomas Gleixner [EMAIL PROTECTED]:

(...)


I've tested Jarek's patch, 2 Ingo's patches (2nd and 3rd) and Thomas'
patch (one patch at time of course) - all of them fixed the problem,
but the last one flooded my logs with "Skip resend for irq 17". All
tests were done on 2.6.21.3.

I wanted to test them all on 2.6.22.1, but I didn't have enough time.
I've verified only that 2.6.22.1 has the same problem. I can test it
later, but I can report results back at beginning of next week.

Regards,
Marcin Slusarz


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-25 Thread Jarek Poplawski
On Wed, Jul 25, 2007 at 02:19:31AM +0200, Thomas Gleixner wrote:
 On Tue, 2007-07-24 at 22:04 +0200, Ingo Molnar wrote:
  Marcin, could you try the patch below too? [without having any other 
  patch applied.] It basically turns the critical section into an irqs-off 
  critical section and thus checks whether your problem is related to that 
  particular area of code.
  
 
 I read back on this thread and I think the problem is somewhere else:

So do I. Of course, I certainly miss most of the details, but I can't
imagine how Ingo's patch from yesterday couldn't work - unless
Marcin's test wasn't long enough...

IMHO, the main problem is that such delicate things shouldn't be
changed this way. If current ideas work for Marcin they will probably
break other boxes. Very similar symptoms were reported before Ingo's
patch too, so it looks like this place is very fragile. If such
things could happen:

(from: arch/i386/kernel/io_apic.c)
 static void ack_ioapic_quirk_irq(unsigned int irq)
 ...
 /*
  * It appears there is an erratum which affects at least version 0x11
  * of I/O APIC (that's the 82093AA and cores integrated into various
  * chipsets).  Under certain conditions a level-triggered interrupt is
  * erroneously delivered as edge-triggered one but the respective IRR
  * bit gets set nevertheless.  As a result the I/O unit expects an EOI
  * message but it will never arrive and further interrupts are blocked
  * from the source.  The exact reason is so far unknown, but the
  * phenomenon was observed when two consecutive interrupt requests
  * from a given source get delivered to the same CPU and the source is
  * temporarily disabled in between.
...

there is no reason to think this is all.

I can also see this comment in arch/x86_64/kernel/io_apic.c:

 static void setup_IO_APIC_irq(int apic, int pin, unsigned int irq,
   int trigger, int polarity)
...
/* Mask level triggered irqs.
 * Use IRQ_DELAYED_DISABLE for edge triggered irqs.
 */

It seems somebody has seen a difference, probably after testing,
but it wasn't respected.

I also presume the ne2k/lib8390.c solution could be a result of real-life
experience, and I don't think Marcin's tests can be enough here.

So, my point is that such places first of all need some documented
knobs in config or elsewhere, which make it possible for users to
easily go back to the previous method (i.e. from 2.6.21 to 2.6.20 here).

Regards,
Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-25 Thread Jarek Poplawski
On Wed, Jul 25, 2007 at 02:19:31AM +0200, Thomas Gleixner wrote:
...
 Looking into the IO_APIC code, the resend via send_IPI_self() happens
 unconditionally. So the resend is done for level and edge interrupts.
 This makes the problem more mysterious.
 
 The code in question lib8390.c does
 
   disable_irq();
   fiddle_with_the_network_card_hardware()
   enable_irq();
...
 
 No idea how this affects the network card, as the code there must be
 able to handle interrupts, which are not originated from the card due to
 interrupt sharing.

I think Ingo could still be right with yesterday's patch!
The comment at the beginning points out that this is done like that
because of the chip's slowness. And problems with timing are mysterious.

On the other hand, the author of this code didn't use spin_lock_irqsave
for some reason, probably after testing this option too. So, I hope
this is the right path, but alas, I'm not sure this patch will
prove this 100%.

Anyway, in my opinion this situation where interrupts could/have_to
be used for such strange things should confirm the need of more
options for handling irqs individually.

Thanks,
Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-25 Thread Alan Cox
  The code in question lib8390.c does
  
  disable_irq();
  fiddle_with_the_network_card_hardware()
  enable_irq();
 ...
  
  No idea how this affects the network card, as the code there must be
  able to handle interrupts, which are not originated from the card due to
  interrupt sharing.
 
 I think Ingo could still be right with yesterday's patch!
 The comment at the beginning points out that this is done like that
 because of the chip's slowness. And problems with timing are mysterious.
 
 On the other hand, the author of this code didn't use spin_lock_irqsave
 for some reason, probably after testing this option too. So, I hope
 this is the right path, but alas, I'm not sure this patch will
 prove this 100%.

The author (me) didn't use spin_lock_irqsave because the slowness of the
card means that approach caused horrible problems like losing serial data
at 38400 baud on some chips. Remember many 8390 nics on PCI were ISA
chips with FPGA front ends.

 Anyway, in my opinion this situation where interrupts could/have_to
 be used for such strange things should confirm the need of more
 options for handling irqs individually.

Ok the logic behind the 8390 is very simple:

Things to know
- IRQ delivery is asynchronous to the PCI bus
- Blocking the local CPU IRQ via spin locks was too slow
- The chip has register windows needing locking work

So the path was once (I say once as people appear to have changed it
in the mean time and it now looks rather bogus if the changes to use
disable_irq_nosync_irqsave are disabling the local IRQ)


Take the page lock
Mask the IRQ on chip
Disable the IRQ (but not mask locally- someone seems to have
broken this with the lock validator stuff)
[This must be _nosync as the page lock may otherwise
deadlock us]
Drop the page lock and turn IRQs back on

At this point an existing IRQ may still be running but we can't
get a new one

Take the lock (so we know the IRQ has terminated) but don't mask
the IRQs on the processor
Set irqlock [for debug]

Transmit (slow as )

re-enable the IRQ


We have to use disable_irq because otherwise you will get delayed
interrupts on the APIC bus deadlocking the transmit path.

Quite hairy but the chip simply wasn't designed for SMP and you can't
even ACK an interrupt without risking corrupting other parallel
activities on the chip.

Alan
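
The sequence described above can be written down as a small user-space
model. This is a sketch only: struct fake_nic and every function in it are
hypothetical stand-ins for the page lock, the chip's mask register and the
interrupt line, not the real lib8390 state. The assertions check the
ordering the description relies on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of this is the real lib8390 state. */
struct fake_nic {
    bool page_lock;     /* protects the chip's register windows */
    bool chip_masked;   /* IRQ sources masked on the chip itself */
    bool line_disabled; /* IRQ line disabled at the controller */
    bool irqlock;       /* debug flag: transmit owns the chip */
    int  sent;          /* frames pushed to the chip */
};

static void lock_page(struct fake_nic *n)
{
    assert(!n->page_lock);
    n->page_lock = true;
}

static void unlock_page(struct fake_nic *n)
{
    assert(n->page_lock);
    n->page_lock = false;
}

/* The slow part: needs the lock held and both levels of masking. */
static void slow_transmit(struct fake_nic *n)
{
    assert(n->page_lock && n->line_disabled && n->chip_masked);
    n->sent++;
}

static void fake_start_xmit(struct fake_nic *n)
{
    lock_page(n);             /* take the page lock */
    n->chip_masked = true;    /* mask the IRQ on chip */
    n->line_disabled = true;  /* disable the line; _nosync, lock is held */
    unlock_page(n);           /* a handler already running may finish */

    lock_page(n);             /* once taken, that handler has terminated */
    n->irqlock = true;        /* debug marker */
    slow_transmit(n);         /* CPU interrupts stay enabled throughout */
    n->irqlock = false;
    n->chip_masked = false;   /* turn chip interrupts back on */
    unlock_page(n);
    n->line_disabled = false; /* re-enable the IRQ line */
}
```

The point of the two-stage locking is that the slow transmit never runs
with interrupts blocked on the local CPU; only this one IRQ line is off.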


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-24 Thread Jarek Poplawski
On Mon, Jul 23, 2007 at 07:44:58AM +0200, Marcin Ślusarz wrote:
 Ok, I've bisected this problem and found that this patch broke my NIC:
 
 76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
 commit 76d2160147f43f982dfe881404cfde9fd0a9da21
 Author: Ingo Molnar [EMAIL PROTECTED]
 Date:   Fri Feb 16 01:28:24 2007 -0800
 
[PATCH] genirq: do not mask interrupts by default
 
Never mask interrupts immediately upon request.  Disabling interrupts in
high-performance codepaths is rare, and on the other hand this change
could recover lost edges (or even other types of lost interrupts) by
conservatively only masking interrupts after they happen.  (NOTE: with
this change the highlevel irq-disable code still soft-disables this IRQ
line - and if such an interrupt happens then the IRQ flow handler keeps
the IRQ masked.)
 
Mark i8259A controllers as 'never loses an edge'.
 
Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
Cc: Thomas Gleixner [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Linus Torvalds [EMAIL PROTECTED]

So, it seems nobody (except the users) cares...

BTW, maybe somebody should create something like a "Network Card
Producers Made Rich on Unnecessarily Changed Cards" Linux Foundation?:

On Fri, Jun 29, 2007 at 10:50:20AM +0200, Jean-Baptiste Vignaud wrote:
...
 2) changed the 3com cards
 i replaced by two cards,
 01:06.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
 01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
 
 reinstalled and stressed the network (small download from a laptop) and :
 
 Jun 29 09:34:10 loki kernel: NETDEV WATCHDOG: eth0: transmit timed out
 Jun 29 09:34:51 loki last message repeated 14 times
 Jun 29 09:35:18 loki last message repeated 8 times

...Of course, no response from any serious developer to this either.

BTW #2: I wonder how true this is (after the above-mentioned patch):

From include/linux/irq.h:
 /**
  * struct irq_chip - hardware interrupt chip descriptor
...
  * @disable:  disable the interrupt (defaults to chip->mask if NULL)
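
The question can be illustrated with a small user-space model. It is a sketch only: the struct and helpers below are hypothetical stand-ins, and it assumes that a NULL .disable is filled in with a default handler whose body is empty (as default_disable() is after the patch in question).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct irq_chip. */
struct chip_model {
	void (*mask)(unsigned int irq);
	void (*disable)(unsigned int irq);
};

bool line_masked[32];   /* model of the per-line mask state in the controller */

void model_mask(unsigned int irq)
{
	assert(irq < 32);
	line_masked[irq] = true;
}

/* Empty on purpose: models a default disable handler that does nothing. */
void model_default_disable(unsigned int irq)
{
	(void)irq;
}

/* Assumption: a NULL .disable is replaced by the default handler. */
void model_set_defaults(struct chip_model *chip)
{
	if (chip->disable == NULL)
		chip->disable = model_default_disable;
}
```

In this model a chip that leaves .disable NULL ends up with a disable call that never touches the mask, while a chip that points .disable at its own mask routine masks the line immediately. If the assumption holds, the "defaults to chip->mask" comment no longer describes the behaviour.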

Regards,
Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-24 Thread Ingo Molnar

* Marcin Ślusarz [EMAIL PROTECTED] wrote:

 Ok, I've bisected this problem and found that this patch broke my NIC:
 
 76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
 commit 76d2160147f43f982dfe881404cfde9fd0a9da21
 Author: Ingo Molnar [EMAIL PROTECTED]
 Date:   Fri Feb 16 01:28:24 2007 -0800
 
[PATCH] genirq: do not mask interrupts by default

thanks for tracking it down! Could you try the patch below (on top of an 
otherwise unmodified kernel)? This tests the theory that the problem 
is related to the disable_irq_nosync() call in the ne2k driver's xmit 
path. Does this solve the hangs too?

Ingo

Index: linux/kernel/irq/manage.c
===
--- linux.orig/kernel/irq/manage.c
+++ linux/kernel/irq/manage.c
@@ -102,7 +102,19 @@ void disable_irq_nosync(unsigned int irq
 	spin_lock_irqsave(&desc->lock, flags);
 	if (!desc->depth++) {
 		desc->status |= IRQ_DISABLED;
-		desc->chip->disable(irq);
+		/*
+		 * the _nosync variant of irq-disable suggests that the
+		 * caller is not worried about concurrency but about the
+		 * ordering of the irq flow itself. (such as hardware
+		 * getting confused about certain, normally valid irq
+		 * handling sequences.) So if the default disable handler
+		 * is in place then try the more conservative masking
+		 * instead:
+		 */
+		if (desc->chip->disable == default_disable && desc->chip->mask)
+			desc->chip->mask(irq);
+		else
+			desc->chip->disable(irq);
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
 }


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-24 Thread Ingo Molnar

* Ingo Molnar [EMAIL PROTECTED] wrote:

 thanks for tracking it down! Could you try the patch below (ontop an 
 otherwise unmodified kernel)? This tests the theory whether the 
 problem is related to the disable_irq_nosync() call in the ne2k 
 driver's xmit path. Does this solve the hangs too?

please try the patch below instead.

Ingo

Index: linux/kernel/irq/chip.c
===
--- linux.orig/kernel/irq/chip.c
+++ linux/kernel/irq/chip.c
@@ -231,7 +231,7 @@ static void default_enable(unsigned int 
 /*
  * default disable function
  */
-static void default_disable(unsigned int irq)
+void default_disable(unsigned int irq)
 {
 }
 
Index: linux/kernel/irq/internals.h
===
--- linux.orig/kernel/irq/internals.h
+++ linux/kernel/irq/internals.h
@@ -10,6 +10,8 @@ extern void irq_chip_set_defaults(struct
 /* Set default handler: */
 extern void compat_irq_chip_set_default_handler(struct irq_desc *desc);
 
+extern void default_disable(unsigned int irq);
+
 #ifdef CONFIG_PROC_FS
 extern void register_irq_proc(unsigned int irq);
 extern void register_handler_proc(unsigned int irq, struct irqaction *action);
Index: linux/kernel/irq/manage.c
===
--- linux.orig/kernel/irq/manage.c
+++ linux/kernel/irq/manage.c
@@ -102,7 +102,19 @@ void disable_irq_nosync(unsigned int irq
 	spin_lock_irqsave(&desc->lock, flags);
 	if (!desc->depth++) {
 		desc->status |= IRQ_DISABLED;
-		desc->chip->disable(irq);
+		/*
+		 * the _nosync variant of irq-disable suggests that the
+		 * caller is not worried about concurrency but about the
+		 * ordering of the irq flow itself. (such as hardware
+		 * getting confused about certain, normally valid irq
+		 * handling sequences.) So if the default disable handler
+		 * is in place then try the more conservative masking
+		 * instead:
+		 */
+		if (desc->chip->disable == default_disable && desc->chip->mask)
+			desc->chip->mask(irq);
+		else
+			desc->chip->disable(irq);
 	}
 	spin_unlock_irqrestore(&desc->lock, flags);
 }


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-24 Thread Linus Torvalds


On Tue, 24 Jul 2007, Ingo Molnar wrote:
 
 please try the patch below instead.

I'm hoping this is just a "let's see if the behavior changes" patch, not 
something that you think should be applied if it fixes something?

This patch looks like it is trying to paper over (rather than fix) some 
possible bug in the "->disable" logic. Makes sense as a "let's see if it's 
that kind of thing", but not as a "let's fix it".

Or am I missing something?

Linus


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-24 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

 On Tue, 24 Jul 2007, Ingo Molnar wrote:
  
  please try the patch below instead.
 
 I'm hoping this is just a "let's see if the behavior changes" patch, 
 not something that you think should be applied if it fixes something?
 
 This patch looks like it is trying to paper over (rather than fix) 
 some possible bug in the "->disable" logic. Makes sense as a "let's 
 see if it's that kind of thing", but not as a "let's fix it".
 
 Or am I missing something?

yeah - it's a totally bad and unacceptable hack (i realized how bad it 
was when i wrote up that comment section ...), i just wanted to see 
which portion of ne2k/lib8390.c is sensitive to whether an irq 
line is masked or not. The patch has no SOB line either.

the current best fix forward is to undo my original change, unless we 
find a better fix for this problem. (Note that the other patches posted 
in this thread are broken too: they only mask the irq but don't reliably 
unmask it.)

here's the current method of handling irqs for Marcin's card:

17: 12   IO-APIC-fasteoi   eth1, eth0

and fasteoi is a really simple sequence: no masking/unmasking by the 
flow handler itself but a NOP at entry and an APIC-EOI at the end. The 
disable/enable irq thing should thus have minimal effect if done within 
an irq handler.

now ne2k does something uncommon: for xmit (which is normally done 
outside of irq handlers) it will disable_irq_nosync()/enable_irq() the 
interrupt. It does it to exclude the handler from _that_ CPU, but due to 
the _nosync it does not exclude it from any other CPUs. So that's a bit 
weird already.

just in case, i've just re-checked all the genirq bits that change 
IRQ_DISABLED (that bit accidentally clear would be the only way to truly 
allow an IRQ handler to interrupt the disable_irq_nosync() critical 
section on that CPU) - but i can see no way for that to happen: we 
unconditionally detect and report unbalanced and underflowing 
irq_desc->depth, and the only other place (besides enable_irq()) that 
clears IRQ_DISABLED is __set_irq_handler(), and on x86 that is only used 
during bootup.
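
The depth accounting mentioned above can be sketched as a small user-space model. This is a simplified illustration, not the genirq code: the struct is a hypothetical stand-in for the relevant irq_desc fields, and the warning is reduced to a counter plus a message on stderr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for the relevant irq_desc fields. */
struct irq_model {
	unsigned int depth;      /* nesting count of disable calls */
	bool disabled;           /* models the IRQ_DISABLED status bit */
	unsigned int unbalanced; /* number of unbalanced enables seen */
};

void depth_disable(struct irq_model *d)
{
	/* Only the first disable in a nest sets the flag. */
	if (d->depth++ == 0)
		d->disabled = true;
}

void depth_enable(struct irq_model *d)
{
	switch (d->depth) {
	case 0:
		/* More enables than disables: report it, never underflow. */
		d->unbalanced++;
		fprintf(stderr, "unbalanced enable\n");
		break;
	case 1:
		/* Last enable of a nest clears the flag. */
		d->disabled = false;
		/* fall through */
	default:
		d->depth--;
	}
	/* The flag is set exactly while at least one disable is outstanding. */
	assert(d->disabled == (d->depth > 0));
}
```

With this bookkeeping the flag can only be clear while a disable is outstanding if something other than the enable path clears it, which is the case being ruled out above.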

Marcin, could you try the patch below too? [without having any other 
patch applied.] It basically turns the critical section into an irqs-off 
critical section and thus checks whether your problem is related to that 
particular area of code.

Ingo

Index: linux/drivers/net/lib8390.c
===
--- linux.orig/drivers/net/lib8390.c
+++ linux/drivers/net/lib8390.c
@@ -297,9 +297,7 @@ static int ei_start_xmit(struct sk_buff 
 	 *	Slow phase with lock held.
 	 */
 
-	disable_irq_nosync_lockdep_irqsave(dev->irq, &flags);
-
-	spin_lock(&ei_local->page_lock);
+	spin_lock_irqsave(&ei_local->page_lock, flags);
 
 	ei_local->irqlock = 1;
 
@@ -376,8 +374,7 @@ static int ei_start_xmit(struct sk_buff 
 	ei_local->irqlock = 0;
 	ei_outb_p(ENISR_ALL, e8390_base + EN0_IMR);
 
-	spin_unlock(&ei_local->page_lock);
-	enable_irq_lockdep_irqrestore(dev->irq, &flags);
+	spin_unlock_irqrestore(&ei_local->page_lock, flags);
 
 	dev_kfree_skb (skb);
 	ei_local->stat.tx_bytes += send_length;


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-24 Thread Thomas Gleixner
On Tue, 2007-07-24 at 22:04 +0200, Ingo Molnar wrote:
 Marcin, could you try the patch below too? [without having any other 
 patch applied.] It basically turns the critical section into an irqs-off 
 critical section and thus checks whether your problem is related to that 
 particular area of code.
 

I read back on this thread and I think the problem is somewhere else:

delayed disable relies on the ability to re-trigger the interrupt in the
case that a real interrupt happens after the software disable was set.
In this case we actually disable the interrupt on the hardware level
_after_ it occurred.

On enable_irq, we need to re-trigger the interrupt. On i386 this relies
on a hardware resend mechanism (send_IPI_self()). 

Actually we only need the resend for edge type interrupts. Level type
interrupts come back once enable_irq() re-enables the interrupt line.
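
The mechanism described in the paragraphs above can be sketched as a user-space model of a single interrupt line. It is an illustration under assumptions, not kernel code: all names are hypothetical, and the level-triggered case assumes the device keeps the line asserted until its handler has run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one interrupt line; not kernel code. */
struct line_model {
	bool soft_disabled;   /* set by disable, no hardware access */
	bool hw_masked;       /* masked at the interrupt controller */
	bool pending;         /* an interrupt arrived while soft-disabled */
	bool level_triggered; /* device holds the line asserted until serviced */
	int handled;          /* times the driver's handler ran */
	int resends;          /* software re-triggers issued on enable */
};

void line_disable(struct line_model *l)
{
	l->soft_disabled = true;   /* delayed disable: hardware untouched */
}

/* Called when the hardware delivers an interrupt on this line. */
void line_interrupt(struct line_model *l)
{
	assert(!l->hw_masked);     /* a masked line delivers nothing */
	if (l->soft_disabled) {
		l->pending = true;
		l->hw_masked = true;   /* mask only now, after it occurred */
		return;
	}
	l->handled++;
}

void line_enable(struct line_model *l)
{
	l->soft_disabled = false;
	l->hw_masked = false;
	if (!l->pending)
		return;
	l->pending = false;
	if (!l->level_triggered)
		l->resends++;          /* the edge is gone: replay it in software */
	/* Level: the line is still asserted, so the hardware delivers it again. */
	line_interrupt(l);
}
```

In both cases the handler runs exactly once after the enable; the only difference in the model is whether a software resend was needed to get there.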

I assume that the interrupt in question is level triggered because it is
shared and above the legacy irqs 0-15:

17: 12   IO-APIC-fasteoi   eth1, eth0

Looking into the IO_APIC code, the resend via send_IPI_self() happens
unconditionally. So the resend is done for level and edge interrupts.
This makes the problem more mysterious.

The code in question (lib8390.c) does

disable_irq();
fiddle_with_the_network_card_hardware();
enable_irq();

The fiddle_with_the_network_card_hardware() might cause interrupts,
which are cleared in the same code path again.

Marcin found that when he disables the irq line on the hardware level
(removing the delayed disable) the card is kept alive.

So the difference is that we can get a resend on enable_irq, when an
interrupt happens while we are in the disabled region.

No idea how this affects the network card, as the code there must be
able to handle interrupts, which are not originated from the card due to
interrupt sharing.

Marcin, can you please try the patch below ? It's just a debugging aid
to gather some more data about that problem.

If the patch fixes the problem, then we should try to disable the resend
mechanism for non-edge-type irq lines on the irq_chip level (i.e. the
IOAPIC code).

Thanks,

tglx

--- linux-2.6.orig/kernel/irq/resend.c
+++ linux-2.6/kernel/irq/resend.c
@@ -62,6 +62,15 @@ void check_irq_resend(struct irq_desc *desc, unsigned int irq)
 	 */
 	desc->chip->enable(irq);
 
+	/*
+	 * Temporary hack to figure out more about the problem, which
+	 * is causing the ancient network cards to die.
+	 */
+	if (desc->handle_irq != handle_edge_irq) {
+		printk(KERN_DEBUG "Skip resend for irq %u\n", irq);
+		return;
+	}
+
 	if ((status & (IRQ_PENDING | IRQ_REPLAY)) == IRQ_PENDING) {
 		desc->status = (status & ~IRQ_PENDING) | IRQ_REPLAY;
 




Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-23 Thread Jarek Poplawski
On Mon, Jul 23, 2007 at 07:44:58AM +0200, Marcin Ślusarz wrote:
 Ok, I've bisected this problem and found that this patch broke my NIC:

Congratulations!

 
 76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
 commit 76d2160147f43f982dfe881404cfde9fd0a9da21
 Author: Ingo Molnar [EMAIL PROTECTED]
 Date:   Fri Feb 16 01:28:24 2007 -0800
 
[PATCH] genirq: do not mask interrupts by default
...
 So I cooked patch like below and everything is working fine (so far)
 
 Fix default_disable interrupt function (broken by [PATCH] genirq: do
 not mask interrupts by default) - revert removal of codepath which was
 invoked when removed flag (IRQ_DELAYED_DISABLE) was NOT set
 
 Signed-off-by: Marcin Slusarz [EMAIL PROTECTED]
 ---
 diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
 index 76a9106..0bb23cd 100644
 --- a/kernel/irq/chip.c
 +++ b/kernel/irq/chip.c
 @@ -230,6 +230,8 @@ static void default_enable(unsigned int irq)
  */
 static void default_disable(unsigned int irq)
 {
 + struct irq_desc *desc = irq_desc + irq;
 + desc->chip->mask(irq);
 }
 
 /*

I think your patch points to the source of the problem very well
and would help many people, but it looks too arbitrary for
those who didn't have such problems. It seems the problem was mainly
on x86_64, so maybe something like the patch below would be enough?

Cheers,
Jarek P.

PS: not tested!

---

diff -Nurp 2.6.22-/arch/x86_64/kernel/io_apic.c 2.6.22/arch/x86_64/kernel/io_apic.c
--- 2.6.22-/arch/x86_64/kernel/io_apic.c	2007-07-09 01:32:17.0 +0200
+++ 2.6.22/arch/x86_64/kernel/io_apic.c	2007-07-23 10:33:05.0 +0200
@@ -1427,6 +1427,7 @@ static struct irq_chip ioapic_chip __rea
 	.name		= "IO-APIC",
 	.startup	= startup_ioapic_irq,
 	.mask		= mask_IO_APIC_irq,
+	.disable	= mask_IO_APIC_irq,
 	.unmask		= unmask_IO_APIC_irq,
 	.ack		= ack_apic_edge,
 	.eoi		= ack_apic_level,


Re: 2.6.20-2.6.21 - networking dies after random time

2007-07-22 Thread Marcin Ślusarz

Ok, I've bisected this problem and found that this patch broke my NIC:

76d2160147f43f982dfe881404cfde9fd0a9da21 is first bad commit
commit 76d2160147f43f982dfe881404cfde9fd0a9da21
Author: Ingo Molnar [EMAIL PROTECTED]
Date:   Fri Feb 16 01:28:24 2007 -0800

   [PATCH] genirq: do not mask interrupts by default

   Never mask interrupts immediately upon request.  Disabling interrupts in
   high-performance codepaths is rare, and on the other hand this change could
   recover lost edges (or even other types of lost interrupts) by
   conservatively only masking interrupts after they happen.  (NOTE: with
   this change the highlevel irq-disable code still soft-disables this IRQ
   line - and if such an interrupt happens then the IRQ flow handler keeps
   the IRQ masked.)

   Mark i8529A controllers as 'never loses an edge'.

   Signed-off-by: Ingo Molnar [EMAIL PROTECTED]
   Cc: Thomas Gleixner [EMAIL PROTECTED]
   Signed-off-by: Andrew Morton [EMAIL PROTECTED]
   Signed-off-by: Linus Torvalds [EMAIL PROTECTED]

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=76d2160147f43f982dfe881404cfde9fd0a9da21

After reverting it on top of 2.6.21.3 (with
d7e25f3394ba05a6d64cb2be42c2765fe72ea6b2 - [PATCH] genirq: remove
IRQ_DISABLED (which meant removing IRQ_DELAYED_DISABLE)), the problem
didn't show up :)
(http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d7e25f3394ba05a6d64cb2be42c2765fe72ea6b2)

So I cooked up the patch below and everything is working fine (so far)

Fix default_disable interrupt function (broken by [PATCH] genirq: do
not mask interrupts by default) - revert removal of codepath which was
invoked when removed flag (IRQ_DELAYED_DISABLE) was NOT set

Signed-off-by: Marcin Slusarz [EMAIL PROTECTED]
---
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 76a9106..0bb23cd 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -230,6 +230,8 @@ static void default_enable(unsigned int irq)
 */
static void default_disable(unsigned int irq)
{
+   struct irq_desc *desc = irq_desc + irq;
+   desc->chip->mask(irq);
}

/*

(Sorry for whitespace damage, but I have to send it from webmail :|)
(I'm a kernel noob, so don't kill me if my patch is wrong ;)
ps: Here is the beginning of this thread: http://lkml.org/lkml/2007/6/16/182


Regards,
Marcin Slusarz


Re: 2.6.20-2.6.21 - networking dies after random time

2007-06-29 Thread Jean-Baptiste Vignaud
Update...
I did 2 tests :

1)  booted with option acpi=off
It booted correctly, I managed to get some load on one of the cards and after a 
while (10 minutes I guess) the timeout occurred. Side effect: at the same moment 
the SATA controllers somehow lost control of the disks and the RAID 5 array on 
the system crashed hard. I have no traces as I was unable to rebuild it (and I 
tried a lot of extreme & voodoo methods).

2) changed the 3com cards
i replaced by two cards,
01:06.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)

reinstalled and stressed the network (small download from a laptop) and :

Jun 29 09:34:10 loki kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 29 09:34:51 loki last message repeated 14 times
Jun 29 09:35:18 loki last message repeated 8 times

so it seems to be a more generic problem.

(I've updated the Fedora bugzilla as well)

did not test the [PATCH] 8139cp dev->tx_timeout yet.

JB


 On Tue, Jun 26, 2007 at 04:24:07PM +0200, Jean-Baptiste Vignaud wrote:
  Hello, i have a very similar problem with 2.6.21 also;
  
  2 3com NICs and they are failling randomly.
  
  The kernel is a basic fedora 7 kernel (2.6.21-1.3228.fc7)
  I found a bug report and added details here : 
  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=243960
  
  I'm not subcribed on this list, so please cc me if there is any questions.
  
  JB
  
   On Tue, Jun 26, 2007 at 08:10:17AM +0200, Marcin Ślusarz wrote:
   ...
I reproduced it on minimal config:
 ...
   We know your hardware should be OK - since it was fine with 2.6.20.
 ...
 
 It looks like there is something common in the air...
 
 Marcin: ne2k_pci with 8390, Jean: 3com, and now I see
 similar problem with 8139cp too (plus some ideas):
 
 http://marc.info/?l=linux-netdev&m=118293314109648&w=2
 
 So, you probably should wait a little & look for new patches here.
 
 Cheers,
 Jarek P.
 



Re: 2.6.20-2.6.21 - networking dies after random time

2007-06-29 Thread Jarek Poplawski
On Fri, Jun 29, 2007 at 10:50:20AM +0200, Jean-Baptiste Vignaud wrote:
 Update...
 I did 2 tests :
 
 1)  booted with option acpi=off
 It booted correctly, i managed to get some load on one of the card
 and after a while (10 minutes i guess) the Timeout occurs. Side effect,
 at the same moment the sata contolers lost control of the disks somehow
 and the raid 5 array on the system crashed hard. I have no traces as i
 was unable to rebuild it (and i tried a lot of extreme  voodoo methods).

I think the main option: acpi=on is usually needed.

If you guys are not exhausted yet, I think you could try to
turn off (or change to something else) most of the options from
"Processor type and features", and maybe something below "PCI
support". But there are many new options which can't be turned
off so easily, so there is not much hope...

 
 2) changed the 3com cards
 i replaced by two cards,
 01:06.0 Ethernet controller: VIA Technologies, Inc. VT6102 [Rhine-II] (rev 42)
 01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8029(AS)
 
 reinstalled and stressed the network (small download from a laptop) and :
 
 Jun 29 09:34:10 loki kernel: NETDEV WATCHDOG: eth0: transmit timed out
 Jun 29 09:34:51 loki last message repeated 14 times
 Jun 29 09:35:18 loki last message repeated 8 times
 
 so it seems to be a more generic problem.

I wonder if you tried to change the place - I've read this
advice many times. And maybe it would be better to try with
one card at first?

It seems there are some patches for dev->tx_timeout but they
look like they fix the symptoms only. Let's wait...

Cheers,
Jarek P.

PS: Marcin - your last message wasn't plain text - so probably
dumped by kernel lists. 


Re: 2.6.20-2.6.21 - networking dies after random time

2007-06-26 Thread Jarek Poplawski
On Tue, Jun 26, 2007 at 08:10:17AM +0200, Marcin Ślusarz wrote:
... 
 I reproduced it on minimal config:
...

Hm... This method is useful if you can find a minimal config
with which the bug cannot be reproduced. Then you can add more
until the bug is back. Of course, this takes time...

We know your hardware should be OK - since it was fine with 2.6.20.
We don't know how much your configs (kernel & apps) have changed.
Sometimes a change of kernel needs some apps to be recompiled too.
That's why it could be useful to try 2.6.21 from a live distro to
find out if it's really the kernel's fault.

And, alas, this log doesn't seem to tell us anything new...

Regards,
Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-06-22 Thread Marcin Ślusarz

2007/6/19, Jarek Poplawski [EMAIL PROTECTED]:

On Mon, Jun 18, 2007 at 08:10:00AM -0700, Stephen Hemminger wrote:
 On Mon, 18 Jun 2007 13:08:49 +0200
 Jarek Poplawski [EMAIL PROTECTED] wrote:

  On 16-06-2007 23:35, Marcin Ślusarz wrote:
   hi
   after upgrading kernel from 2.6.20 to 2.6.21.3 i'm experiencing really
   strange problem - my _both_ network cards dies after random uptime -
   sometimes it's a few minutes, sometimes hours, sometimes it does not
   happen for a couple of days...
   today it happened for the first time without nvidia module and almost
   immediately after system start
  
   here is the output of some commands which might help debug this:
...
  It looks like skge driver enables different device than probbed.
  Maybe you've something old/wrong about eth0/eth1 in /etc configs?

 More likely it is just user level device renaming. Most distro's
 rename devices (if needed) using udev.

On the other hand it's interesting, why it's not always, and why
sometimes it took so long?


I'm sorry for delay, but i was offline for the last week and probably
will for some time :|

When I disable on-board network card in BIOS (controlled by skge)
ne2k-pci card is still locking up. So I think it's strictly ne2k-pci
card bug. I made some tests and I know how to reproduce it fast (on my
machine) - just make some heavy network traffic...

As I'm offline right now I can't bisect it, but i turned on more
debugging, maybe you can deduce something...

[0.00] Linux version 2.6.21.3 ([EMAIL PROTECTED]) (gcc version 4.1.2
(Gentoo 4.1.2)) #4 PREEMPT Wed Jun 20 22:37:05 CEST 2007
[0.00] Command line: root=/dev/sda5 video=vesafb vga=794
[0.00] BIOS-provided physical RAM map:
[0.00]  BIOS-e820:  - 0009fc00 (usable)
[0.00]  BIOS-e820: 0009fc00 - 000a (reserved)
[0.00]  BIOS-e820: 000e4000 - 0010 (reserved)
[0.00]  BIOS-e820: 0010 - 3ffb (usable)
[0.00]  BIOS-e820: 3ffb - 3ffc (ACPI data)
[0.00]  BIOS-e820: 3ffc - 3fff (ACPI NVS)
[0.00]  BIOS-e820: 3fff - 4000 (reserved)
[0.00]  BIOS-e820: ff78 - 0001 (reserved)
[0.00] Entering add_active_range(0, 0, 159) 0 entries of 256 used
[0.00] Entering add_active_range(0, 256, 262064) 1 entries of 256 used
[0.00] end_pfn_map = 1048576
[0.00] DMI 2.3 present.
[0.00] ACPI: RSDP 000FA810, 0021 (r2 ACPIAM)
[0.00] ACPI: XSDT 3FFB0100, 003C (r1 A M I  OEMXSDT  1427
MSFT   97)
[0.00] ACPI: FACP 3FFB0290, 00F4 (r3 A M I  OEMFACP  1427
MSFT   97)
[0.00] ACPI: DSDT 3FFB03E0, 38A1 (r1  A0036 A00360011
MSFT  10D)
[0.00] ACPI: FACS 3FFC, 0040
[0.00] ACPI: APIC 3FFB0390, 004A (r1 A M I  OEMAPIC  1427
MSFT   97)
[0.00] ACPI: OEMB 3FFC0040, 003F (r1 A M I  OEMBIOS  1427
MSFT   97)
[0.00] Entering add_active_range(0, 0, 159) 0 entries of 256 used
[0.00] Entering add_active_range(0, 256, 262064) 1 entries of 256 used
[0.00] Zone PFN ranges:
[0.00]   DMA 0 - 4096
[0.00]   DMA324096 -  1048576
[0.00]   Normal1048576 -  1048576
[0.00] early_node_map[2] active PFN ranges
[0.00] 0:0 -  159
[0.00] 0:  256 -   262064
[0.00] On node 0 totalpages: 261967
[0.00]   DMA zone: 56 pages used for memmap
[0.00]   DMA zone: 2549 pages reserved
[0.00]   DMA zone: 1394 pages, LIFO batch:0
[0.00]   DMA32 zone: 3526 pages used for memmap
[0.00]   DMA32 zone: 254442 pages, LIFO batch:31
[0.00]   Normal zone: 0 pages used for memmap
[0.00] Looks like a VIA chipset. Disabling IOMMU. Override
with iommu=allowed
[0.00] ACPI: PM-Timer IO Port: 0x808
[0.00] ACPI: Local APIC address 0xfee0
[0.00] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
[0.00] Processor #0 (Bootup-CPU)
[0.00] ACPI: IOAPIC (id[0x01] address[0xfec0] gsi_base[0])
[0.00] IOAPIC[0]: apic_id 1, address 0xfec0, GSI 0-23
[0.00] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[0.00] ACPI: IRQ0 used by override.
[0.00] ACPI: IRQ2 used by override.
[0.00] ACPI: IRQ9 used by override.
[0.00] Setting APIC routing to flat
[0.00] Using ACPI (MADT) for SMP configuration information
[0.00] Nosave address range: 0009f000 - 000a
[0.00] Nosave address range: 000a - 000e4000
[0.00] Nosave address range: 000e4000 - 0010
[0.00] Allocating PCI resources starting at 5000 (gap:
4000:bf78)
[0.00] Built 1 zonelists.  Total pages: 255836
[0.00] Kernel 

Re: 2.6.20-2.6.21 - networking dies after random time

2007-06-22 Thread Jarek Poplawski
On Fri, Jun 22, 2007 at 10:56:44AM +0200, Marcin Ślusarz wrote:
...
 When I disable on-board network card in BIOS (controlled by skge)
 ne2k-pci card is still locking up. So I think it's strictly ne2k-pci
 card bug. I made some tests and I know how to reproduce it fast (on my
 machine) - just make some heavy network traffic...
...

I'm no good at hardware, but I guess this log might not be enough.
So, if nobody finds anything more sensible, maybe you can try
some of these suggestions:

- you've written it was OK with 2.6.20; it would be interesting
to check if there were any changes in config (beside new options)
or even retry 2.6.20 with current config after make oldconfig;
- during such problems it's better to try to turn off as many
unnecessary options/drivers as possible to find out if it's really
about the network driver; e.g.: no SMP, tv cards, acpi - only
basic, without options etc.;
- if possible try it with newer kernel e.g. 2.6.22-rc5;
- if possible try it with another, fresh distro (e.g. some live
CD/DVD/USB bootable);
- there was a lockdep warning from tvtime/bttv;
- try to get some more debugging (help: modinfo ne2k-pci).

Regards,
Jarek P.

PS: for anybody interested - here is the beginning of this story:
http://marc.info/?l=linux-kernel&m=118202978609968&w=2


Re: 2.6.20-2.6.21 - networking dies after random time

2007-06-18 Thread Jarek Poplawski
On 16-06-2007 23:35, Marcin Ślusarz wrote:
 hi
 after upgrading kernel from 2.6.20 to 2.6.21.3 i'm experiencing really
 strange problem - my _both_ network cards dies after random uptime -
 sometimes it's a few minutes, sometimes hours, sometimes it does not
 happen for a couple of days...
 today it happened for the first time without nvidia module and almost
 immediately after system start
 
 here is the output of some commands which might help debug this:
...
 [   21.726533] Write protecting the kernel read-only data: 1457k
 [   25.734316] ACPI: PCI Interrupt :00:0a.0[A] -> GSI 17 (level,
 low) -> IRQ 17
 [   25.734367] skge 1.10 addr 0xfab0 irq 17 chip Yukon-Lite rev 9
 [   25.734763] skge eth0: addr 00:11:d8:60:74:55
 [   25.971279] ne2k-pci.c:v1.03 9/22/2003 D. Becker/P. Gortmaker
 [   25.971282]   http://www.scyld.com/network/ne2k-pci.html
 [   25.971364] ACPI: PCI Interrupt :00:0c.0[A] -> GSI 17 (level,
 low) -> IRQ 17
 [   25.971691] eth1: Compex RL2000 found at 0xb000, IRQ 17, 
 00:80:48:DE:5E:89.
 [   26.888372] Linux video capture interface: v2.00
 [   26.906732] bttv: driver version 0.9.17 loaded
...
 [   31.659572] Adding 1020112k swap on /dev/sda2.  Priority:-1
 extents:1 across:1020112k
 [   42.681974] skge eth1: enabling interface
 [   43.228729] NET: Registered protocol family 17
 [   46.429756] Time: acpi_pm clocksource has been installed.
 [   50.743512] NETDEV WATCHDOG: eth0: transmit timed out
 [   50.743521] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=574.
...

It looks like the skge driver enables a different device than the one probed.
Maybe you've something old/wrong about eth0/eth1 in /etc configs?
You can also try with netdev= or pci= kernel parameters.
If no result - resend it, please - maybe with some debugging on
(modinfo skge). BTW - netdev seems to be preferred for this.

Regards,
Jarek P.


Re: 2.6.20-2.6.21 - networking dies after random time

2007-06-18 Thread Stephen Hemminger
On Mon, 18 Jun 2007 13:08:49 +0200
Jarek Poplawski [EMAIL PROTECTED] wrote:

 On 16-06-2007 23:35, Marcin Ślusarz wrote:
  hi
  after upgrading kernel from 2.6.20 to 2.6.21.3 i'm experiencing really
  strange problem - my _both_ network cards dies after random uptime -
  sometimes it's a few minutes, sometimes hours, sometimes it does not
  happen for a couple of days...
  today it happened for the first time without nvidia module and almost
  immediately after system start
  
  here is the output of some commands which might help debug this:
 ...
  [   21.726533] Write protecting the kernel read-only data: 1457k
  [   25.734316] ACPI: PCI Interrupt :00:0a.0[A] -> GSI 17 (level,
  low) -> IRQ 17
  [   25.734367] skge 1.10 addr 0xfab0 irq 17 chip Yukon-Lite rev 9
  [   25.734763] skge eth0: addr 00:11:d8:60:74:55
  [   25.971279] ne2k-pci.c:v1.03 9/22/2003 D. Becker/P. Gortmaker
  [   25.971282]   http://www.scyld.com/network/ne2k-pci.html
  [   25.971364] ACPI: PCI Interrupt :00:0c.0[A] -> GSI 17 (level,
  low) -> IRQ 17
  [   25.971691] eth1: Compex RL2000 found at 0xb000, IRQ 17, 
  00:80:48:DE:5E:89.
  [   26.888372] Linux video capture interface: v2.00
  [   26.906732] bttv: driver version 0.9.17 loaded
 ...
  [   31.659572] Adding 1020112k swap on /dev/sda2.  Priority:-1
  extents:1 across:1020112k
  [   42.681974] skge eth1: enabling interface
  [   43.228729] NET: Registered protocol family 17
  [   46.429756] Time: acpi_pm clocksource has been installed.
  [   50.743512] NETDEV WATCHDOG: eth0: transmit timed out
  [   50.743521] eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=574.
 ...
 
 It looks like skge driver enables different device than probbed.
 Maybe you've something old/wrong about eth0/eth1 in /etc configs?

More likely it is just user-level device renaming. Most distros
rename devices (if needed) using udev.

 You can also try with netdev= or pci= kernel parameters.

Bad idea. 

 If no result - resend it, please - maybe with some debugging on
 (modinfo skge). BTW - netdev seems to be preferred for this.

What are the contents of /proc/interrupts?

-- 
Stephen Hemminger [EMAIL PROTECTED]
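
The /proc/interrupts question matters here because both NICs were reported on the same GSI/IRQ 17 in the boot log. As a rough illustration (not code from the thread), the sketch below lists IRQ lines that name more than one device. The function name `shared_irqs` is hypothetical, the sample rows are copied from the output posted elsewhere in this thread, and the parser assumes the chip/flow type is a single token such as `IO-APIC-fasteoi`.

```python
SAMPLE = """\
           CPU0
  0:     891160   IO-APIC-edge      timer
 17:         12   IO-APIC-fasteoi   eth1, eth0
 18:      57275   IO-APIC-fasteoi   bttv0
 21:          0   IO-APIC-fasteoi   ehci_hcd:usb1
NMI:          0
"""


def shared_irqs(text):
    """Return {irq: [device, ...]} for IRQ lines that list several devices."""
    shared = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip the CPU header and the NMI/LOC/ERR summary rows
        fields = rest.split()
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1  # skip the per-CPU counters
        # fields[i] is assumed to be the chip/flow type; the rest are devices
        devices = " ".join(fields[i + 1:])
        names = [d.strip() for d in devices.split(",") if d.strip()]
        if len(names) > 1:
            shared[int(label)] = names
    return shared


if __name__ == "__main__":
    print(shared_irqs(SAMPLE))  # {17: ['eth1', 'eth0']}
```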


Re: 2.6.20-2.6.21 - networking dies after random time

2007-06-18 Thread Jarek Poplawski
On Mon, Jun 18, 2007 at 08:10:00AM -0700, Stephen Hemminger wrote:
 On Mon, 18 Jun 2007 13:08:49 +0200
 Jarek Poplawski [EMAIL PROTECTED] wrote:
...
  It looks like the skge driver enables a different device than the one it
  probed. Maybe you have something old/wrong about eth0/eth1 in /etc configs?
 
 More likely it is just user-level device renaming. Most distros
 rename devices (if needed) using udev.

I hope you're right and the problem is already resolved, but for the
record I'd note that the original message, with quite a lot of
configs, is available on linux-kernel.

Regards,
Jarek P.

 
  You can also try with netdev= or pci= kernel parameters.
 
 Bad idea. 
 
  If no result - resend it, please - maybe with some debugging on
  (modinfo skge). BTW - netdev seems to be preferred for this.
 
 What are the contents of /proc/interrupts?

---

 On 16-06-2007 23:35, Marcin Ślusarz wrote:
...
 joi ~ # cat /proc/interrupts ; sleep 5; cat /proc/interrupts
            CPU0
   0:     891160   IO-APIC-edge      timer
   1:       2218   IO-APIC-edge      i8042
   8:          2   IO-APIC-edge      rtc
   9:          1   IO-APIC-fasteoi   acpi
  12:       9110   IO-APIC-edge      i8042
  14:          0   IO-APIC-edge      libata
  15:        122   IO-APIC-edge      libata
  17:         12   IO-APIC-fasteoi   eth1, eth0
  18:      57275   IO-APIC-fasteoi   bttv0
  20:      18810   IO-APIC-fasteoi   libata
  21:          0   IO-APIC-fasteoi   ehci_hcd:usb1
  22:      77945   IO-APIC-fasteoi   VIA8237
 NMI:          0
 LOC:     890924
 ERR:          0
            CPU0
   0:     896221   IO-APIC-edge      timer
   1:       2219   IO-APIC-edge      i8042
   8:          2   IO-APIC-edge      rtc
   9:          1   IO-APIC-fasteoi   acpi
  12:       9110   IO-APIC-edge      i8042
  14:          0   IO-APIC-edge      libata
  15:        122   IO-APIC-edge      libata
  17:         12   IO-APIC-fasteoi   eth1, eth0
  18:      57654   IO-APIC-fasteoi   bttv0
  20:      18813   IO-APIC-fasteoi   libata
  21:          0   IO-APIC-fasteoi   ehci_hcd:usb1
  22:      78421   IO-APIC-fasteoi   VIA8237
 NMI:          0
 LOC:     895984
 ERR:          0
 
 joi ~ # cat /proc/ioports
 -001f : dma1
 0020-0021 : pic1
 0040-0043 : timer0
 0050-0053 : timer1
 0060-006f : keyboard
 0070-0077 : rtc
 0080-008f : dma page reg
 00a0-00a1 : pic2
 00c0-00df : dma2
 00f0-00ff : fpu
 0170-0177 : :00:0f.1
  0170-0177 : libata
 01f0-01f7 : :00:0f.1
  01f0-01f7 : libata
 0290-0297 : pnp 00:09
 02f8-02ff : serial
 0376-0376 : :00:0f.1
  0376-0376 : libata
 03c0-03df : vesafb
 03f6-03f6 : :00:0f.1
  03f6-03f6 : libata
 03f8-03ff : serial
 0400-0407 : vt596_smbus
 0680-06ff : pnp 00:09
 0800-0803 : ACPI PM1a_EVT_BLK
 0804-0805 : ACPI PM1a_CNT_BLK
 0808-080b : ACPI PM_TMR
 0810-0815 : ACPI CPU throttle
 0820-0823 : ACPI GPE0_BLK
 0cf8-0cff : PCI conf1
 1000-10ff : :00:11.6
 a800-a8ff : :00:0a.0
  a800-a8ff : skge
 b000-b01f : :00:0c.0
  b000-b01f : ne2k-pci
 b400-b4ff : :00:0f.0
  b400-b4ff : sata_via
 b800-b80f : :00:0f.0
  b800-b80f : sata_via
 c000-c003 : :00:0f.0
  c000-c003 : sata_via
 c400-c407 : :00:0f.0
  c400-c407 : sata_via
 c800-c803 : :00:0f.0
  c800-c803 : sata_via
 d000-d007 : :00:0f.0
  d000-d007 : sata_via
 d400-d41f : :00:10.0
 d800-d81f : :00:10.1
 e000-e01f : :00:10.2
 e400-e41f : :00:10.3
 e800-e8ff : :00:11.5
  e800-e8ff : VIA8237
 fc00-fc0f : :00:0f.1
  fc00-fc0f : libata
 
 joi ~ # cat /proc/iomem
 -0009fbff : System RAM
 0009fc00-0009 : reserved
 000c-000d : pnp 00:0e
 000e4000-000f : reserved
 0010-3ffa : System RAM
  0020-0059ebc7 : Kernel code
  0059ebc8-0077248f : Kernel data
 3ffb-3ffb : ACPI Tables
 3ffc-3ffe : ACPI Non-volatile Storage
 3fff-3fff : reserved
 e800-ebff : :00:00.0
  e800-ebff : aperture
 efe0-efe00fff : :00:0d.0
  efe0-efe00fff : bttv0
 eff0-eff00fff : :00:0d.1
 f000-f9ff : PCI Bus #01
  f000-f7ff : :01:00.0
f000-f7ff : vesafb
 faa0-faa1 : :00:0a.0
 fab0-fab03fff : :00:0a.0
  fab0-fab03fff : skge
 fac0-fac07fff : :00:0c.0
 fae0-fae000ff : :00:10.4
  fae0-fae000ff : ehci_hcd
 faf0-fbff : PCI Bus #01
  faf0-faf1 : :01:00.0
  fb00-fbff : :01:00.0
 fec0-fec00fff : IOAPIC 0
  fec0-fec00fff : pnp 00:0b
 fee0-fee00fff : Local APIC
 ff78- : reserved
 
 joi ~ # cat /proc/sys/kernel/tainted
 0
...
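
The two /proc/interrupts snapshots above were taken five seconds apart, so comparing them shows which lines are still receiving interrupts. The sketch below is only an illustration under stated assumptions: the helper names `irq_counts` and `irq_deltas` are hypothetical, and the sample rows are a subset copied from the output above. A zero delta on IRQ 17 is consistent with the lost-interrupt reports, though it could also simply mean there was no traffic in that window.

```python
BEFORE = """\
  0:     891160   IO-APIC-edge      timer
 17:         12   IO-APIC-fasteoi   eth1, eth0
 18:      57275   IO-APIC-fasteoi   bttv0
LOC:     890924
"""

AFTER = """\
  0:     896221   IO-APIC-edge      timer
 17:         12   IO-APIC-fasteoi   eth1, eth0
 18:      57654   IO-APIC-fasteoi   bttv0
LOC:     895984
"""


def irq_counts(text):
    """Parse /proc/interrupts text into {label: total count over all CPUs}."""
    counts = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # e.g. the CPU header row
        total, seen = 0, False
        for field in rest.split():
            if not field.isdigit():
                break  # reached the chip type or device names
            total += int(field)
            seen = True
        if seen:
            counts[label.strip()] = total
    return counts


def irq_deltas(before, after):
    """Return {label: increase} for labels present in both snapshots."""
    a, b = irq_counts(before), irq_counts(after)
    return {label: b[label] - a[label] for label in a if label in b}


if __name__ == "__main__":
    for label, delta in irq_deltas(BEFORE, AFTER).items():
        print(f"{label:>4}: +{delta}")
```

With these rows the timer advances by 5061 and bttv0 by 379, while IRQ 17 (shared by eth1 and eth0) does not move at all.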


Re: 2.6.20-2.6.21 - networking dies after random time

2007-06-18 Thread Jarek Poplawski
On Mon, Jun 18, 2007 at 08:10:00AM -0700, Stephen Hemminger wrote:
 On Mon, 18 Jun 2007 13:08:49 +0200
 Jarek Poplawski [EMAIL PROTECTED] wrote:
 
  On 16-06-2007 23:35, Marcin Ślusarz wrote:
   hi
   after upgrading the kernel from 2.6.20 to 2.6.21.3 i'm experiencing a really
   strange problem - _both_ of my network cards die after a random uptime -
   sometimes it's a few minutes, sometimes hours, sometimes it does not
   happen for a couple of days...
   today it happened for the first time without the nvidia module and almost
   immediately after system start
   
   here is the output of some commands which might help debug this:
...
  It looks like the skge driver enables a different device than the one it
  probed. Maybe you have something old/wrong about eth0/eth1 in /etc configs?
 
 More likely it is just user-level device renaming. Most distros
 rename devices (if needed) using udev.

On the other hand, it's interesting why it doesn't always happen, and
why it sometimes took so long.

Jarek P. 