Re: net_rx_action/NAPI oops [PATCH]

2007-11-30 Thread Robert Olsson

 Hello!

 After further investigations. The bug was just in front of us...

 ifconfig down in combination with the test for || !netif_running() can 
 return a full quota and netif_rx_complete() done which causes the oops 
 in net_rx_action. Of course the load must be high enough to fill the 
 quota.

 I've found a bunch of drivers having this bug.

 Cheers.
--ro

 
Signed-off-by: Robert Olsson [EMAIL PROTECTED]


diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index cf39473..f4137ad 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3947,6 +3947,10 @@ e1000_clean(struct napi_struct *napi, int budget)
 quit_polling:
if (likely(adapter-itr_setting  3))
e1000_set_itr(adapter);
+
+   if(work_done == budget)
+   work_done--;
+   
netif_rx_complete(poll_dev, napi);
e1000_irq_enable(adapter);
}
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 4fd2e23..e43b5ca 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1408,6 +1408,9 @@ static int e1000_clean(struct napi_struct *napi, int 
budget)
if ((!tx_cleaned  (work_done  budget)) ||
   !netif_running(poll_dev)) {
 quit_polling:
+   if(work_done == budget)
+   work_done--;
+
if (adapter-itr_setting  3)
e1000_set_itr(adapter);
netif_rx_complete(poll_dev, napi);
diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
index 3021234..e3064ef 100644
--- a/drivers/net/ixgb/ixgb_main.c
+++ b/drivers/net/ixgb/ixgb_main.c
@@ -1783,6 +1783,10 @@ ixgb_clean(struct napi_struct *napi, int budget)
 
/* if no Tx and not enough Rx work done, exit the polling mode */
if((!tx_cleaned  (work_done == 0)) || !netif_running(netdev)) {
+
+   if(work_done == budget)
+   work_done--;
+
netif_rx_complete(netdev, napi);
ixgb_irq_enable(adapter);
}
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 00bc525..204f5fa 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -579,6 +579,9 @@ static int ixgbe_clean_rxonly(struct napi_struct *napi, int 
budget)
/* If no Tx and not enough Rx work done, exit the polling mode */
if ((work_done  budget) || !netif_running(netdev)) {
 quit_polling:
+   if(work_done == budget)
+   work_done--;
+
netif_rx_complete(netdev, napi);
if (!test_bit(__IXGBE_DOWN, adapter-state))
IXGBE_WRITE_REG(adapter-hw, IXGBE_EIMS,
@@ -1483,6 +1486,9 @@ static int ixgbe_clean(struct napi_struct *napi, int 
budget)
if ((!tx_cleaned  (work_done  budget)) ||
!netif_running(adapter-netdev)) {
 quit_polling:
+   if(work_done == budget)
+   work_done--;
+
netif_rx_complete(netdev, napi);
ixgbe_irq_enable(adapter);
}
diff --git a/drivers/net/e100.c b/drivers/net/e100.c
index 3dbaec6..c566491 100644
--- a/drivers/net/e100.c
+++ b/drivers/net/e100.c
@@ -1998,6 +1998,10 @@ static int e100_poll(struct napi_struct *napi, int 
budget)
 
/* If no Rx and Tx cleanup work was done, exit polling mode. */
if((!tx_cleaned  (work_done == 0)) || !netif_running(netdev)) {
+
+   if(work_done == budget)
+   work_done--;
+
netif_rx_complete(netdev, napi);
e100_enable_irq(nic);
}
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: net_rx_action/NAPI oops [PATCH]

2007-11-30 Thread Kok, Auke
Robert Olsson wrote:
  Hello!
 
  After further investigations. The bug was just in front of us...
 
  ifconfig down in combination with the test for || !netif_running() can 
  return a full quota and netif_rx_complete() done which causes the oops 
  in net_rx_action. Of course the load must be high enough to fill the 
  quota.
 
  I've found a bunch of drivers having this bug.

OK, I think however that we want to get rid of the netif_running() tests. I 
think
they're an artifact of an old bug that was fixed a long time ago and we no 
longer
need to check for that at all. Your patch (minus the formatting mistakes) looks
otherwise OK and I'll have someone do some quick testing on it.

Thanks,

Auke
 
  Cheers.
   --ro
 
  
 Signed-off-by: Robert Olsson [EMAIL PROTECTED]
 
 
 diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
 index cf39473..f4137ad 100644
 --- a/drivers/net/e1000/e1000_main.c
 +++ b/drivers/net/e1000/e1000_main.c
 @@ -3947,6 +3947,10 @@ e1000_clean(struct napi_struct *napi, int budget)
  quit_polling:
   if (likely(adapter-itr_setting  3))
   e1000_set_itr(adapter);
 +
 + if(work_done == budget)
 + work_done--;
 + 
   netif_rx_complete(poll_dev, napi);
   e1000_irq_enable(adapter);
   }
 diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
 index 4fd2e23..e43b5ca 100644
 --- a/drivers/net/e1000e/netdev.c
 +++ b/drivers/net/e1000e/netdev.c
 @@ -1408,6 +1408,9 @@ static int e1000_clean(struct napi_struct *napi, int 
 budget)
   if ((!tx_cleaned  (work_done  budget)) ||
  !netif_running(poll_dev)) {
  quit_polling:
 + if(work_done == budget)
 + work_done--;
 +
   if (adapter-itr_setting  3)
   e1000_set_itr(adapter);
   netif_rx_complete(poll_dev, napi);
 diff --git a/drivers/net/ixgb/ixgb_main.c b/drivers/net/ixgb/ixgb_main.c
 index 3021234..e3064ef 100644
 --- a/drivers/net/ixgb/ixgb_main.c
 +++ b/drivers/net/ixgb/ixgb_main.c
 @@ -1783,6 +1783,10 @@ ixgb_clean(struct napi_struct *napi, int budget)
  
   /* if no Tx and not enough Rx work done, exit the polling mode */
   if((!tx_cleaned  (work_done == 0)) || !netif_running(netdev)) {
 +
 + if(work_done == budget)
 + work_done--;
 +
   netif_rx_complete(netdev, napi);
   ixgb_irq_enable(adapter);
   }
 diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
 index 00bc525..204f5fa 100644
 --- a/drivers/net/ixgbe/ixgbe_main.c
 +++ b/drivers/net/ixgbe/ixgbe_main.c
 @@ -579,6 +579,9 @@ static int ixgbe_clean_rxonly(struct napi_struct *napi, 
 int budget)
   /* If no Tx and not enough Rx work done, exit the polling mode */
   if ((work_done  budget) || !netif_running(netdev)) {
  quit_polling:
 + if(work_done == budget)
 + work_done--;
 +
   netif_rx_complete(netdev, napi);
   if (!test_bit(__IXGBE_DOWN, adapter-state))
   IXGBE_WRITE_REG(adapter-hw, IXGBE_EIMS,
 @@ -1483,6 +1486,9 @@ static int ixgbe_clean(struct napi_struct *napi, int 
 budget)
   if ((!tx_cleaned  (work_done  budget)) ||
   !netif_running(adapter-netdev)) {
  quit_polling:
 + if(work_done == budget)
 + work_done--;
 +
   netif_rx_complete(netdev, napi);
   ixgbe_irq_enable(adapter);
   }
 diff --git a/drivers/net/e100.c b/drivers/net/e100.c
 index 3dbaec6..c566491 100644
 --- a/drivers/net/e100.c
 +++ b/drivers/net/e100.c
 @@ -1998,6 +1998,10 @@ static int e100_poll(struct napi_struct *napi, int 
 budget)
  
   /* If no Rx and Tx cleanup work was done, exit polling mode. */
   if((!tx_cleaned  (work_done == 0)) || !netif_running(netdev)) {
 +
 + if(work_done == budget)
 + work_done--;
 +
   netif_rx_complete(netdev, napi);
   e100_enable_irq(nic);
   }
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: net_rx_action/NAPI oops [PATCH]

2007-11-28 Thread Robert Olsson

Kok, Auke writes:
  
  Robert, please give that patch a try (it fixes a crash that I had here as 
  well)
  and let us know if it works for you.

 No it doesn't cure the problem I've reported

 Cheers.
--ro


 BTW. You can try to verify the problem... Let pktgen feed a couple of streams
 into some interfaces. (I use two) ifconfig down sone of the incoming 
interfaces
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: net_rx_action/NAPI oops [PATCH]

2007-11-28 Thread Robert Olsson

Stephen Hemminger writes:

  It is considered a driver bug in 2.6.24 to call netif_rx_complete (clear 
  NAPI_STATE_SCHED)
  and do a full quota. That bug already had to be fixed in other drivers,
  look like e1000 has same problem.

 From what I see the problem is not related to -poll. But it's a variant of 
the same theme
 here it's napi_disable in  _down via ifconfig down that clears 
NAPI_STATE_SCHED while
 driver delivers a full quota.
 
 Cheers
--ro
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: net_rx_action/NAPI oops [PATCH]

2007-11-28 Thread Stephen Hemminger
Would this fix it?

--- a/drivers/net/e1000/e1000_main.c2007-11-15 21:13:12.0 -0800
+++ b/drivers/net/e1000/e1000_main.c2007-11-28 08:37:03.0 -0800
@@ -630,10 +630,10 @@ e1000_down(struct e1000_adapter *adapter
 * reschedule our watchdog timer */
set_bit(__E1000_DOWN, adapter-flags);
 
+   e1000_irq_disable(adapter);
 #ifdef CONFIG_E1000_NAPI
-   napi_disable(adapter-napi);
+   napi_synchronize(adapter-napi);
 #endif
-   e1000_irq_disable(adapter);
 
del_timer_sync(adapter-tx_fifo_stall_timer);
del_timer_sync(adapter-watchdog_timer);
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: net_rx_action/NAPI oops [PATCH]

2007-11-28 Thread Robert Olsson


 No it doesn't. besides napi_disable and napi_synchronize are identical. 
 I was trying to disarm interrupts this way too. 

 The patch I did send yesterday is the only cure so-far but I don't if 
 it's 100% bullet proof either.

 I was stress-testing it patch but ran into new problems...(scheduling)
 
 Cheers.
--ro


Stephen Hemminger writes:
  Would this fix it?
  
  --- a/drivers/net/e1000/e1000_main.c 2007-11-15 21:13:12.0 -0800
  +++ b/drivers/net/e1000/e1000_main.c 2007-11-28 08:37:03.0 -0800
  @@ -630,10 +630,10 @@ e1000_down(struct e1000_adapter *adapter
* reschedule our watchdog timer */
   set_bit(__E1000_DOWN, adapter-flags);
   
  +e1000_irq_disable(adapter);
   #ifdef CONFIG_E1000_NAPI
  -napi_disable(adapter-napi);
  +napi_synchronize(adapter-napi);
   #endif
  -e1000_irq_disable(adapter);
   
   del_timer_sync(adapter-tx_fifo_stall_timer);
   del_timer_sync(adapter-watchdog_timer);
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: net_rx_action/NAPI oops [PATCH]

2007-11-27 Thread Stephen Hemminger
On Tue, 27 Nov 2007 19:52:24 +0100
Robert Olsson [EMAIL PROTECTED] wrote:

 
 Hello!
 
 I've discovered a bug while testing the new multiQ NAPI code. In hi-load 
 situations when we take down an interface we get a kernel panic. The
 oops is below.
 
 From what I see this happens when driver does napi_disable() and clears
 NAPI_STATE_SCHED. In net_rx_action there is a check for work == weight 
 a sort indirect test but that's now not enough to cover the load situation. 
 where we have NAPI_STATE_SCHED cleared by e1000_down in my case and still 
 full quota. Latest git but I'll guess the is the same in all later kernels.
 There might be different solutions... one variant is below:

It is considered a driver bug in 2.6.24 to call netif_rx_complete (clear 
NAPI_STATE_SCHED)
and do a full quota. That bug already had to be fixed in other drivers,
look like e1000 has same problem.


-- 
Stephen Hemminger [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: net_rx_action/NAPI oops [PATCH]

2007-11-27 Thread Kok, Auke
Stephen Hemminger wrote:
 On Tue, 27 Nov 2007 19:52:24 +0100
 Robert Olsson [EMAIL PROTECTED] wrote:
 
 Hello!

 I've discovered a bug while testing the new multiQ NAPI code. In hi-load 
 situations when we take down an interface we get a kernel panic. The
 oops is below.

 From what I see this happens when driver does napi_disable() and clears
 NAPI_STATE_SCHED. In net_rx_action there is a check for work == weight 
 a sort indirect test but that's now not enough to cover the load situation. 
 where we have NAPI_STATE_SCHED cleared by e1000_down in my case and still 
 full quota. Latest git but I'll guess the is the same in all later kernels.
 There might be different solutions... one variant is below:
 
 It is considered a driver bug in 2.6.24 to call netif_rx_complete (clear 
 NAPI_STATE_SCHED)
 and do a full quota. That bug already had to be fixed in other drivers,
 look like e1000 has same problem.

Stephen,

please enlighten me, can you e.g. show me a commit of other drivers where you
fixed this up?

Thanks,

Auke
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: net_rx_action/NAPI oops [PATCH]

2007-11-27 Thread Kok, Auke
Stephen Hemminger wrote:
 On Tue, 27 Nov 2007 14:34:44 -0800
 Kok, Auke [EMAIL PROTECTED] wrote:
 
 Stephen Hemminger wrote:
 On Tue, 27 Nov 2007 19:52:24 +0100
 Robert Olsson [EMAIL PROTECTED] wrote:

 Hello!

 I've discovered a bug while testing the new multiQ NAPI code. In hi-load 
 situations when we take down an interface we get a kernel panic. The
 oops is below.

 From what I see this happens when driver does napi_disable() and clears
 NAPI_STATE_SCHED. In net_rx_action there is a check for work == weight 
 a sort indirect test but that's now not enough to cover the load 
 situation. 
 where we have NAPI_STATE_SCHED cleared by e1000_down in my case and still 
 full quota. Latest git but I'll guess the is the same in all later kernels.
 There might be different solutions... one variant is below:
 It is considered a driver bug in 2.6.24 to call netif_rx_complete (clear 
 NAPI_STATE_SCHED)
 and do a full quota. That bug already had to be fixed in other drivers,
 look like e1000 has same problem.
 Stephen,

 please enlighten me, can you e.g. show me a commit of other drivers where you
 fixed this up?

 Thanks,

 Auke
 
 Author: David S. Miller [EMAIL PROTECTED]  2007-10-11 18:08:29
 Committer: David S. Miller [EMAIL PROTECTED]  2007-10-11 18:08:29
 Parent: b9f2c0440d806e01968c3ed4def930a43be248ad ([netdrvr] Stop using legacy 
 hooks -self_test_count, -get_stats_count)
 Child:  266918303226cceac7eca38ced30f15f277bd89c ([SKY2]: status polling loop 
 (post merge))
 Branches: master, origin
 Follows: v2.6.23
 Precedes: v2.6.24-rc1
 
 [NET]: Fix NAPI completion handling in some drivers.
 
 In order for the list handling in net_rx_action() to be
 correct, drivers must follow certain rules as stated by
 this comment in net_rx_action():
 
   /* Drivers must not modify the NAPI state if they
* consume the entire weight.  In such cases this code
* still owns the NAPI instance and therefore can
* move the instance around on the list at-will.
*/
 
 A few drivers do not do this because they mix the budget checks
 with reading hardware state, resulting in crashes like the one
 reported by [EMAIL PROTECTED]
 
 BNX2 and TG3 are taken care of here, SKY2 fix is from Stephen
 Hemminger.
 
 Signed-off-by: David S. Miller [EMAIL PROTECTED]

OK, I'm not sure what went wrong there with e1000, but I'll send a patch in a 
second.

Robert, please give that patch a try (it fixes a crash that I had here as well)
and let us know if it works for you.

Auke
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html