[E1000-devel] e1000_close() and concurrent reset

Tim Pepper Wed, 15 Feb 2012 13:31:40 -0800

I've got some systems whose NICs periodically go up and down, plus I
believe there are occasionally some "interesting" external network issues
triggering NIC resets.  These machines are running the 1.6.3 Intel driver
on a 2.6.32-based kernel.  I have two sets of crashes that go:


Feb  9 22:50:41 kernel: e1000e 0000:0b:00.0: eth1: Reset adapter
    ...
Feb  9 22:50:41 kernel: WARNING: at extra_drivers/open/e1000e_1_6_3/netdev.c:4676 e1000_close+0x162/0x170 [e1000e_1_6_3]()
    ...
Feb  9 22:50:41 kernel: BUG: unable to handle kernel NULL pointer dereference at 00000004
Feb  9 22:50:41 kernel: IP: [<f8747e55>] e1000_put_txbuf+0x15/0x90 [e1000e_1_6_3]
    ...
Feb  9 22:50:45 kernel: kernel BUG at drivers/pci/msi.c:284!
Feb  9 22:50:45 kernel: invalid opcode: 0000 [#2] SMP


Feb 14 13:50:06 kernel: e1000e 0000:15:00.0: eth0: Reset adapter
    ...
Feb 14 13:50:06 kernel: WARNING: at extra_drivers/open/e1000e_1_6_3/netdev.c:4676 e1000_close+0x162/0x170 [e1000e_1_6_3]()
    ...
Feb 14 13:50:06 kernel: BUG: unable to handle kernel NULL pointer dereference at 00000008
Feb 14 13:50:06 kernel: IP: [<f866f6a8>] e1000_alloc_rx_buffers+0x98/0x270 [e1000e_1_6_3]
    ...
Feb 14 13:50:07 kernel: kernel BUG at drivers/pci/msi.c:284!
Feb 14 13:50:07 kernel: invalid opcode: 0000 [#2] SMP


A very similar bug report is here: 
http://lists.openwall.net/netdev/2011/11/14/127
and notes two issues:
   1) The napi_enable() and napi_disable() should only be called in the
      e1000_open and e1000_close functions respectively
   2) There is no synchronization preventing a call to the driver close
      while executing error processing.

This led to upstream kernel commit
5f4a780ddd453c4918555fed9d9c5f2d455a087d with respect to #1 about a month
after 1.6.3 came out.  I don't see the fix for #1 in driver 1.9.5, though,
which came out a few weeks after the upstream commit.  Is this fix going
to be available in an Intel driver update in the future?

We don't explicitly set CONFIG_E1000E_NAPI in our build, but it looks
like src/kcompat.h probably automagically sets it since we haven't
defined E1000E_NO_NAPI.  So we likely hit issue #1.

But what about #2?  It seems like something would still be needed to
address that and, given a reading of the code paths involved with the
above kernel warnings/bugs, that concurrency issue seems to be just what
we're hitting.  Does Intel have a fix in the works for that portion?
Any patches we might be able to test?
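
To make the concern in #2 concrete, here is a minimal userspace sketch of
the kind of guard that seems to be missing.  It is not driver code: the
struct and the fake_open(), fake_close() and fake_reset_task() names are
hypothetical stand-ins, and a pthread mutex stands in for whatever kernel
primitive a real fix would use.  The idea is only that close marks the
adapter down and frees the rings under a lock, and the reset path takes
the same lock and checks that flag before touching the rings.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-adapter state; NOT the real e1000e struct. */
struct fake_adapter {
    pthread_mutex_t lock;  /* serializes close against reset (assumption) */
    bool down;             /* set once close has begun tearing down */
    int *tx_ring;          /* NULL after close frees it */
    size_t ring_len;
    int resets_done;       /* how many resets actually touched the ring */
};

/* Allocate the ring and mark the adapter up. Returns 0 on success. */
int fake_open(struct fake_adapter *a, size_t ring_len)
{
    assert(ring_len > 0);
    pthread_mutex_init(&a->lock, NULL);
    a->down = false;
    a->resets_done = 0;
    a->ring_len = ring_len;
    a->tx_ring = calloc(ring_len, sizeof(*a->tx_ring));
    return a->tx_ring ? 0 : -1;
}

/* Close: set the down flag and free the ring while holding the lock. */
void fake_close(struct fake_adapter *a)
{
    pthread_mutex_lock(&a->lock);
    a->down = true;
    free(a->tx_ring);
    a->tx_ring = NULL;
    a->ring_len = 0;
    pthread_mutex_unlock(&a->lock);
}

/*
 * Deferred reset: take the same lock and skip the work if close already
 * ran. Without this check, the loop below would walk a freed (NULL) ring.
 * Returns true if the reset ran, false if it was skipped.
 */
bool fake_reset_task(struct fake_adapter *a)
{
    bool ran = false;

    pthread_mutex_lock(&a->lock);
    if (!a->down) {
        for (size_t i = 0; i < a->ring_len; i++)
            a->tx_ring[i] = 0;  /* "clean" each ring entry */
        a->resets_done++;
        ran = true;
    }
    pthread_mutex_unlock(&a->lock);
    return ran;
}
```

With this shape, a reset that is scheduled before close but runs after it
becomes a no-op instead of dereferencing a freed ring, which is the kind
of NULL dereference the traces above appear to show.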


-- 
Tim Pepper  <[email protected]>
IBM Linux Technology Center


