I've got some systems whose NICs periodically go up and down, and I
believe occasional "interesting" external network issues are also
triggering NIC resets. These machines are running the 1.6.3 Intel e1000e
driver on a 2.6.32-based kernel. I have two sets of crashes that go:

Feb 9 22:50:41 kernel: e1000e 0000:0b:00.0: eth1: Reset adapter
...
Feb 9 22:50:41 kernel: WARNING: at
extra_drivers/open/e1000e_1_6_3/netdev.c:4676 e1000_close+0x162/0x170
[e1000e_1_6_3]()
...
Feb 9 22:50:41 kernel: BUG: unable to handle kernel NULL pointer dereference
at 00000004
Feb 9 22:50:41 kernel: IP: [<f8747e55>] e1000_put_txbuf+0x15/0x90
[e1000e_1_6_3]
...
Feb 9 22:50:45 kernel: kernel BUG at drivers/pci/msi.c:284!
Feb 9 22:50:45 kernel: invalid opcode: 0000 [#2] SMP

Feb 14 13:50:06 kernel: e1000e 0000:15:00.0: eth0: Reset adapter
...
Feb 14 13:50:06 kernel: WARNING: at
extra_drivers/open/e1000e_1_6_3/netdev.c:4676 e1000_close+0x162/0x170
[e1000e_1_6_3]()
...
Feb 14 13:50:06 kernel: BUG: unable to handle kernel NULL pointer dereference
at 00000008
Feb 14 13:50:06 kernel: IP: [<f866f6a8>] e1000_alloc_rx_buffers+0x98/0x270
[e1000e_1_6_3]
...
Feb 14 13:50:07 kernel: kernel BUG at drivers/pci/msi.c:284!
Feb 14 13:50:07 kernel: invalid opcode: 0000 [#2] SMP

A very similar bug report is here:
http://lists.openwall.net/netdev/2011/11/14/127
and notes two issues:
1) napi_enable() and napi_disable() should only be called in
e1000_open() and e1000_close(), respectively.
2) There is no synchronization preventing a call to the driver's close
routine while error processing is executing.
This led to upstream kernel commit
5f4a780ddd453c4918555fed9d9c5f2d455a087d, which addresses #1, about a
month after 1.6.3 came out. However, I don't see the fix for #1 in driver
1.9.5, which came out a few weeks after the upstream commit. Is this fix
going to be available in a future Intel driver update?
We don't explicitly set CONFIG_E1000E_NAPI in our build, but it looks
like src/kcompat.h probably automagically sets it since we haven't
defined E1000E_NO_NAPI. So we likely hit issue #1.
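
To illustrate what I mean by "automagically", here is a minimal sketch of
the kind of preprocessor default I suspect is in play. This is an
assumption for illustration, not a copy of src/kcompat.h; only the two
macro names come from the driver, and napi_path_enabled() is a
hypothetical helper that does not exist in e1000e.

```c
/*
 * Assumed pattern, not the actual kcompat.h contents: the NAPI option
 * defaults to "on" unless the build explicitly opts out by defining
 * E1000E_NO_NAPI.
 */
#ifndef E1000E_NO_NAPI
#ifndef CONFIG_E1000E_NAPI
#define CONFIG_E1000E_NAPI
#endif
#endif

/* Hypothetical helper: reports which path the preprocessor selected. */
static inline int napi_path_enabled(void)
{
#ifdef CONFIG_E1000E_NAPI
	return 1;	/* NAPI code paths are compiled in */
#else
	return 0;	/* build opted out via E1000E_NO_NAPI */
#endif
}
```

With neither macro defined on the compiler command line, as in our
build, a header following this pattern would leave the NAPI paths
compiled in.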
But what about #2? It seems like something would still be needed to
address that, and from reading the code paths involved in the above
kernel warnings/bugs, that concurrency issue seems to be just what
we're hitting. Does Intel have a fix in the works for that portion?
Any patches we might be able to test?
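
For discussion, the kind of synchronization I have in mind for #2 can be
sketched in userspace as below. This is only a model under stated
assumptions: struct fake_adapter, fake_open(), fake_reset_task() and
fake_close() are hypothetical names, a pthread mutex stands in for
whatever locking the driver would actually use, and none of this is
e1000e code or a proposed patch. The idea is simply that close and the
reset/error path take the same lock, and the reset path checks a "down"
flag before touching rings that close may already have freed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the adapter state; not an e1000e structure. */
struct fake_adapter {
	pthread_mutex_t lock;	/* serializes close against the reset path */
	bool down;		/* set once close has torn the rings down */
	int *tx_ring;		/* freed by close, NULL afterwards */
	size_t ring_len;
	int resets_done;	/* resets that actually ran */
	int resets_skipped;	/* resets refused because the device was down */
};

/* Model of open: allocate the ring and mark the device up. */
int fake_open(struct fake_adapter *a, size_t ring_len)
{
	if (pthread_mutex_init(&a->lock, NULL) != 0)
		return -1;
	a->tx_ring = calloc(ring_len, sizeof(*a->tx_ring));
	if (!a->tx_ring) {
		pthread_mutex_destroy(&a->lock);
		return -1;
	}
	a->ring_len = ring_len;
	a->down = false;
	a->resets_done = 0;
	a->resets_skipped = 0;
	return 0;
}

/* Model of the error/reset path: only touch the ring while it exists. */
void fake_reset_task(struct fake_adapter *a)
{
	pthread_mutex_lock(&a->lock);
	if (a->down) {
		/* Close won the race; the ring is gone, so do nothing. */
		a->resets_skipped++;
	} else {
		assert(a->tx_ring != NULL);
		for (size_t i = 0; i < a->ring_len; i++)
			a->tx_ring[i] = 0;	/* "clean" the ring */
		a->resets_done++;
	}
	pthread_mutex_unlock(&a->lock);
}

/*
 * Model of close: set the flag and free the ring under the same lock.
 * The mutex is deliberately left initialized so that a late reset
 * request can still take it and see the flag.
 */
void fake_close(struct fake_adapter *a)
{
	pthread_mutex_lock(&a->lock);
	if (!a->down) {
		a->down = true;
		free(a->tx_ring);
		a->tx_ring = NULL;
		a->ring_len = 0;
	}
	pthread_mutex_unlock(&a->lock);
}
```

In this model a reset that arrives after close is skipped instead of
dereferencing a freed ring, which is the sort of NULL dereference the
traces above show in e1000_put_txbuf and e1000_alloc_rx_buffers. A real
fix would also have to deal with reset work that is already queued or
running when close starts.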
--
Tim Pepper <[email protected]>
IBM Linux Technology Center
_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel