[networking-discuss] e1000 networking dies on snv_111b

taemun Sun, 13 Dec 2009 10:43:46 -0800

I'm quite new to OSol, this is my first build on real hardware. It's a Tyan 
S5502 (Intel 3420 chipset, 1156 socket Xeon support, etc).


It has three on-board e1000 Intel Gigabit adapters.

We've had times when the system stops responding to packets on the middle 
ethernet adapter (it's the only one plugged in at present, and it's where the 
iKVM lives). The system hasn't crashed, but the network adapter won't reply to 
pings or ARP requests from outside. To all intents and purposes it's as if I had 
pulled the network cable, except that the iKVM is still functional.

Each time it has happened during a period of heavy network load (maybe 100MB/s 
incoming to the system, ACKs outgoing).

fmadm faulty has presented this fault twice (although we've had the "crash" 
maybe four times):
-----------------------------------------------------------------------------------------------------------
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Dec 13 19:31:39 61087804-5bff-ca8a-caf2-df023a1fb43b  PCIEX-8000-J5  Major

Fault class : fault.io.pciex.device-interr-corr
Affects     : dev:////p...@0,0/pci8086,3...@1c,1/pci10f1,5...@0
                  faulted but still in service
FRU         : "MB" (hc://:product-id=empty:chassis-id=empty:server-id=hostname/motherboard=0)
                  faulty

Description : Too many recovered internal errors have been detected within the
              specified PCIEX device. This may degrade into a non-recoverable
              fault.
              Refer to http://sun.com/msg/PCIEX-8000-J5 for more information.

Response    : One or more device instances may be disabled

Impact      : Loss of services provided by the device instances associated with
              this fault

Action      : Schedule a repair procedure to replace the affected device. Use
              fmadm faulty to identify the device or contact Sun for support.
-----------------------------------------------------------------------------------------------------------
prtconf -v | grep 3b44
        pci8086,3b44, instance #5
                    value='pciex8086,3b44.5' + 'pciex8086,3b44' + 'pciexclass,060400' + 'pciexclass,0604' + 'pci8086,3b44.5' + 'pci8086,3b44' + 'pciclass,060400' + 'pciclass,0604'
                    value=00003b44
                    dev_path=/p...@0,0/pci8086,3...@1c,1:devctl
                    dev_path=/p...@0,0/pci8086,3...@1c,1:pcie0
                            value='/p...@0,0/pci8086,3...@1c,1/pci10f1,5...@0'
                        dev_path=/p...@0,0/pci8086,3...@1c,1/pci10f1,5...@0:e1000g1
-----------------------------------------------------------------------------------------------------------

So it looks like the network card is having problems.
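
For anyone wanting to repeat that lookup, here is a minimal Python sketch of the matching I did by eye: take the device path from the "Affects" line of the fmadm faulty output and find the dev_path entry in prtconf -v that ends in an interface name. The function names are my own invention for illustration, and the sketch assumes the output has the same shape as the dumps above (one-line "Affects" field, unwrapped dev_path lines).

```python
import re


def fault_device_path(fmadm_text):
    """Return the device path from the 'Affects' line of `fmadm faulty` output.

    Assumes the line looks like 'Affects     : dev:///<path>' as in the dump above.
    """
    match = re.search(r"^Affects\s*:\s*dev:///(/\S+)", fmadm_text, re.MULTILINE)
    return match.group(1) if match else None


def minor_nodes(prtconf_text, device_path):
    """List the minor-node names (e.g. e1000g1) attached to exactly device_path.

    Assumes each dev_path entry of `prtconf -v` sits on one line as
    'dev_path=<path>:<minor>'.
    """
    entries = re.findall(r"dev_path=(\S+):(\S+)", prtconf_text)
    return [minor for path, minor in entries if path == device_path]


if __name__ == "__main__":
    # Lines copied from the output above (paths are truncated in the archive).
    fmadm = "Affects     : dev:////p...@0,0/pci8086,3...@1c,1/pci10f1,5...@0\n"
    prtconf = (
        "dev_path=/p...@0,0/pci8086,3...@1c,1:devctl\n"
        "dev_path=/p...@0,0/pci8086,3...@1c,1/pci10f1,5...@0:e1000g1\n"
    )
    print(minor_nodes(prtconf, fault_device_path(fmadm)))  # ['e1000g1']
```

On a live system the two strings would come from the saved output of fmadm faulty and prtconf -v rather than from literals.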

I had read about something like this affecting a Realtek network card, and 
the solution there was to buy an Intel card. So much for that ;)

An attempt to manually unplumb the network card just hung, with no response. 
An attempt to "cfgadm -f -c unconfigure pcie_pci5.pcie0" caused a kernel 
panic, as below:
-----------------------------------------------------------------------------------------------------------
Dec 14 05:10:26 hostname in.ndpd[584]: [ID 169330 daemon.error] Interface 
e1000g1 has been removed from kernel. in.ndpd will no longer use it
Dec 14 05:10:27 hostname  genunix: [ID 408114 kern.info] 
/p...@0,0/pci8086,3...@1c,1/pci10f1,5...@0 (e1000g1) online
Dec 14 05:10:27 hostname  unix: [ID 836849 kern.notice] 
Dec 14 05:10:27 hostname  ^Mpanic[cpu1]/thread=ffffff00104e0c60: 
Dec 14 05:10:27 hostname  genunix: [ID 403854 kern.notice] assertion failed: 
avl_find(&ips->ips_avl_by_name, (void *)name, &where) == NULL, file: 
../../common/inet/ipnet/ipnet.c, line: 1199
Dec 14 05:10:27 hostname  unix: [ID 100000 kern.notice] 
Dec 14 05:10:27 hostname  genunix: [ID 655072 kern.notice] ffffff00104e0ab0 
genunix:assfail+7e ()
Dec 14 05:10:27 hostname  genunix: [ID 655072 kern.notice] ffffff00104e0b10 
ipnet:ipnet_create_if+1d1 ()
Dec 14 05:10:27 hostname  genunix: [ID 655072 kern.notice] ffffff00104e0b80 
ipnet:ipnet_plumb_ev+56 ()
Dec 14 05:10:27 hostname  genunix: [ID 655072 kern.notice] ffffff00104e0bc0 
ipnet:ipnet_nicevent_task+6b ()
Dec 14 05:10:27 hostname  genunix: [ID 655072 kern.notice] ffffff00104e0c40 
genunix:taskq_thread+193 ()
Dec 14 05:10:27 hostname  genunix: [ID 655072 kern.notice] ffffff00104e0c50 
unix:thread_start+8 ()
Dec 14 05:10:27 hostname  unix: [ID 100000 kern.notice] 
Dec 14 05:10:27 hostname  genunix: [ID 672855 kern.notice] syncing file 
systems...
Dec 14 05:10:27 hostname  genunix: [ID 904073 kern.notice]  done
Dec 14 05:10:28 hostname  genunix: [ID 111219 kern.notice] dumping to 
/dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
Dec 14 05:10:28 hostname  ahci: [ID 405573 kern.info] NOTICE: ahci0: 
ahci_tran_reset_dport port 0 reset port
Dec 14 05:11:07 hostname  genunix: [ID 409368 kern.notice] ^M100% done: 460928 
pages dumped, compression ratio 2.15, 
Dec 14 05:11:07 hostname  genunix: [ID 851671 kern.notice] dump succeeded
-----------------------------------------------------------------------------------------------------------

For all I know, maybe that is the normal response to trying to pull an "active" 
network card.
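
Since the mail wrapping makes the stack above hard to read, here is a rough Python sketch that pulls the module:function+offset frames out of a pasted panic log. The regular expression and the function name are my own assumptions based on the frame format in this one log, not a general parser for Solaris panic messages.

```python
import re

# Frames in the log above look like 'ipnet:ipnet_create_if+1d1 ()'.
FRAME = re.compile(r"\b([a-z_]\w*):(\w+)\+([0-9a-f]+) \(\)")


def panic_frames(log_text):
    """Return (module, function, offset) tuples in the order they appear."""
    return [
        (module, function, int(offset, 16))
        for module, function, offset in FRAME.findall(log_text)
    ]


if __name__ == "__main__":
    # Two frames copied from the panic output above, wrapping included.
    sample = (
        "genunix: [ID 655072 kern.notice] ffffff00104e0ab0 \n"
        "genunix:assfail+7e ()\n"
        "genunix: [ID 655072 kern.notice] ffffff00104e0b10 \n"
        "ipnet:ipnet_create_if+1d1 ()\n"
    )
    for module, function, offset in panic_frames(sample):
        print(f"{module}`{function}+{offset:#x}")
```

Run over the full log, this gives the frames from assfail down to thread_start in the same order as the panic message.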

I'm only running this "old" version because every newer version (snv_117 
through snv_128a) has a broken Dell PERC5/i driver, or something else that 
causes huge problems for that card.

If anyone can help me with this it'd be great. If the text dumps provided 
aren't what you need, let me know.

Thanks
-- 
This message posted from opensolaris.org
_______________________________________________
networking-discuss mailing list
[email protected]
