I'm quite new to OSol, and this is my first build on real hardware. It's a Tyan
S5502 (Intel 3420 chipset, socket 1156 Xeon support, etc.).
It has three on-board e1000 Intel Gigabit adapters.
At times the system stops responding to packets on the middle Ethernet adapter
(the only one plugged in at present; it's where the iKVM lives). The system
hasn't crashed, but the network adapter won't reply to pings or ARP requests
from outside. For all intents and purposes it is as if I had pulled the network
cable, except that the iKVM is still functional.
Each time it has happened during a period of heavy network load (maybe 100MB/s
incoming to the system, ACKs outgoing).
fmadm faulty has presented this fault twice (although we've had the "crash"
maybe four times):
-----------------------------------------------------------------------------------------------------------
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
Dec 13 19:31:39 61087804-5bff-ca8a-caf2-df023a1fb43b PCIEX-8000-J5  Major

Fault class : fault.io.pciex.device-interr-corr
Affects     : dev:////p...@0,0/pci8086,3...@1c,1/pci10f1,5...@0
                  faulted but still in service
FRU         : "MB"
              (hc://:product-id=empty:chassis-id=empty:server-id=hostname/motherboard=0)
                  faulty
Description : Too many recovered internal errors have been detected within the
              specified PCIEX device. This may degrade into a non-recoverable
              fault.
              Refer to http://sun.com/msg/PCIEX-8000-J5 for more information.
Response    : One or more device instances may be disabled
Impact      : Loss of services provided by the device instances associated with
              this fault
Action      : Schedule a repair procedure to replace the affected device. Use
              fmadm faulty to identify the device or contact Sun for support.
-----------------------------------------------------------------------------------------------------------
prtconf -v | grep 3b44
pci8086,3b44, instance #5
value='pciex8086,3b44.5' + 'pciex8086,3b44' +
'pciexclass,060400' + 'pciexclass,0604' + 'pci8086,3b44.5' + 'pci8086,3b44' +
'pciclass,060400' + 'pciclass,0604'
value=00003b44
dev_path=/p...@0,0/pci8086,3...@1c,1:devctl
dev_path=/p...@0,0/pci8086,3...@1c,1:pcie0
value='/p...@0,0/pci8086,3...@1c,1/pci10f1,5...@0'
dev_path=/p...@0,0/pci8086,3...@1c,1/pci10f1,5...@0:e1000g1
-----------------------------------------------------------------------------------------------------------
So it looks like the network card is having problems.
I had read about something like this affecting a Realtek network card, and
the solution there was to buy an Intel card. So much for that ;)
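
I matched the fault's device path to e1000g1 by eye in the prtconf output
above. In case it is useful, here is a minimal Python sketch of the same
lookup. The function names are my own, and the handling of the "dev:///"
prefix is an assumption based only on the single "Affects" line shown above.

```python
def device_path_from_affects(affects):
    """Drop the leading 'dev:///' seen in the fmadm 'Affects' line (assumed format)."""
    prefix = "dev:///"
    return affects[len(prefix):] if affects.startswith(prefix) else affects


def minor_nodes_for(prtconf_text, device_path):
    """Return minor-node names whose dev_path equals device_path exactly.

    Only looks at 'dev_path=<path>:<minor>' lines, as printed by prtconf -v.
    """
    nodes = []
    for line in prtconf_text.splitlines():
        line = line.strip()
        if not line.startswith("dev_path="):
            continue
        # Split on the last ':' so that the path itself is left intact.
        path, sep, minor = line[len("dev_path="):].rpartition(":")
        if sep and path == device_path:
            nodes.append(minor)
    return nodes
```

Fed the "Affects" value and the saved prtconf -v text from above, this should
return the e1000g1 node and skip the devctl and pcie0 nodes of the parent
bridge.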
An attempt to manually unplumb the network card simply hung, with no
response. An attempt to "cfgadm -f -c unconfigure pcie_pci5.pcie0" caused a
kernel panic, as below:
-----------------------------------------------------------------------------------------------------------
Dec 14 05:10:26 hostname in.ndpd[584]: [ID 169330 daemon.error] Interface
e1000g1 has been removed from kernel. in.ndpd will no longer use it
Dec 14 05:10:27 hostname genunix: [ID 408114 kern.info]
/p...@0,0/pci8086,3...@1c,1/pci10f1,5...@0 (e1000g1) online
Dec 14 05:10:27 hostname unix: [ID 836849 kern.notice]
Dec 14 05:10:27 hostname ^Mpanic[cpu1]/thread=ffffff00104e0c60:
Dec 14 05:10:27 hostname genunix: [ID 403854 kern.notice] assertion failed:
avl_find(&ips->ips_avl_by_name, (void *)name, &where) == NULL, file:
../../common/inet/ipnet/ipnet.c, line: 1199
Dec 14 05:10:27 hostname unix: [ID 100000 kern.notice]
Dec 14 05:10:27 hostname genunix: [ID 655072 kern.notice] ffffff00104e0ab0
genunix:assfail+7e ()
Dec 14 05:10:27 hostname genunix: [ID 655072 kern.notice] ffffff00104e0b10
ipnet:ipnet_create_if+1d1 ()
Dec 14 05:10:27 hostname genunix: [ID 655072 kern.notice] ffffff00104e0b80
ipnet:ipnet_plumb_ev+56 ()
Dec 14 05:10:27 hostname genunix: [ID 655072 kern.notice] ffffff00104e0bc0
ipnet:ipnet_nicevent_task+6b ()
Dec 14 05:10:27 hostname genunix: [ID 655072 kern.notice] ffffff00104e0c40
genunix:taskq_thread+193 ()
Dec 14 05:10:27 hostname genunix: [ID 655072 kern.notice] ffffff00104e0c50
unix:thread_start+8 ()
Dec 14 05:10:27 hostname unix: [ID 100000 kern.notice]
Dec 14 05:10:27 hostname genunix: [ID 672855 kern.notice] syncing file
systems...
Dec 14 05:10:27 hostname genunix: [ID 904073 kern.notice] done
Dec 14 05:10:28 hostname genunix: [ID 111219 kern.notice] dumping to
/dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
Dec 14 05:10:28 hostname ahci: [ID 405573 kern.info] NOTICE: ahci0:
ahci_tran_reset_dport port 0 reset port
Dec 14 05:11:07 hostname genunix: [ID 409368 kern.notice] ^M100% done: 460928
pages dumped, compression ratio 2.15,
Dec 14 05:11:07 hostname genunix: [ID 851671 kern.notice] dump succeeded
-----------------------------------------------------------------------------------------------------------
For all I know, maybe that is the normal response to trying to pull an "active"
network card.
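
Since my mail client wrapped the log above, here is a rough Python sketch for
pulling the stack frames back out of it. The regular expression is my own
guess, based only on the frame lines shown in this message (a 16-digit hex
frame address, then module:function+offset, then "()"); it is not a general
panic parser.

```python
import re

# Assumed frame shape, taken from the log in this message:
#   <16 hex digits> <module>:<function>+<hex offset> ()
FRAME = re.compile(r"\b[0-9a-f]{16}\s+([\w.]+):([\w.]+)\+([0-9a-f]+)\s+\(\)")


def panic_frames(log_text):
    """Return 'module:function+offset' strings in order of appearance."""
    # Collapse all whitespace so frames split across wrapped lines rejoin.
    joined = " ".join(log_text.split())
    return [
        "%s:%s+%s" % (m.group(1), m.group(2), m.group(3))
        for m in FRAME.finditer(joined)
    ]
```

Run over the panic text above, this should list the frames from
genunix:assfail down to unix:thread_start, which is the part most likely to
matter in a bug report.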
I'm only running this "old" version because every newer version (snv_117
through snv_128a) has a broken Dell PERC5/i driver, or something else that
causes huge problems for that card.
If anyone can help me with this it'd be great. If the text dumps provided
aren't what you need, let me know.
Thanks
--
This message posted from opensolaris.org